AWS makes it easier to build and populate data lakes

Amazon has gone live with AWS Lake Formation, a service that can cut data lake set-up from months to days.

Data lakes are large-scale collections of data used for analysis runs that uncover fresh information and inter-relationships, helping an organisation operate more efficiently.

AWS said building and filling data lakes can take months because an organisation has to provision and configure storage, and then copy data into the storage, typically from different sources and with different data types (schemas). A data catalogue must be set up and organised so that analytic runs can be made against data subsets.

Raju Gulabani, AWS VP for databases, analytics, and machine learning, said in a statement: “AWS hosts more data lakes than anyone else – with tens of thousands and growing every day. They [customers] have also told us that they want it to be easier and faster to set up and manage their data lakes.”

Filling the data lake

AWS Lake Formation automates storage provisioning and provides templates for data ingest. It can automatically inspect data elements to extract schemas and metadata, build a catalogue for search, and partition the data. The service can also transform the data into formats such as Apache Parquet and ORC that are well suited to analytics workloads.

Lake Formation can enforce access control and security policies, and provide a central point of management. Data can be selected for analysis by Amazon Redshift, Athena, and AWS Glue. Amazon EMR, QuickSight, and SageMaker will be supported in the next few months.

AWS Lake Formation is available today in US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland) with additional regions coming soon.