Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility
AWS Whitepaper

Transforming Data Assets

One of the core values of a data lake is that it is the collection point and repository for all of an organization’s data assets, whatever their native formats. This enables quick ingestion, eliminates data duplication and data sprawl, and centralizes governance and management. After the data assets are collected, they need to be transformed into normalized formats so they can be used by a variety of data analytics and processing tools.

The key to “democratizing” the data and making the data lake available to the widest possible range of users, with varying skill sets and responsibilities, is to transform data assets into a format that allows for efficient ad hoc SQL querying. As discussed earlier, when a data lake is built on AWS, we recommend transforming log-based data assets into Parquet format. AWS provides multiple services to quickly and efficiently achieve this.

There are a multitude of ways to transform data assets, and the “best” way often comes down to individual preference, skill sets, and the tools available. When a data lake is built on AWS services, there is a wide variety of tools and services available for data transformation, so you can pick the methods and tools that you are most comfortable with. Since the data lake is inherently multi-tenant, multiple data transformation jobs using different tools can be run concurrently.

The two most common and straightforward methods to transform data assets into Parquet in an Amazon S3-based data lake use Amazon EMR clusters. The first method involves creating an EMR cluster with Hive installed, using the raw data assets in Amazon S3 as input, transforming those data assets into Hive tables, and then writing those Hive tables back out to Amazon S3 in Parquet format. The second, related method is to use Spark on Amazon EMR. With this method, a typical transformation can be achieved with only 20 lines of PySpark code.
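As a sketch of the Spark-on-EMR method, the PySpark job below reads raw JSON log assets from Amazon S3 and writes them back out in Parquet format. The bucket names, paths, and the event_date partition column are hypothetical placeholders, and the import is deferred into the function so the sketch can be read without a local Spark installation.

```python
def logs_to_parquet(raw_path="s3://example-datalake/raw/logs/",
                    parquet_path="s3://example-datalake/curated/logs/"):
    """Convert raw JSON log assets in S3 into Parquet.

    Intended to run on an EMR cluster (e.g. via spark-submit);
    all names and paths here are hypothetical placeholders.
    """
    # Imported inside the function so this sketch can be inspected
    # without a Spark installation.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("logs-to-parquet")
             .getOrCreate())

    # Read the raw, line-delimited JSON log assets from S3.
    logs = spark.read.json(raw_path)

    # Write back to S3 as Parquet, partitioned by a (hypothetical)
    # event_date column to support efficient ad hoc SQL querying.
    (logs.write
         .mode("overwrite")
         .partitionBy("event_date")
         .parquet(parquet_path))

    spark.stop()
```

With the Hive-based method, the equivalent step is a CREATE TABLE … STORED AS PARQUET statement populated by a SELECT over an external table defined on the raw S3 data.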

A third, simpler data transformation method on an Amazon S3-based data lake is to use AWS Glue. AWS Glue is a fully managed AWS extract, transform, and load (ETL) service that can be used directly with data stored in Amazon S3. AWS Glue simplifies and automates difficult and time-consuming data discovery, conversion, mapping, and job scheduling tasks. AWS Glue guides you through the process of transforming and moving your data assets with an easy-to-use console that helps you understand your data sources, transform and prepare these data assets for analytics, and load them reliably from Amazon S3 sources into Amazon S3 destinations.

AWS Glue automatically crawls raw data assets in your data lake’s S3 buckets, identifies data formats, and then suggests schemas and transformations so that you don’t have to spend time hand-coding data flows. You can then edit these transformations, if necessary, using the tools and technologies you already know, such as Python, Spark, Git, and your favorite integrated development environment (IDE), and then share them with other AWS Glue users of the data lake. AWS Glue’s flexible job scheduler can be set up to run data transformation flows on a recurring basis, in response to triggers, or even in response to AWS Lambda events.
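A Glue ETL script along these lines might look like the sketch below, which reads a table that a Glue crawler has already cataloged and writes it back to S3 as Parquet. The database, table, and path names are hypothetical, and the awsglue imports are deferred into the function because they are only available inside the Glue job environment.

```python
def run_glue_job():
    """Sketch of an AWS Glue ETL script that converts a crawled table
    to Parquet. Database, table, and path names are hypothetical.

    The awsglue library exists only in the Glue job environment, so
    its imports are deferred into this function.
    """
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read the raw assets through the catalog table the Glue crawler
    # created (hypothetical database and table names).
    raw = glue_context.create_dynamic_frame.from_catalog(
        database="datalake_db",
        table_name="raw_logs")

    # Write the assets back to S3 in Parquet format.
    glue_context.write_dynamic_frame.from_options(
        frame=raw,
        connection_type="s3",
        connection_options={"path": "s3://example-datalake/curated/logs/"},
        format="parquet")
```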

AWS Glue automatically and transparently provisions hardware resources and distributes ETL jobs on Apache Spark nodes so that ETL run times remain consistent as data volume grows. AWS Glue coordinates the execution of data lake jobs in the right sequence, and automatically retries failed jobs. With AWS Glue, there are no servers or clusters to manage, and you pay only for the resources consumed by your ETL jobs.
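The recurring schedules and retry behavior described above can also be configured programmatically through the AWS Glue API. The sketch below uses boto3 with hypothetical job and schedule names; the parameter-building helper is pure Python so it can be inspected without AWS credentials.

```python
def nightly_trigger_params(job_name, schedule="cron(0 3 * * ? *)"):
    """Build parameters for a scheduled AWS Glue trigger.

    Pure helper with hypothetical naming conventions, kept separate
    from the API call so it can be examined without AWS access.
    """
    return {
        "Name": f"{job_name}-nightly",
        "Type": "SCHEDULED",
        "Schedule": schedule,  # e.g. every day at 03:00 UTC
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def schedule_job(job_name="logs-to-parquet"):
    """Create the trigger via the Glue API.

    Requires AWS credentials; boto3 is imported here so the sketch
    can be read without it installed.
    """
    import boto3

    glue = boto3.client("glue")
    glue.create_trigger(**nightly_trigger_params(job_name))
```

Automatic retry of failed jobs is set when the job itself is created, via the MaxRetries parameter of the Glue create_job call.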