AWS Glue Best Practices: Building an Operationally Efficient Data Pipeline

Challenges in building a data pipeline

Building a well-architected, high-performing data pipeline requires upfront planning and design across multiple aspects of data storage: data structure, schema design, schema change handling, storage optimization, and rapid scaling to absorb unexpected increases in application data volume. This often calls for an ETL mechanism designed to orchestrate the transformation of data in multiple steps. You also need to ensure that the ingested data is validated for data quality and data loss, and monitored for job failures and data exceptions that are not handled in the ETL job design.

Here are some common challenges a data engineer typically faces:

  • Increase in data volume for processing

  • Changes in the structure of source data

  • Poor data quality

  • Poor data integrity in source data

  • Duplicate data

  • Timeliness of source data files

  • Lack of an available developer interface for testing
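Several of these challenges (schema changes, poor data quality, duplicates) can be caught with validation checks before data is loaded downstream. The following is a minimal sketch in plain Python, with no Glue dependency; the field names (`order_id`, `customer_id`, `amount`) and the `validate_batch` helper are illustrative assumptions, not part of any AWS API.

```python
# Illustrative pre-load checks for three of the challenges above:
# schema drift, missing values, and duplicate records.
# Field names and the expected schema are hypothetical.

EXPECTED_SCHEMA = {"order_id", "customer_id", "amount"}

def validate_batch(rows):
    """Split a list of dict records into (clean_rows, issues).

    issues is a list of (row_index, issue_type) tuples.
    """
    issues = []
    seen_ids = set()
    clean = []
    for i, row in enumerate(rows):
        # Schema drift: the source added or removed columns
        # since the job was designed
        if set(row) != EXPECTED_SCHEMA:
            issues.append((i, "schema_drift"))
            continue
        # Data quality: required fields must not be null or empty
        if any(row[f] in (None, "") for f in EXPECTED_SCHEMA):
            issues.append((i, "missing_value"))
            continue
        # Duplicate data: the same business key seen twice in this batch
        if row["order_id"] in seen_ids:
            issues.append((i, "duplicate"))
            continue
        seen_ids.add(row["order_id"])
        clean.append(row)
    return clean, issues
```

In a real pipeline the quarantined `issues` rows would typically be written to a dead-letter location and surfaced through job metrics or alarms, rather than silently dropped.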