Data collection - Machine Learning Lens

Data collection

Important steps in the ML lifecycle are to identify the data needed, followed by the evaluation of the various means available for collecting that data to train your model.


      Figure 8 includes the key components of data collection phase. 
        These components include: labeling, ingesting as streaming or batch, and aggregating.

Figure 8: The main components of data collection

The main components of the data collection phase (shown in Figure 8) are as follows:

  • Label - Labeled data is a group of samples that have been tagged with one or more labels. If labels are missing, then some effort is required to label it (either manual or automated).

  • Ingest and aggregate - Data collection includes ingesting and aggregating data from multiple data sources.


      Figure 9 includes the data sources (like IoT, sensor, timeseries), data ingestion systems 
        (like data streaming and analytics steaming), and data technologies (like data pipelines, 
        data catalog, data lake, databases, and data warehouses).

Figure 9: Data sources, data ingestion, and data technologies

The sub-components of the ingest and aggregate component (shown in Figure 9) are as follows:

  • Data sources - Data sources include time-series, events, sensors, IoT devices, and social networks, depending on the nature of the use case. You can enrich your data sources by using the geospatial capability of Amazon SageMaker to access a range of geospatial data sources from AWS (for example, Amazon Location Service), open-source datasets (for example, Open Data on AWS), or your own proprietary data including from third-party providers (such as Planet Labs). To learn more about the geospatial capability in Amazon SageMaker, visit Geospatial ML with Amazon SageMaker (Preview).

  • Data ingestion- Data ingestion processes and technologies capture and store data on storage media. Data ingestion can occur in real-time using streaming technologies or historical mode using batch technologies.

  • Data technologies - Data storage technologies vary from transactional (SQL) databases, to data lakes and data warehouses. Extract, transform, and load (ETL) pipeline technology automates and orchestrates the data movement and transformations across cloud services and resources. A data lake technology enables storing and analyzing structured and unstructured data.