Domain 2: Exploratory Data Analysis (24% of the exam content)
Topics
Task 2.1: Sanitize and prepare data for modeling
Identify and handle missing data, corrupt data, and stop words.
Format, normalize, augment, and scale data.
Determine whether there is sufficient labeled data.
Identify mitigation strategies for insufficient labeled data.
Use data labeling tools (for example, Amazon Mechanical Turk).
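As a study aid for Task 2.1, the following is a minimal sketch of handling missing values and scaling a numeric column. It uses only the Python standard library. The function names (impute_mean, min_max_scale, standardize) and the sample ages column are hypothetical illustrations, not part of the exam guide, and mean imputation is only one of several possible strategies.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Center to mean 0 and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # constant column: every z-score is 0
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


ages = [20.0, None, 40.0, 30.0]  # hypothetical column with one missing value
filled = impute_mean(ages)       # missing entry replaced by the observed mean
scaled = min_max_scale(filled)   # values mapped into [0, 1]
z_scores = standardize(filled)   # values expressed as z-scores
```

Min-max scaling preserves the shape of the distribution but is sensitive to outliers, while standardization is usually preferred when a model assumes roughly centered inputs.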
Task 2.2: Perform feature engineering
Identify and extract features from datasets, including from data sources such as text, speech, images, and public datasets.
Analyze and evaluate feature engineering concepts (for example, binning, tokenization, outliers, synthetic features, one-hot encoding, reducing dimensionality of data).
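Three of the feature engineering concepts named in Task 2.2 (tokenization, binning, and one-hot encoding) can be sketched as follows. This is an illustrative example under simplifying assumptions: the stop-word list is a tiny hypothetical sample, the tokenizer only keeps lowercase ASCII letters and digits, and the helper names are invented for this sketch.

```python
import re
from bisect import bisect_right

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "of", "and"}


def tokenize(text):
    """Lowercase the text, split it into alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def bin_index(value, edges):
    """Return the bin number for value, given sorted bin edges.

    With edges [18, 65]: values below 18 fall in bin 0,
    values from 18 up to (not including) 65 in bin 1, and 65 or more in bin 2.
    """
    return bisect_right(edges, value)


def one_hot(values):
    """Encode categorical values as 0/1 vectors, one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


tokens = tokenize("The price of the house is high")
age_bins = [bin_index(a, [18, 65]) for a in [12, 18, 40, 70]]
categories, encoded = one_hot(["red", "green", "red"])
```

One-hot encoding adds one column per distinct category, so high-cardinality features may call for dimensionality reduction or a different encoding.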
Task 2.3: Analyze and visualize data for ML
Create graphs (for example, scatter plots, time series, histograms, box plots).
Interpret descriptive statistics (for example, correlation, summary statistics, p-value).
Perform cluster analysis (for example, hierarchical, diagnosis, elbow plot, cluster size).
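To make two of the Task 2.3 ideas concrete, the sketch below computes a Pearson correlation coefficient and the within-cluster sum of squares (inertia) that an elbow plot would display for several values of k. It is a simplified, one-dimensional k-means with a deterministic initialization chosen for reproducibility, not a production clustering routine. The function names and the sample data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def kmeans_inertia(points, k, iterations=20):
    """Run a simple 1-D k-means and return the within-cluster sum of squares."""
    data = sorted(points)
    n = len(data)
    # Deterministic start: pick k evenly spaced points from the sorted data.
    centroids = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in data:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to its cluster mean; keep it if the cluster is empty.
        centroids = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return sum(min((p - c) ** 2 for c in centroids) for p in data)


r = pearson([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear relationship

# Hypothetical data with three visible groups near 1, 5, and 9.
samples = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
inertia = {k: kmeans_inertia(samples, k) for k in range(1, 5)}
```

Plotting inertia against k for this data would show a sharp drop up to k = 3 and little improvement afterward, which is the "elbow" used to choose the number of clusters.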