Domain 3: Modeling (36% of the exam content)
Topics
Task 3.1: Frame business problems as ML problems
Determine when to use and when not to use ML.
Know the difference between supervised and unsupervised learning.
Select from among classification, regression, forecasting, clustering, recommendation, and foundation models.
Task 3.2: Select the appropriate model(s) for a given ML problem
XGBoost, logistic regression, k-means, linear regression, decision trees, random forests, RNN, CNN, ensemble, transfer learning, and large language models (LLMs)
Express the intuition behind models.
Task 3.3: Train ML models
Split data between training and validation (for example, cross-validation).
Understand optimization techniques for ML training (for example, gradient descent, loss functions, convergence).
Choose appropriate compute resources (for example, GPU or CPU, distributed or non-distributed).
Choose appropriate compute platforms (Spark or non-Spark).
Update and retrain models.
Batch or real-time/online
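The optimization topics in Task 3.3 (gradient descent, loss functions, convergence) can be illustrated with a minimal sketch. It assumes a one-feature linear model and a mean squared error loss; the function names, the learning rate, and the stopping tolerance are hypothetical choices made for illustration and are not part of the exam guide.

```python
def mse_loss(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b on the data."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


def gradient_descent(xs, ys, learning_rate=0.05, max_steps=10_000, tol=1e-12):
    """Fit y = w*x + b by batch gradient descent on the MSE loss.

    Stops early once the loss changes by less than `tol` between steps,
    which is one simple (illustrative) convergence criterion.
    Returns the fitted w, b, and the number of steps taken.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    steps_taken = 0
    previous = mse_loss(w, b, xs, ys)
    for step in range(1, max_steps + 1):
        steps_taken = step
        # Partial derivatives of the MSE with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        current = mse_loss(w, b, xs, ys)
        if abs(previous - current) < tol:
            break
        previous = current
    return w, b, steps_taken
```

For data generated from y = 2x + 1, such as xs = [0, 1, 2, 3, 4] and ys = [1, 3, 5, 7, 9], the fit should approach w = 2 and b = 1. A learning rate that is too large makes the loss diverge instead of converge, which is why it is treated as a hyperparameter.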
Task 3.4: Perform hyperparameter optimization
Perform regularization.
Dropout
L1/L2
Perform cross-validation.
Initialize models.
Understand neural network architecture (layers and nodes), learning rate, and activation functions.
Understand tree-based models (number of trees, number of levels).
Understand linear models (learning rate).
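Two of the Task 3.4 topics, L2 regularization and cross-validation, can be combined in a small sketch of hyperparameter selection. It assumes a one-feature linear model without an intercept, for which the L2-penalized least-squares weight has a closed form. All function names and the candidate penalty grid are hypothetical and serve only as illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


def fit_ridge_1d(xs, ys, l2_penalty):
    """Closed-form weight for y ~ w*x (no intercept) with an L2 penalty on w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2_penalty)


def cross_validated_mse(xs, ys, l2_penalty, k=3):
    """Average validation MSE over k folds for one penalty value."""
    fold_errors = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        w = fit_ridge_1d([xs[i] for i in train_idx], [ys[i] for i in train_idx], l2_penalty)
        errors = [(w * xs[i] - ys[i]) ** 2 for i in val_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


def select_l2_penalty(xs, ys, candidates, k=3):
    """Grid search: return the candidate with the lowest cross-validated MSE."""
    return min(candidates, key=lambda penalty: cross_validated_mse(xs, ys, penalty, k))
```

A call such as select_l2_penalty(xs, ys, [0.0, 1.0, 10.0]) scores each candidate only on data held out from fitting. Contiguous folds are used here for simplicity; in practice the data is usually shuffled first unless it is ordered in time.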
Task 3.5: Evaluate ML models
Avoid overfitting or underfitting.
Detect and handle bias and variance.
Evaluate metrics (for example, area under the receiver operating characteristic curve [AUC-ROC], accuracy, precision, recall, root mean square error [RMSE], F1 score).
Interpret confusion matrices.
Perform offline and online model evaluation (A/B testing).
Compare models by using metrics (for example, time to train a model, quality of model, engineering costs).
Perform cross-validation.
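Several of the Task 3.5 metrics follow directly from the four cells of a binary confusion matrix. The sketch below shows the standard definitions of accuracy, precision, recall, and F1 score, plus RMSE for regression. The function names are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention, not a universal rule.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def rmse(y_true, y_pred):
    """Root mean square error for regression predictions."""
    n = len(y_true)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / n)
```

For example, with 3 true positives, 1 false positive, 1 false negative, and 5 true negatives, accuracy is 0.8 while precision, recall, and F1 are each 0.75. On imbalanced classes accuracy can look high even when recall is poor, which is one reason to read the confusion matrix rather than a single number.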