Configuration notes - Analytics Lens

  • Machine learning data often needs to be cleansed of missing fields, labeled, and have features engineered to reduce or combine extraneous columns. Additionally, ML training often requires only a subset of the available data, so a down-sampling method is often used to speed model training and thereby shorten cycles of learning and refinement.
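The cleansing and down-sampling steps above can be sketched with pandas; the column names and imputation strategy here are illustrative assumptions, not prescriptions:

```python
import pandas as pd

# Hypothetical raw dataset with missing fields and an extraneous column.
df = pd.DataFrame({
    "age": [25, None, 47, 33, 19, 52],
    "income": [40_000, 52_000, None, 61_000, 23_000, 75_000],
    "session_id": ["a1", "a2", "a3", "a4", "a5", "a6"],  # no predictive value
})

# Cleanse missing fields with simple median imputation.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# Feature engineering: drop an extraneous column.
features = df.drop(columns=["session_id"])

# Down-sample to a fraction of rows to shorten training cycles;
# a fixed random_state keeps the sample reproducible.
sample = features.sample(frac=0.5, random_state=42)
```

In practice the imputation and sampling strategy depend on the model and data distribution; stratified sampling may be preferable when classes are imbalanced.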

  • During model preparation, data consistency is important so that model evaluations can be compared against previous versions. A training dataset and a “hold-out” or validation dataset may be stored separately for model training and evaluation, respectively.
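A minimal sketch of a reproducible train/hold-out split with pandas; the fixed `random_state` is what keeps the split consistent across model versions so evaluations remain comparable:

```python
import pandas as pd

# Hypothetical labeled dataset of 100 rows.
df = pd.DataFrame({"x": range(100), "y": [i % 2 for i in range(100)]})

# Reserve 20% as a hold-out set; the fixed seed makes the split stable
# across runs, so every model version is evaluated on the same data.
holdout = df.sample(frac=0.2, random_state=7)
train = df.drop(holdout.index)
```

The two frames would then be stored separately (for example as distinct Parquet objects in S3), so training jobs never see the hold-out rows.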

  • For cost optimization of compute resources, select the right instance type for training and for serving inferences. For example, a GPU-optimized instance may speed up training a neural-network model, while a general-purpose instance is often sufficient for serving API inferences from the same model.
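As a sketch, the split between training and serving hardware might look like the following configuration; the specific instance types are examples only, not sizing recommendations:

```python
# Training on a GPU-optimized instance to shorten neural-network
# training time; a single instance is often enough for one job.
training_config = {
    "instance_type": "ml.p3.2xlarge",   # GPU-optimized
    "instance_count": 1,
}

# Serving inferences on cheaper general-purpose instances, scaled
# horizontally for request throughput rather than raw compute.
serving_config = {
    "instance_type": "ml.m5.large",     # general-purpose
    "instance_count": 2,
}
```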

  • Utilizing fan-out resources such as Amazon SageMaker training jobs or Spot Instances, multiple versions of the same model can be trained simultaneously with different tuning parameters (known as hyperparameter optimization), vastly accelerating the development cycle.
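The fan-out itself reduces to expanding a hyperparameter grid into one job configuration per combination; `launch_training_job` below is a hypothetical stand-in for whatever call submits a (possibly Spot-backed) training job:

```python
from itertools import product

# Hypothetical tuning grid: 3 learning rates x 2 batch sizes = 6 jobs.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64],
}

# Expand the grid into one parameter set per parallel training job.
jobs = [dict(zip(grid, combo)) for combo in product(*grid.values())]

for params in jobs:
    # launch_training_job(params)  # hypothetical: one job per combination
    pass
```

Each job runs independently, so total wall-clock time approaches that of a single training run rather than six sequential ones.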

  • Data scientists often have a preferred method of developing models. Notebook-style IDEs such as Jupyter Notebook, Apache Zeppelin, or RStudio can present the IDE in the user's local browser while commands are executed on remote resources within the AWS Cloud.

  • Avoid moving datasets frequently between the cloud and the data scientist's local workstation. A preferred pattern is to keep all data at rest in Amazon S3, loading only the data needed for development and retaining ETL pipelines in the cloud.

  • Utilizing Amazon API Gateway, Amazon SageMaker, or Amazon ECS to deploy your models allows routing a portion of incoming traffic to a new version of the ML model, verifying its performance in production before routing all traffic to the new version.
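With SageMaker, this canary-style rollout can be expressed as production variants with traffic weights on a single endpoint; the model names below are hypothetical, and the weight shift would later be performed with `UpdateEndpointWeightsAndCapacities`:

```python
# Sketch of a SageMaker endpoint configuration splitting traffic 90/10
# between the current model (v1) and a new candidate (v2).
production_variants = [
    {
        "VariantName": "model-v1",
        "ModelName": "churn-model-v1",      # hypothetical model name
        "InitialInstanceCount": 2,
        "InstanceType": "ml.m5.large",
        "InitialVariantWeight": 0.9,        # 90% of traffic stays on v1
    },
    {
        "VariantName": "model-v2",
        "ModelName": "churn-model-v2",      # hypothetical model name
        "InitialInstanceCount": 1,
        "InstanceType": "ml.m5.large",
        "InitialVariantWeight": 0.1,        # 10% canary traffic to v2
    },
]
```

Once v2's production metrics look healthy, its weight is increased gradually until it receives all traffic, at which point v1 can be retired.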