Data science experimentation environment - Build a Secure Enterprise Machine Learning Platform on AWS

Data science experimentation environment

Data scientists and ML engineers use the experimentation environment for individual or team-based experimentation on different data science projects. The environment needs to provide services and tools for data querying and analysis, code authoring, data processing, model training and tuning, container development, model hosting and testing, source code control, and access to data science library packages.

Data science services

The following services can be provisioned in the experimentation environment in the data science account:

Amazon Athena — Data scientists and ML engineers can use Amazon Athena to query data in the data lake directly using standard SQL. Athena provides an easy way to quickly explore and analyze large amounts of data directly in Amazon S3.
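As a sketch, a query can be submitted to Athena with the AWS SDK for Python (boto3); the database name and S3 results location below are placeholders for your own data lake settings, and the submission itself requires AWS credentials with Athena and S3 permissions:

```python
import time

# Hypothetical names -- replace with your data lake database and results bucket.
DATABASE = "datalake_db"
OUTPUT_LOCATION = "s3://my-athena-results-bucket/queries/"


def build_query(table: str, limit: int = 10) -> str:
    """Build a simple exploratory SQL query against a data lake table."""
    return f"SELECT * FROM {table} LIMIT {limit}"


def run_athena_query(sql: str) -> str:
    """Submit a query to Athena and poll until it reaches a terminal state."""
    import boto3  # imported here so the pure helper above needs no AWS SDK
    athena = boto3.client("athena")
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(1)
```

Query results land in the configured S3 output location, where they can be read back with standard S3 tooling.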

Amazon SageMaker Notebook Instance and SageMaker Studio — Data scientists and ML engineers can use a SageMaker Notebook instance or SageMaker Studio to author code, prepare data, and run model training directly inside a Jupyter environment. SageMaker Notebook instances and SageMaker Studio come pre-installed with a list of library packages and kernels (such as TensorFlow, Scikit-learn, and PyTorch) for easy model development.

If you use a SageMaker Notebook instance for code authoring and experimentation, the SageMaker Notebook instance provides Lifecycle script support, which can be used for:

  • Setting up environment variables such as VPC, Security Group, and KMS keys

  • Configuring the Code Repo connection

  • Configuring a connection to the internal package management server (such as ArtiFactory or CodeArtifact)

This sample CloudFormation script creates a Lifecycle configuration that sets up environment variables, configures a GitHub connection, and configures a PyPI mirror server. If you use SageMaker Studio, you can use a SageMaker Studio custom image to set up a connection to a private package management server. A SageMaker Studio custom image allows you to create a Docker image with your own selection of kernels, language packages, and other files to run a Jupyter notebook in SageMaker Studio. For more information, see Private package installation in Amazon SageMaker running in internet-free mode.
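Lifecycle configurations can also be registered programmatically. The sketch below creates one with boto3; SageMaker expects the on-start shell script base64-encoded, and the KMS alias and internal mirror URL in the script are placeholder assumptions:

```python
import base64

# Hypothetical environment settings -- adjust for your organization.
ON_START_SCRIPT = """#!/bin/bash
set -e
# Expose infrastructure settings to notebooks as environment variables
echo 'export KMS_KEY_ID=alias/datascience-key' >> /etc/profile.d/env.sh
# Point pip at the internal PyPI mirror instead of the public index
sudo -u ec2-user pip config set global.index-url https://pypi.internal.example.com/simple
"""


def encode_script(script: str) -> str:
    """SageMaker expects lifecycle scripts as base64-encoded content."""
    return base64.b64encode(script.encode("utf-8")).decode("utf-8")


def create_lifecycle_config(name: str) -> None:
    """Register the lifecycle configuration (requires AWS credentials)."""
    import boto3
    sm = boto3.client("sagemaker")
    sm.create_notebook_instance_lifecycle_config(
        NotebookInstanceLifecycleConfigName=name,
        OnStart=[{"Content": encode_script(ON_START_SCRIPT)}],
    )
```

The configuration name is then referenced when creating or updating a notebook instance.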

To use SageMaker Notebook instances or Studio in an enterprise environment, data scientists often need to provide infrastructure configuration information such as VPC configurations, KMS keys, and IAM roles for processing, training, and hosting. To pass configurations to SageMaker training jobs, processing jobs, or model endpoints, consider using AWS Systems Manager Parameter Store to store these parameters in an encrypted fashion, and use a Python script to retrieve them via APIs. The Python script can be loaded onto the SageMaker notebook instance at startup using lifecycle configurations, or in a SageMaker Studio custom image.
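Such a retrieval script might look like the following sketch; the parameter names under the `/datascience/dev` prefix are assumptions for illustration, and the lookup requires AWS credentials with `ssm:GetParameter` permission:

```python
def get_ml_parameters(prefix: str = "/datascience/dev") -> dict:
    """Fetch infrastructure parameters (VPC, KMS key, IAM role) from
    SSM Parameter Store with decryption enabled for SecureString values."""
    import boto3
    ssm = boto3.client("ssm")
    names = ["vpc/subnets", "vpc/security_groups", "kms_key_id", "execution_role"]
    return {
        n: ssm.get_parameter(Name=f"{prefix}/{n}", WithDecryption=True)
              ["Parameter"]["Value"]
        for n in names
    }


def split_list_param(value: str) -> list:
    """Parameters like subnet IDs are often stored as comma-separated
    strings; split them into the lists SageMaker APIs expect."""
    return [v.strip() for v in value.split(",") if v.strip()]
```

Downstream code can then pass `split_list_param(params["vpc/subnets"])` directly into training or processing job configurations.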

  • Amazon SageMaker Data Wrangler (Data Wrangler) — Data Wrangler is a feature of SageMaker Studio for importing, transforming, visualizing, and analyzing data. Data scientists can use Data Wrangler to perform data preparation tasks such as plotting histograms and scatter charts against datasets, running data transformations such as one-hot encoding, or handling data outliers.

  • Amazon SageMaker Processing — Data scientists and ML engineers can use SageMaker processing for large data processing jobs. SageMaker processing provides built-in open-source containers for Scikit-learn and Spark. Data scientists can also bring custom containers to run processing jobs.
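In a secured environment, a processing job typically pins its VPC, security groups, and KMS keys explicitly. The helper below builds a `CreateProcessingJob` request for boto3; all names (job name, image URI, role ARN, S3 paths) are placeholders:

```python
def build_processing_job(job_name, image_uri, role_arn, input_s3, output_s3,
                         subnets, security_groups, kms_key_id):
    """Build a CreateProcessingJob request that keeps traffic inside the
    VPC and encrypts volumes and outputs with a customer-managed KMS key."""
    return {
        "ProcessingJobName": job_name,
        "RoleArn": role_arn,
        "AppSpecification": {"ImageUri": image_uri},
        "ProcessingInputs": [{
            "InputName": "input",
            "S3Input": {
                "S3Uri": input_s3,
                "LocalPath": "/opt/ml/processing/input",
                "S3DataType": "S3Prefix",
                "S3InputMode": "File",
            },
        }],
        "ProcessingOutputConfig": {
            "Outputs": [{
                "OutputName": "output",
                "S3Output": {
                    "S3Uri": output_s3,
                    "LocalPath": "/opt/ml/processing/output",
                    "S3UploadMode": "EndOfJob",
                },
            }],
            "KmsKeyId": kms_key_id,  # encrypt job outputs in S3
        },
        "ProcessingResources": {"ClusterConfig": {
            "InstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "VolumeSizeInGB": 30,
            "VolumeKmsKeyId": kms_key_id,  # encrypt attached EBS volumes
        }},
        "NetworkConfig": {"VpcConfig": {
            "Subnets": subnets,
            "SecurityGroupIds": security_groups,
        }},
    }


# Submitting the request requires AWS credentials:
# import boto3
# boto3.client("sagemaker").create_processing_job(**build_processing_job(...))
```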

  • Amazon SageMaker Feature Store — SageMaker Feature Store can help data scientists share common data features with other data scientists across teams for model training and inference. SageMaker Feature Store supports both an offline feature store for training and an online feature store for online inferencing.

  • Amazon SageMaker Training / Tuning service — For model training and tuning, SageMaker provides fully managed model training and tuning services. It provides a list of built-in algorithms for different machine learning tasks such as classification, regression, clustering, computer vision, natural language processing, time series, and anomaly detection. It also provides a list of fully managed training open-source containers for TensorFlow, PyTorch, Apache MXNet, and Scikit-learn. Custom training containers can also be used for model training and tuning.
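A training job can be launched with the high-level SageMaker Python SDK, as in the sketch below. The image URI, role ARN, S3 paths, and network settings are placeholders, and running `launch_training` requires the `sagemaker` package and AWS credentials; the hyperparameter helper is pure Python:

```python
def make_hyperparameters(max_depth=5, eta=0.2, num_round=100):
    """SageMaker passes hyperparameters to the container as string
    key/value pairs, so everything is stringified here."""
    return {"max_depth": str(max_depth), "eta": str(eta),
            "objective": "binary:logistic", "num_round": str(num_round)}


def launch_training(role_arn, image_uri, train_s3, output_s3, subnets, sgs):
    """Launch a VPC-attached training job with the SageMaker Python SDK."""
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput

    est = Estimator(
        image_uri=image_uri,          # built-in algorithm or custom container
        role=role_arn,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path=output_s3,
        subnets=subnets,              # keep training traffic inside the VPC
        security_group_ids=sgs,
        hyperparameters=make_hyperparameters(),
    )
    est.fit({"train": TrainingInput(train_s3, content_type="text/csv")})
    return est
```

The same estimator object can later be reused for hyperparameter tuning or endpoint deployment.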

  • Amazon SageMaker Clarify (SageMaker Clarify) — Data scientists and ML engineers can use SageMaker Clarify to compute pre-training and post-training bias metrics and feature attribution for explainability.

  • Amazon SageMaker Hosting — Data scientists and ML engineers can test model deployment and real-time inference using the SageMaker hosting service. Models trained using the SageMaker built-in algorithms and managed containers can be deployed quickly using a single API command. Custom model inference containers can also be brought in to host custom models.
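Once an endpoint is deployed, it can be tested with a real-time invocation. The sketch below serializes feature rows to the CSV format many SageMaker built-in algorithm containers accept and calls the endpoint via boto3; the endpoint name is a placeholder and the call requires AWS credentials:

```python
import csv
import io


def to_csv_payload(rows):
    """Serialize feature rows into a CSV request body."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerows(rows)
    return buf.getvalue().strip()


def invoke(endpoint_name, rows):
    """Send a real-time inference request to a deployed endpoint."""
    import boto3
    runtime = boto3.client("sagemaker-runtime")
    resp = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=to_csv_payload(rows),
    )
    return resp["Body"].read().decode("utf-8")
```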

  • Amazon SageMaker Pipelines — SageMaker Pipelines is a fully managed CI/CD service for machine learning. It can be used to automate various steps of the ML workflow such as data processing/transformation, training and tuning, and model deployment.

  • AWS Step Functions — AWS Step Functions is a fully managed workflow orchestration service. It comes with a data science SDK that provides easy integration of SageMaker services such as processing, training, tuning, and hosting. Data scientists and ML engineers can use AWS Step Functions to build workflow pipelines to automate the different steps (such as data processing and model training) in the experimentation environments.

  • Code repository — A code repository such as Bitbucket or CodeCommit should be provided to data scientists and ML engineers for code management and version control. The code repository can reside in the Shared Services account or on-premises, and it is accessible from the data science account.

  • Amazon Elastic Container Registry (Amazon ECR) — Amazon ECR is used to store training, processing, and inference containers. Data scientists and ML engineers can use Amazon ECR in the data science account to manage custom containers for experimentation.

  • Artifacts repository — Organizations with strict internet access control often do not allow their users to download and install library packages directly from public package repositories such as the Python Package Index (PyPI) or Anaconda. Private package repositories such as Artifactory, AWS CodeArtifact, or mirrored PyPI servers can be created to support private package management. These servers can host private packages and also mirror public package sites, such as PyPI for pip, or the Anaconda main channel and conda-forge channel for Conda.
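To route notebook installs through such a mirror, a pip configuration file can be generated and dropped onto the instance (for example, from a lifecycle script). The mirror URL below is a hypothetical internal endpoint:

```python
from pathlib import Path

# Hypothetical internal mirror -- substitute your organization's endpoint.
PIP_INDEX = "https://artifactory.internal.example.com/api/pypi/pypi-remote/simple"


def pip_conf(index_url: str) -> str:
    """Render a pip.conf that routes all installs through a private mirror."""
    host = index_url.split("/")[2]  # extract the host from the URL
    return (
        "[global]\n"
        f"index-url = {index_url}\n"
        f"trusted-host = {host}\n"   # trust only the internal host
    )


def install_pip_conf(index_url: str,
                     path: Path = Path.home() / ".pip" / "pip.conf") -> None:
    """Write the configuration to the user's pip config location."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(pip_conf(index_url))
```

A similar approach works for Conda by writing a `.condarc` that points the `channels` list at the internal mirror.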


Core components in the experimentation environment

Enabling self-service

To improve onboarding efficiency for data scientists and ML engineers, consider developing a self-service capability using AWS Service Catalog. AWS Service Catalog enables you to create self-service portfolios and products using CloudFormation templates, and data scientists can directly request access to SageMaker Notebook instances, SageMaker Studio, and other related AWS services without going through manual provisioning. Follow the instructions in Enable self-service, secured data science using Amazon SageMaker notebooks and AWS Service Catalog to enable self-service for data science products.
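Once a product is published, an end user (or an onboarding script acting on their behalf) can provision it programmatically. In the sketch below, the product and artifact IDs are placeholders, and the call requires AWS credentials with Service Catalog end-user permissions:

```python
def build_provisioning_params(params: dict) -> list:
    """Service Catalog expects provisioning parameters as Key/Value pairs."""
    return [{"Key": k, "Value": v} for k, v in params.items()]


def provision_notebook(product_id, artifact_id, name, params):
    """Self-service request for a data science product (for example,
    a SageMaker notebook) published in AWS Service Catalog."""
    import boto3
    sc = boto3.client("servicecatalog")
    return sc.provision_product(
        ProductId=product_id,
        ProvisioningArtifactId=artifact_id,
        ProvisionedProductName=name,
        ProvisioningParameters=build_provisioning_params(params),
    )
```

The provisioning parameters map onto the parameters declared in the product's underlying CloudFormation template, such as instance type or team name.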


Enabling self-service for data science products with AWS Service Catalog