Implementation Considerations - Machine Learning for Telecommunication


The Jupyter Notebook

The Machine Learning for Telecommunication solution uses Jupyter Notebook, an open source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.

The custom Jupyter notebooks develop a machine learning (ML) model that can identify and predict insights such as call-characteristic anomalies. The notebooks use a default set of predefined call detail record (CDR) features to train and test the model. They then compute the area under the curve (AUC) metric from the model's scores on the test data, plot the receiver operating characteristic (ROC) curve to help you understand model performance, and report the false positives, false negatives, true positives, and true negatives in a confusion matrix. You can modify the notebook code and features to create different ML models and related visualizations for your specific use cases.
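The evaluation steps described above can be sketched in plain Python. This is an illustrative example with made-up labels and scores, not the notebooks' actual code or data; the AUC is computed with the rank-based (Mann-Whitney) formulation.

```python
def confusion_matrix(labels, scores, threshold=0.5):
    """Count TP, FP, TN, FN for binary labels at a score threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test labels and model scores (illustrative only).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
print(confusion_matrix(labels, scores))  # → (2, 1, 2, 1)
print(auc(labels, scores))               # → 0.888...
```

The four confusion-matrix counts are the same quantities the notebooks report, and the AUC summarizes ranking quality independent of any single threshold.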


The solution includes five Jupyter notebooks that you can use with your own datasets to solve a business problem or use case.

Ml-Telecom-NaiveBayes.ipynb: This notebook demonstrates exploration of CDR data and classification with the Apache Spark ML Naive Bayes algorithm. The notebook reads the Parquet data with Amazon S3 Select for performance optimization, and showcases data exploration visualizations with bar charts and box plots. It classifies calls and predicts the probability of call disconnect reason code 16, and demonstrates ML evaluation steps with a confusion matrix and cross-validation.
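To illustrate the idea behind the Naive Bayes classification step, here is a minimal pure-Python sketch with Laplace smoothing. The toy rows, feature names, and label encoding (1 standing in for disconnect reason code 16) are invented for illustration; the notebook itself uses Spark ML on real CDR features.

```python
import math
from collections import Counter, defaultdict

# Toy CDR-like rows: (feature tuple, label); label 1 stands in for
# disconnect reason code 16. Illustrative data, not from the solution.
rows = [
    (("long", "intl"), 1), (("long", "intl"), 1),
    (("short", "intl"), 1), (("long", "local"), 0),
    (("short", "local"), 0), (("short", "local"), 0),
]

def train(rows, alpha=1.0):
    """Fit log priors and Laplace-smoothed log likelihoods per class."""
    class_counts = Counter(label for _, label in rows)
    feat_counts = defaultdict(Counter)  # label -> Counter of (position, value)
    for feats, label in rows:
        for i, v in enumerate(feats):
            feat_counts[label][(i, v)] += 1
    vocab = {(i, v) for feats, _ in rows for i, v in enumerate(feats)}
    model = {}
    for label, n in class_counts.items():
        prior = math.log(n / len(rows))
        # Each toy feature has 2 distinct values, hence the alpha * 2 term.
        lik = {f: math.log((feat_counts[label][f] + alpha) / (n + alpha * 2))
               for f in vocab}
        model[label] = (prior, lik)
    return model

def predict(model, feats):
    """Return the class with the highest posterior log score."""
    def score(label):
        prior, lik = model[label]
        return prior + sum(lik[(i, v)] for i, v in enumerate(feats))
    return max(model, key=score)

model = train(rows)
print(predict(model, ("long", "intl")))   # → 1
```

Spark ML's NaiveBayes applies the same prior-times-likelihood scoring at scale over a DataFrame of feature vectors.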

Ml-Telecom-PCA-KMeans.ipynb: This notebook demonstrates unsupervised learning on CDR data with Apache Spark ML using principal component analysis (PCA) and k-means. The notebook reads the Parquet data with Amazon S3 Select for performance optimization, and showcases feature extraction with PCA and an elbow graph for optimizing the choice of k for k-means clustering. It also explores the data features with a correlation matrix heat map, visualizes the k-means clustering data in 3D and 2D, and evaluates the k-means clusters by creating centroids.
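The elbow graph mentioned above plots the within-cluster sum of squared errors (SSE) against k; the "elbow" where SSE stops dropping sharply suggests a good k. A minimal sketch using plain Lloyd's k-means on invented 2D points (not CDR features, and not the notebook's Spark code):

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def kmeans_sse(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns the within-cluster SSE."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda j: dist2(p, centroids[j]))].append(p)
        centroids = [centroid(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return sum(min(dist2(p, c) for c in centroids) for p in points)

# Two well-separated toy blobs (illustrative, not CDR data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
elbow = [(k, kmeans_sse(points, k)) for k in (1, 2, 3)]
print(elbow)  # SSE drops sharply at k = 2, the "elbow"
```

Plotting these (k, SSE) pairs produces the elbow graph; with two natural blobs the curve bends at k = 2.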

Ml-Telecom-RandomForestClassifier.ipynb: This notebook uses the Apache Spark ML RandomForest classifier, a supervised learning algorithm, to classify the call disconnect reason in CDR data. The notebook reads the Parquet data with Amazon S3 Select for performance optimization, and showcases exploratory data analysis with feature selection and a correlation heat map matrix. It uses ChiSqSelector to determine feature importance, and evaluates the model with a confusion matrix, AUC, ROC, and precision-recall metrics.
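ChiSqSelector ranks categorical features by the chi-square statistic of their association with the label: the larger the statistic, the less plausible independence is. A small pure-Python sketch of the underlying computation on invented feature-vs-label counts (not the notebook's Spark code):

```python
def chi2(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    total = sum(sum(r) for r in table)
    row_sums = [sum(r) for r in table]
    col_sums = [sum(c) for c in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_sums[i] * col_sums[j] / total
            stat += (obs - exp) ** 2 / exp
    return stat

# Hypothetical counts: rows = feature value, columns = disconnect label.
informative = [[30, 5], [5, 30]]   # strongly associated with the label
noisy = [[18, 17], [17, 18]]       # nearly independent of the label
print(chi2(informative), chi2(noisy))
```

A selector keeps the top-scoring features (here, `informative` would be kept over `noisy`), which is how the notebook narrows the CDR features fed to the random forest.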

Ml-Telecom-TimeSeries-RandomForestClassifier-DeepAR.ipynb: This notebook demonstrates the DeepAR supervised learning algorithm for forecasting scalar time series with telecom data. The notebook uses a hybrid approach: Apache Spark ML classifies the call disconnect reason as an anomaly, and the Amazon SageMaker DeepAR algorithm performs time series prediction of the call disconnect reason anomaly. It also demonstrates how to prepare the dataset for time series training with DeepAR and how to use the trained model for inference, and it uses the context length and prediction length, in minute intervals, to showcase the time series prediction graph.
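DeepAR's training channel expects one JSON object per line, each with a "start" timestamp and a "target" array of values. A sketch of the dataset-preparation step, holding out the final prediction window for evaluation; the timestamp, series values, and 10-minute prediction length below are illustrative, not the notebook's actual settings:

```python
import json

# One per-minute anomaly-count series in DeepAR's JSON Lines input format.
# Values and timestamps are invented for illustration.
prediction_length = 10   # minutes the model will forecast
series = {"start": "2019-01-01 00:00:00",
          "target": [float(i % 7) for i in range(60)]}

# For the training channel, hold out the final prediction_length points.
train_series = dict(series, target=series["target"][:-prediction_length])
train_line = json.dumps(train_series)
print(len(series["target"]), len(train_series["target"]))  # 60 50
```

At inference time, the model conditions on the last context-length points of a series and emits probabilistic forecasts for the next prediction-length points, which is what the notebook's prediction graph shows.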

Ml-Telecom-RandomCutForest.ipynb: This notebook demonstrates unsupervised anomaly detection on time series data with the Amazon SageMaker Random Cut Forest algorithm. The notebook uses the call service duration from the CDR data for anomaly detection. It also showcases data exploration by plotting the data, hyperparameter tuning, calculating anomaly scores on the data, and the concepts of data shingling and prediction.
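Shingling converts a one-dimensional series into overlapping fixed-size windows, so each point handed to Random Cut Forest carries local temporal context. A minimal sketch with invented call durations (the shingle size and values are illustrative):

```python
def shingle(values, size):
    """Slide a window of `size` over the series; each shingle becomes one
    multi-dimensional point for anomaly scoring."""
    return [values[i:i + size] for i in range(len(values) - size + 1)]

# Hypothetical call service durations in seconds, with one obvious spike.
durations = [30, 31, 29, 30, 300, 30, 28]
print(shingle(durations, 4))
```

Every shingle containing the 300-second spike lands far from the others in the shingled space, which is what lets the algorithm assign those windows high anomaly scores.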

AWS Glue

The Machine Learning for Telecommunication solution invokes an AWS Glue job during deployment to convert the synthetic call detail record (CDR) data, or your own data, from CSV to Parquet format. By default, the AWS Glue job deploys 10 data processing units (DPUs) for preprocessing and can be run on a schedule. Note that when you use your own dataset, you must modify the schema definitions to match your data attributes so that the AWS Glue job runs successfully.

Demo Data

This solution includes synthetic demo IP Data Record (IPDR) datasets in Abstract Syntax Notation One (ASN.1) format and call detail record (CDR) format. You can choose to copy these datasets into the source bucket to run the demo of AWS Glue transformations and ML model predictions, or use your own datasets.

The call detail record sample datasets are described as follows:

  • Call Detail Record – Start: Indicates that the call has been successfully established/connected and a session has successfully started.

  • Call Detail Record – Stop: Indicates that a previously established session has now terminated.

  • Call Detail Record – Start Sample: A subset of the CDR start dataset.

  • Call Detail Record – Stop Sample: A subset of the CDR stop dataset.

Amazon SageMaker Instance Types

This solution features three instance types for Amazon SageMaker (ml.t2.medium, ml.m4.xlarge, and ml.p2.xlarge). By default, the solution deploys an ml.t2.medium instance. Note that you can provision an ml.p2.xlarge instance using the Notebook Instance Type template parameter. This instance allows Jupyter notebooks with large amounts of data to be processed faster, but incurs additional charges.