Working with Spark Clusters - Amazon FinSpace

Working with Spark Clusters

FinSpace simplifies working with Spark clusters by offering easy-to-use cluster configuration templates that let you launch, connect to, resize, and terminate clusters without having to manage the underlying infrastructure. Every FinSpace user with the Access Notebooks and Manage Clusters permissions can instantiate one cluster.

Note

To use notebooks and Spark clusters, you must be a Superuser or a member of a group with the necessary permissions: Access Notebooks and Manage Clusters.

You can choose one of the following cluster configuration templates:

  • Small

  • Medium

  • Large

  • XLarge

  • 2XLarge

Note

You are charged by the minute for using Spark clusters. Terminate your Spark cluster when you are done using it.

Import FinSpace Cluster Management Library

Use the following code to import the cluster management library in a notebook.

%local
from aws.finspace.cluster import FinSpaceClusterManager

Start a Spark Cluster

Use the following procedure to start and connect to a Spark cluster.

  1. Use the following code to start and connect your notebook to a Spark cluster.

    %local
    from aws.finspace.cluster import FinSpaceClusterManager

    finspace_clusters = FinSpaceClusterManager()
    finspace_clusters.auto_connect()

    For a newly created cluster, the output should be similar to the following.

    Cluster is starting. It will be operational in approximately 5 to 8 minutes
    Started cluster with cluster ID: 8x6zd9cq and state: STARTING
    ......
    cleared existing credential location
    Persisted krb5.conf secret to /etc/krb5.conf
    re-establishing connection...
    Persisted keytab secret to /home/sagemaker-user/livy.keytab
    Authenticated to Spark cluster
    Persisted Sparkmagic config to /home/sagemaker-user/.Sparkmagic/config.json
    Started Spark cluster with clusterId: 8x6zd9cq
    finished reloading all magics & configurations
    Persisted FinSpace cluster connection info to /home/sagemaker-user/.Sparkmagic/FinSpace_connection_info.json
    SageMaker Studio Environment is now connected to your FinSpace Cluster: 8x6zd9cq at GMT: 2021-01-15 02:13:50.

    Expect a startup time of about 5 to 8 minutes when you instantiate a cluster for the first time. Once a cluster is running, any newly created notebook detects and connects to it when auto_connect() is called, and this operation is instantaneous.

List Details for Spark Clusters

Use the following code to list the names and details of your Spark clusters.

%local
finspace_clusters.list()

The output should be similar to the following.

{
  'clusters': [
    {
      'clusterId': '8x6zd9cq',
      'clusterStatus': {'state': 'RUNNING', 'reason': 'Started successfully', 'details': ''},
      'name': 'hab-cluster-3e51',
      'currentTemplate': 'FinSpace-Small',
      'requestedTemplate': 'FinSpace-Small',
      'clusterTerminationTime': 1610676314,
      'createdTimestamp': 1610676374420,
      'modifiedTimestamp': 1610676823805
    },
    {
      'clusterId': '3ysaqx3g',
      'clusterStatus': {'state': 'TERMINATED', 'reason': 'Initiated by user', 'details': ''},
      'name': 'hab-cluster-c4f9',
      'currentTemplate': 'FinSpace-Small',
      'requestedTemplate': 'FinSpace-Small',
      'clusterTerminationTime': 1610478542,
      'createdTimestamp': 1610478602457,
      'modifiedTimestamp': 1610514182552
    }
  ]
}

In the output above, cluster 8x6zd9cq is a Small cluster in the RUNNING state, and cluster 3ysaqx3g is a Small cluster in the TERMINATED state.
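
The dictionary returned by list() can also be inspected programmatically rather than read by eye. The following is a minimal sketch, not part of the FinSpace library: the helper name summarize_clusters is hypothetical, the sample data is abridged from the output above, and the code assumes that createdTimestamp is expressed in epoch milliseconds (inferred from its magnitude, not confirmed by this page).

```python
from datetime import datetime, timezone

# Sample data abridged from the list() output shown above.
listing = {
    "clusters": [
        {
            "clusterId": "8x6zd9cq",
            "clusterStatus": {"state": "RUNNING", "reason": "Started successfully", "details": ""},
            "name": "hab-cluster-3e51",
            "currentTemplate": "FinSpace-Small",
            "createdTimestamp": 1610676374420,
        },
        {
            "clusterId": "3ysaqx3g",
            "clusterStatus": {"state": "TERMINATED", "reason": "Initiated by user", "details": ""},
            "name": "hab-cluster-c4f9",
            "currentTemplate": "FinSpace-Small",
            "createdTimestamp": 1610478602457,
        },
    ]
}


def summarize_clusters(listing):
    """Return one (clusterId, state, template, created-in-UTC) tuple per cluster."""
    rows = []
    for cluster in listing.get("clusters", []):
        # Assumption: createdTimestamp is epoch milliseconds, so divide by 1000.
        created = datetime.fromtimestamp(cluster["createdTimestamp"] / 1000, tz=timezone.utc)
        rows.append((
            cluster["clusterId"],
            cluster["clusterStatus"]["state"],
            cluster["currentTemplate"],
            created.strftime("%Y-%m-%d %H:%M:%S"),
        ))
    return rows


summary = summarize_clusters(listing)
# Keep only the IDs of clusters that are currently running.
running_ids = [cluster_id for cluster_id, state, _, _ in summary if state == "RUNNING"]
for row in summary:
    print(row)
```

In a notebook, the result of finspace_clusters.list() would take the place of the sample dictionary.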

Resize Spark Cluster

Scale your Spark cluster up or down based on your compute needs and the volume of data you need to analyze.

Use the following procedure to resize a cluster.

  1. Use the following code to update your cluster to the Large size.

    %local
    finspace_clusters.update('8x6zd9cq','Large')

    The output should be similar to the following.

    {'clusterId': '8x6zd9cq', 'clusterStatus': {'state': 'UPDATING', 'reason': 'Initiated by user'}}
  2. The update() operation runs asynchronously, so you can continue to work on the cluster while the update completes.

  3. Check the status of the update operation using the list() function.

    {
      'clusters': [
        {
          'clusterId': '8x6zd9cq',
          'clusterStatus': {'state': 'UPDATING', 'reason': 'Initiated by user', 'details': ''},
          'name': 'hab-cluster-3e51',
          'currentTemplate': 'Small',
          'requestedTemplate': 'Large',
          'clusterTerminationTime': 1610676314,
          'createdTimestamp': 1610676374420,
          'modifiedTimestamp': 1610682765327
        }
      ]
    }
  4. In the output above, cluster 8x6zd9cq is being updated from Small to Large.

Terminate Spark Cluster

Terminate your Spark cluster once your work is done, so that you don’t incur additional charges.

Use the following procedure to terminate your Spark cluster.

  1. Use the following code to terminate a cluster.

    %local
    finspace_clusters.terminate('8x6zd9cq')
  2. You can check the state of the cluster using the list() function.
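
The two steps above can be combined into one small helper that terminates a cluster and then reads back its state. This is only a sketch: terminate_and_report is a hypothetical name, and FakeClusterManager is an in-memory stand-in that mimics the terminate() and list() calls shown on this page so that the example runs without a real cluster. The stand-in switches directly to TERMINATED, which is a simplification; a real cluster may report other states first.

```python
def terminate_and_report(manager, cluster_id):
    """Call terminate() and return the state that list() then reports.

    Returns None if list() has no entry for cluster_id.
    """
    manager.terminate(cluster_id)
    for cluster in manager.list()["clusters"]:
        if cluster["clusterId"] == cluster_id:
            return cluster["clusterStatus"]["state"]
    return None


class FakeClusterManager:
    """Hypothetical stand-in for FinSpaceClusterManager, for illustration only."""

    def __init__(self):
        self._states = {"8x6zd9cq": "RUNNING"}

    def terminate(self, cluster_id):
        # Simplification: a known cluster flips straight to TERMINATED.
        if cluster_id in self._states:
            self._states[cluster_id] = "TERMINATED"

    def list(self):
        return {
            "clusters": [
                {"clusterId": cid, "clusterStatus": {"state": state}}
                for cid, state in self._states.items()
            ]
        }


state = terminate_and_report(FakeClusterManager(), "8x6zd9cq")
print(state)
```

With a real cluster, finspace_clusters would be passed as the manager argument, and the returned state may need to be checked again later if termination has not yet completed.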