Implementation Considerations - Real-Time Analytics with Spark Streaming

Implementation Considerations

Application Requirements

The Real-Time Analytics solution requires a working Spark Streaming application written in Java or Scala. We recommend that you build your application against the latest version of Apache Spark. You can deploy your application either as a JAR file or as an Apache Zeppelin notebook JSON file (see the following section for details).

Processing Engine

When you deploy the solution, you choose the processing engine for your custom Spark Streaming application: a JAR file or an Apache Zeppelin notebook JSON file.

Application JAR

If you choose to package your custom Spark Streaming application as a JAR file, the solution requires a spark-submit command to launch your application on the Amazon EMR cluster. You can upload a file containing the submit script, or you can enter the command directly in the AWS CloudFormation template parameter when you launch the solution.
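The submit command follows the standard spark-submit syntax. The sketch below shows the general shape; the bucket, JAR name, class name, and application arguments are placeholders, so substitute the values for your own application:

```shell
# Illustrative spark-submit command for launching a packaged Spark
# Streaming application on the EMR cluster. All names and paths below
# are placeholders -- replace them with your own values.
spark-submit \
  --deploy-mode cluster \
  --class com.example.MyStreamingApp \
  s3://my-app-bucket/my-streaming-app.jar \
  my-kinesis-stream us-east-1
```

Running in cluster deploy mode lets the driver run on the EMR cluster itself rather than on the node that issued the command.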

Apache Zeppelin

If you choose to use Zeppelin, you upload your custom Spark Streaming application as a notebook.json file, either from your local machine or from a URL you specify. The solution automatically creates the dependencies and configurations needed to visualize your real-time and batch data.
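For orientation, a Zeppelin notebook JSON file contains a notebook name and an array of paragraphs, where each paragraph holds the code for one cell. The sketch below is an assumed minimal structure for illustration only; the notebook name, titles, and paragraph contents are placeholders:

```json
{
  "name": "My Streaming Notebook",
  "paragraphs": [
    {
      "title": "Streaming logic",
      "text": "%spark\n// your Spark Streaming code here"
    },
    {
      "title": "Visualize results",
      "text": "%sql\nselect * from my_results_table"
    }
  ]
}
```

Paragraphs that begin with the %sql interpreter are the ones Zeppelin renders as charts and tables.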

Demo Application

The solution includes an additional AWS CloudFormation template, real-time-analytics-spark-streaming-demo.template, which deploys a demo application for testing purposes.

If you choose to run the demo application, this solution deploys a sample Spark Streaming application on the Amazon EMR cluster, a sample Kinesis stream, and a sample data producer that sends sample data to your Kinesis stream. The demo application is packaged as a JAR file, but it also includes a JSON file you can use to deploy the demo application through the Zeppelin UI. For more information, see Appendix A.

Single Application Deployment

The solution is designed to work with only one Spark Streaming application at a time. If you want to change applications, you must first stop the running application and then deploy the solution again with a new application. This also applies if you deploy the demo application: you must stop the running demo application before you can deploy the demo application JSON file, a custom JAR file, or a custom JSON file.

If you want to upload an application file that has the same name as an application from a previous deployment (for example, a new version of a JSON template), you must clear the application name from the Amazon DynamoDB table before you deploy the solution. See Tracking Amazon Kinesis Data Streams Application State in the Amazon Kinesis Data Streams Developer Guide for more information.
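You can clear the entry with the AWS CLI. The table name and key attribute in the sketch below are placeholders, not the solution's actual names; check the DynamoDB tables created in your account for the correct values:

```shell
# Remove the stale application entry so the new file can be deployed.
# The table name and key attribute are placeholders -- inspect the
# solution's DynamoDB tables in your account for the actual names.
aws dynamodb delete-item \
  --table-name my-solution-application-table \
  --key '{"ApplicationName": {"S": "my-streaming-app"}}'
```

You can also delete the item through the DynamoDB console if you prefer not to use the CLI.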

EMR Web Interfaces

When you launch an Amazon EMR cluster in a public subnet, the cluster's master node has a public DNS name, which allows you to create an SSH tunnel and securely access the Amazon EMR web interfaces. Because this solution deploys the Amazon EMR cluster in a private subnet, the master node does not have a public DNS name. To allow you to access the Amazon EMR web interfaces, this solution deploys a bastion host with a public IP address. You must configure dynamic port forwarding to connect through the bastion host. For more information, see View Web Interfaces Hosted on Amazon EMR Clusters in the Amazon EMR Management Guide.
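Dynamic port forwarding opens a local SOCKS proxy that routes browser traffic through the bastion host. The key pair file, user name, and local port below are examples; use the values from your own deployment:

```shell
# Open an SSH tunnel with dynamic port forwarding (-D) through the
# bastion host. -N keeps the session open without running a remote
# command. Key pair, user, and port are placeholders for illustration.
ssh -i ~/my-key-pair.pem -N -D 8157 ec2-user@<bastion-public-ip>
```

With the tunnel open, configure your browser (for example, with a SOCKS proxy extension) to send traffic to localhost on the chosen port so that requests to the EMR web interfaces resolve inside the VPC.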

Memory-Optimized Instance Types

We recommend memory-optimized Amazon Elastic Compute Cloud (Amazon EC2) instance types for Apache Spark workloads because Spark attempts to process as much data in memory as possible. By default, this solution deploys r3.xlarge instances for the Amazon EMR cluster nodes to deliver optimal performance.

Real-Time Data Visualization

If you choose Zeppelin as the processing engine for this solution, Zeppelin displays visualizations of your real-time data as it arrives. You can share these visualizations or publish them to external dashboards. For more information, see the Apache Zeppelin website.

Regional Deployment

The Real-Time Analytics solution uses AWS Lambda during initial configuration or when resources are updated or deleted. Therefore, you must deploy this solution in an AWS Region that supports AWS Lambda. For the most current AWS Lambda availability by region, see AWS service offerings by region.