Create a cluster with Apache Spark

The following procedure creates a cluster with Spark installed using Quick Options in the Amazon EMR console.

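Alternatively, you can use Advanced Options to further customize your cluster setup, or submit a step to install an application programmatically and then run a custom application. With either of these cluster creation options, you can choose to use AWS Glue as your Spark SQL metastore. For more information, see Using the AWS Glue Data Catalog with Spark on Amazon EMR.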

To launch a cluster with Spark installed

Open the Amazon EMR console at https://console.aws.amazon.com/emr.
Choose Create cluster to use Quick Options.
Enter a Cluster name. The cluster name cannot contain the characters <, >, $, |, or ` (backtick).
For Software Configuration, choose a Release option.
For Applications, choose the Spark application bundle.
Select other options as needed, and then choose Create cluster.
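
The cluster name rule in the procedure above can be checked before a name is submitted. The following is a minimal sketch in plain Java and is not part of Amazon EMR or the AWS SDK. The class name ClusterNameCheck and the method isValidClusterName are hypothetical, and rejecting an empty name is an assumption that the procedure does not state.

```java
public class ClusterNameCheck {

    // Characters that the console procedure says a cluster name cannot contain.
    private static final String FORBIDDEN = "<>$|`";

    /** Returns true when the name contains none of the forbidden characters. */
    public static boolean isValidClusterName(String name) {
        if (name == null || name.isEmpty()) {
            return false; // assumption: an empty name is also rejected
        }
        for (char c : name.toCharArray()) {
            if (FORBIDDEN.indexOf(c) >= 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical sample names, used only for illustration.
        String[] names = {"Spark cluster", "spark|etl", "cost$report"};
        for (String name : names) {
            System.out.println(name + " -> " + isValidClusterName(name));
        }
    }
}
```

A check like this only catches the characters listed in the procedure. The service may apply other limits that this sketch does not cover.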

Note
To configure Spark when you are creating the cluster, see Configure Spark.

To launch a cluster with Spark installed using the AWS CLI

Create the cluster with the following command.


aws emr create-cluster --name "Spark cluster" --release-label emr-7.10.0 --applications Name=Spark \
--ec2-attributes KeyName=myKey --instance-type m5.xlarge --instance-count 3 --use-default-roles

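Note

Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace them with a caret (^).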
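
The substitution described in this note can be sketched in plain Java. This is an illustration only: the class ContinuationConverter and its two methods are hypothetical names, and the sample command in main is shortened from the one above. One method swaps each trailing backslash for a caret, and the other joins the continued lines into a single line.

```java
public class ContinuationConverter {

    /** Replaces the trailing Linux continuation backslash on each line with a Windows caret. */
    public static String toWindows(String command) {
        // (?m) lets $ match at the end of every line; [ \t]* tolerates trailing blanks.
        return command.replaceAll("(?m)\\\\[ \\t]*$", "^");
    }

    /** Removes each continuation backslash and its line break, giving a one-line command. */
    public static String toSingleLine(String command) {
        return command.replaceAll("\\\\[ \\t]*\\r?\\n\\s*", "");
    }

    public static void main(String[] args) {
        // Shortened sample command, used only for illustration.
        String linux = "aws emr create-cluster --name \"Spark cluster\" \\\n"
                + "--use-default-roles";
        System.out.println(toWindows(linux));
        System.out.println(toSingleLine(linux));
    }
}
```

Only a backslash at the end of a line is treated as a continuation character. A backslash elsewhere in the command is left unchanged.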

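To launch a cluster with Spark installed using the SDK for Java

Specify Spark as an application with SupportedProductConfig used in RunJobFlowRequest.

The following example shows how to create a cluster with Spark using Java. It passes an Application object to withApplications.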



import com.amazonaws.AmazonClientException;
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.*;
import com.amazonaws.services.elasticmapreduce.util.StepFactory;

public class Main {

        public static void main(String[] args) {
                AWSCredentials credentials_profile = null;
                try {
                        credentials_profile = new ProfileCredentialsProvider("default").getCredentials();
                } catch (Exception e) {
                        throw new AmazonClientException(
                                        "Cannot load credentials from .aws/credentials file. " +
                                                        "Make sure that the credentials file exists and the profile name is specified within it.",
                                        e);
                }

                AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.standard()
                                .withCredentials(new AWSStaticCredentialsProvider(credentials_profile))
                                .withRegion(Regions.US_WEST_1)
                                .build();

                // create a step to enable debugging in the AWS Management Console
                StepFactory stepFactory = new StepFactory();
                StepConfig enabledebugging = new StepConfig()
                                .withName("Enable debugging")
                                .withActionOnFailure("TERMINATE_JOB_FLOW")
                                .withHadoopJarStep(stepFactory.newEnableDebuggingStep());

                Application spark = new Application().withName("Spark");

                RunJobFlowRequest request = new RunJobFlowRequest()
                                .withName("Spark Cluster")
                                .withReleaseLabel("emr-5.20.0")
                                .withSteps(enabledebugging)
                                .withApplications(spark)
                                .withLogUri("s3://path/to/my/logs/")
                                .withServiceRole("EMR_DefaultRole")
                                .withJobFlowRole("EMR_EC2_DefaultRole")
                                .withInstances(new JobFlowInstancesConfig()
                                                .withEc2SubnetId("subnet-12ab3c45")
                                                .withEc2KeyName("myEc2Key")
                                                .withInstanceCount(3)
                                                .withKeepJobFlowAliveWhenNoSteps(true)
                                                .withMasterInstanceType("m4.large")
                                                .withSlaveInstanceType("m4.large"));
                RunJobFlowResult result = emr.runJobFlow(request);
                // getJobFlowId() returns the cluster ID itself rather than the whole result object
                System.out.println("The cluster ID is " + result.getJobFlowId());
        }
}
