Create a cluster with Apache Spark


The following procedure creates a cluster with Spark installed using Quick Options in the Amazon EMR console.

Alternatively, you can use Advanced Options to further customize your cluster setup, or to submit steps that programmatically install applications and then run custom applications. With either cluster creation option, you can choose to use AWS Glue as your Spark SQL metastore. See Use the AWS Glue Data Catalog with Spark on Amazon EMR for more information.

To launch a cluster with Spark installed

Open the Amazon EMR console at https://console.aws.amazon.com/emr.
Choose Create cluster to use Quick Options.
Enter a Cluster name. Your cluster name cannot contain the characters <, >, $, |, or ` (backtick).
For Software Configuration, choose a Release option.
For Applications, choose the Spark application bundle.
Select other options as necessary, and then choose Create cluster.
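
The naming rule in the steps above can be checked locally before a name is entered or passed to a script. The following is a minimal sketch in plain Java, assuming that only the five listed characters are disallowed. The class and method names (ClusterNameValidator, isValidName) are hypothetical and are not part of the Amazon EMR API, and the service may enforce further constraints that this check does not cover.

```java
// Hypothetical helper for illustration; not part of the AWS SDK or the Amazon EMR API.
final class ClusterNameValidator {

    // Characters that the console procedure says a cluster name cannot contain.
    private static final String FORBIDDEN = "<>$|`";

    private ClusterNameValidator() {}

    // Returns true when the name is non-empty and contains no forbidden character.
    // Assumption: a null or empty name is treated as invalid.
    static boolean isValidName(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        for (int i = 0; i < name.length(); i++) {
            if (FORBIDDEN.indexOf(name.charAt(i)) >= 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A plain name passes; a name containing a pipe character does not.
        System.out.println(isValidName("Spark cluster"));
        System.out.println(isValidName("Spark|cluster"));
    }
}
```

A check like this only catches the characters named in the procedure; the console or API remains the authority on whether a name is accepted.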

Note
To configure Spark when you are creating the cluster, see Configure Spark.

To launch a cluster with Spark installed using the AWS CLI

Create the cluster with the following command.


aws emr create-cluster --name "Spark cluster" --release-label emr-7.8.0 --applications Name=Spark \
--ec2-attributes KeyName=myKey --instance-type m5.xlarge --instance-count 3 --use-default-roles

Note

Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace them with a caret (^).
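
The conversion described in the note can also be done mechanically when a command is stored in a script or template. The following is a rough sketch in plain Java, assuming the command text uses LF line endings and that each continuation backslash is the last character on its line. The class and method names (LineContinuation, toWindows, toSingleLine) are hypothetical and are not part of the AWS CLI or SDK.

```java
// Hypothetical helper for illustration; not part of the AWS CLI or the AWS SDK.
final class LineContinuation {

    private LineContinuation() {}

    // Replaces a trailing backslash on each line with a caret,
    // the line continuation character of the Windows command prompt.
    static String toWindows(String command) {
        String[] lines = command.split("\n", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            if (line.endsWith("\\")) {
                line = line.substring(0, line.length() - 1) + "^";
            }
            out.append(line);
            if (i < lines.length - 1) {
                out.append("\n");
            }
        }
        return out.toString();
    }

    // Removes the continuation characters by joining continued lines
    // into one line separated by single spaces.
    static String toSingleLine(String command) {
        return command.replaceAll("\\s*\\\\\\r?\\n\\s*", " ");
    }

    public static void main(String[] args) {
        String linux = "aws emr create-cluster --name \"Spark cluster\" \\\n"
                + "--use-default-roles";
        System.out.println(toWindows(linux));
        System.out.println(toSingleLine(linux));
    }
}
```

Only the continuation characters are rewritten here; quoting rules also differ between Linux shells and Windows, and this sketch does not attempt to translate them.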

To launch a cluster with Spark installed using the SDK for Java

Specify Spark as an application with SupportedProductConfig used in RunJobFlowRequest.

The following example shows how to create a cluster with Spark using Java.



import com.amazonaws.AmazonClientException;
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.*;
import com.amazonaws.services.elasticmapreduce.util.StepFactory;

public class Main {

        public static void main(String[] args) {
                AWSCredentials credentials_profile = null;
                try {
                        credentials_profile = new ProfileCredentialsProvider("default").getCredentials();
                } catch (Exception e) {
                        throw new AmazonClientException(
                                        "Cannot load credentials from .aws/credentials file. " +
                                                        "Make sure that the credentials file exists and the profile name is specified within it.",
                                        e);
                }

                AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.standard()
                                .withCredentials(new AWSStaticCredentialsProvider(credentials_profile))
                                .withRegion(Regions.US_WEST_1)
                                .build();

                // create a step to enable debugging in the AWS Management Console
                StepFactory stepFactory = new StepFactory();
                StepConfig enabledebugging = new StepConfig()
                                .withName("Enable debugging")
                                .withActionOnFailure("TERMINATE_JOB_FLOW")
                                .withHadoopJarStep(stepFactory.newEnableDebuggingStep());

                Application spark = new Application().withName("Spark");

                RunJobFlowRequest request = new RunJobFlowRequest()
                                .withName("Spark Cluster")
                                .withReleaseLabel("emr-5.20.0")
                                .withSteps(enabledebugging)
                                .withApplications(spark)
                                .withLogUri("s3://path/to/my/logs/")
                                .withServiceRole("EMR_DefaultRole")
                                .withJobFlowRole("EMR_EC2_DefaultRole")
                                .withInstances(new JobFlowInstancesConfig()
                                                .withEc2SubnetId("subnet-12ab3c45")
                                                .withEc2KeyName("myEc2Key")
                                                .withInstanceCount(3)
                                                .withKeepJobFlowAliveWhenNoSteps(true)
                                                .withMasterInstanceType("m4.large")
                                                .withSlaveInstanceType("m4.large"));
                RunJobFlowResult result = emr.runJobFlow(request);
                // print only the cluster (job flow) ID rather than the whole result object
                System.out.println("The cluster ID is " + result.getJobFlowId());
        }
}