AWS Glue examples using SDK for PHP - AWS SDK for PHP


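AWS Glue examples using SDK for PHP

The following code examples show you how to perform actions and implement common scenarios by using the AWS SDK for PHP with AWS Glue.

Actions are code excerpts from larger programs and must be run in context. While actions show you how to call individual service functions, you can see actions in context in their related scenarios and cross-service examples.

Scenarios are code examples that show you how to accomplish a specific task by calling multiple functions within the same service.

Each example includes a link to GitHub, where you can find instructions on how to set up and run the code in context.

Actions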

The following code example shows how to use CreateCrawler.

SDK for PHP
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.

    $crawlerName = "example-crawler-test-" . $uniqid;
    $role = $iamService->getRole("AWSGlueServiceRole-DocExample");
    $path = 's3://crawler-public-us-east-1/flight/2016/csv';
    $glueService->createCrawler($crawlerName, $role['Role']['Arn'], $databaseName, $path);

    public function createCrawler($crawlerName, $role, $databaseName, $path): Result
    {
        return $this->customWaiter(function () use ($crawlerName, $role, $databaseName, $path) {
            return $this->glueClient->createCrawler([
                'Name' => $crawlerName,
                'Role' => $role,
                'DatabaseName' => $databaseName,
                'Targets' => [
                    'S3Targets' => [[
                        'Path' => $path,
                    ]]
                ],
            ]);
        });
    }

  • For API details, see CreateCrawler in AWS SDK for PHP API Reference.

The following code example shows how to use CreateJob.

SDK for PHP
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.

    $role = $iamService->getRole("AWSGlueServiceRole-DocExample");
    $jobName = 'test-job-' . $uniqid;
    $scriptLocation = "s3://$bucketName/run_job.py";
    $job = $glueService->createJob($jobName, $role['Role']['Arn'], $scriptLocation);

    public function createJob($jobName, $role, $scriptLocation, $pythonVersion = '3', $glueVersion = '3.0'): Result
    {
        return $this->glueClient->createJob([
            'Name' => $jobName,
            'Role' => $role,
            'Command' => [
                'Name' => 'glueetl',
                'ScriptLocation' => $scriptLocation,
                'PythonVersion' => $pythonVersion,
            ],
            'GlueVersion' => $glueVersion,
        ]);
    }

  • For API details, see CreateJob in AWS SDK for PHP API Reference.

The following code example shows how to use DeleteCrawler.

SDK for PHP
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.

    echo "Delete the crawler.\n";
    $glueClient->deleteCrawler([
        'Name' => $crawlerName,
    ]);

    public function deleteCrawler($crawlerName)
    {
        return $this->glueClient->deleteCrawler([
            'Name' => $crawlerName,
        ]);
    }

  • For API details, see DeleteCrawler in AWS SDK for PHP API Reference.

The following code example shows how to use DeleteDatabase.

SDK for PHP
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.

    echo "Delete the databases.\n";
    $glueClient->deleteDatabase([
        'Name' => $databaseName,
    ]);

    public function deleteDatabase($databaseName)
    {
        return $this->glueClient->deleteDatabase([
            'Name' => $databaseName,
        ]);
    }

  • For API details, see DeleteDatabase in AWS SDK for PHP API Reference.

The following code example shows how to use DeleteJob.

SDK for PHP
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.

    echo "Delete the job.\n";
    $glueClient->deleteJob([
        'JobName' => $job['Name'],
    ]);

    public function deleteJob($jobName)
    {
        return $this->glueClient->deleteJob([
            'JobName' => $jobName,
        ]);
    }

  • For API details, see DeleteJob in AWS SDK for PHP API Reference.

The following code example shows how to use DeleteTable.

SDK for PHP
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.

    echo "Delete the tables.\n";
    foreach ($tables['TableList'] as $table) {
        $glueService->deleteTable($table['Name'], $databaseName);
    }

    public function deleteTable($tableName, $databaseName)
    {
        return $this->glueClient->deleteTable([
            'DatabaseName' => $databaseName,
            'Name' => $tableName,
        ]);
    }

  • For API details, see DeleteTable in AWS SDK for PHP API Reference.

The following code example shows how to use GetCrawler.

SDK for PHP
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.

    // Poll every 10 seconds until the crawler reports the READY state.
    echo "Waiting for crawler";
    do {
        $crawler = $glueService->getCrawler($crawlerName);
        echo ".";
        sleep(10);
    } while ($crawler['Crawler']['State'] != "READY");
    echo "\n";

    public function getCrawler($crawlerName)
    {
        return $this->customWaiter(function () use ($crawlerName) {
            return $this->glueClient->getCrawler([
                'Name' => $crawlerName,
            ]);
        });
    }

  • For API details, see GetCrawler in AWS SDK for PHP API Reference.
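
The polling loop above runs until the crawler reports READY and has no upper bound on how long it waits. As a minimal sketch of one way to cap the wait, the helper below polls a callable a limited number of times. The function name `waitForState`, its parameters, and the stubbed state sequence are illustrative assumptions and are not part of the SDK or the example repository.

```php
<?php

/**
 * Calls $fetchState repeatedly until it returns $targetState or the attempt
 * limit is reached. Returns true if the target state was observed.
 * (Hypothetical helper, not part of the AWS SDK for PHP.)
 */
function waitForState(
    callable $fetchState,
    string $targetState,
    int $maxAttempts = 30,
    int $delaySeconds = 10
): bool {
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        if ($fetchState() === $targetState) {
            return true;
        }
        // Only sleep if another attempt will follow.
        if ($attempt < $maxAttempts) {
            sleep($delaySeconds);
        }
    }
    return false;
}

// Stand-in for $glueService->getCrawler($crawlerName)['Crawler']['State'],
// so the sketch runs without AWS credentials.
$states = ['RUNNING', 'STOPPING', 'READY'];
$fetch = function () use (&$states) {
    return array_shift($states);
};

$ready = waitForState($fetch, 'READY', 5, 0);
echo $ready ? "Crawler is ready\n" : "Gave up waiting\n";
```

With the wrapper class shown above, the callable could instead return `$glueService->getCrawler($crawlerName)['Crawler']['State']`.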

The following code example shows how to use GetDatabase.

SDK for PHP
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.

    $databaseName = "doc-example-database-$uniqid";

    $database = $glueService->getDatabase($databaseName);
    echo "Found a database named " . $database['Database']['Name'] . "\n";

    public function getDatabase(string $databaseName): Result
    {
        return $this->customWaiter(function () use ($databaseName) {
            return $this->glueClient->getDatabase([
                'Name' => $databaseName,
            ]);
        });
    }

  • For API details, see GetDatabase in AWS SDK for PHP API Reference.

The following code example shows how to use GetJobRun.

SDK for PHP
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.

    $jobName = 'test-job-' . $uniqid;
    $outputBucketUrl = "s3://$bucketName";
    $runId = $glueService->startJobRun($jobName, $databaseName, $tables, $outputBucketUrl)['JobRunId'];

    // Poll every 10 seconds until the job run reaches one of the listed end states.
    echo "waiting for job";
    do {
        $jobRun = $glueService->getJobRun($jobName, $runId);
        echo ".";
        sleep(10);
    } while (!array_intersect([$jobRun['JobRun']['JobRunState']], ['SUCCEEDED', 'STOPPED', 'FAILED', 'TIMEOUT']));
    echo "\n";

    public function getJobRun($jobName, $runId, $predecessorsIncluded = false): Result
    {
        return $this->glueClient->getJobRun([
            'JobName' => $jobName,
            'RunId' => $runId,
            'PredecessorsIncluded' => $predecessorsIncluded,
        ]);
    }

  • For API details, see GetJobRun in AWS SDK for PHP API Reference.
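
The loop condition above uses `array_intersect` on a one-element array to test whether the job run has reached an end state. A possibly more readable equivalent is sketched below with `in_array`. The function name `isTerminalJobRunState` is an assumption made for illustration, and the default list simply mirrors the four states used in the example. If the API reference for your SDK version lists further end states, they can be passed as the second argument.

```php
<?php

/**
 * Returns true when $state is one of the end states that should stop polling.
 * (Hypothetical helper; the default list mirrors the example above.)
 */
function isTerminalJobRunState(
    string $state,
    array $terminalStates = ['SUCCEEDED', 'STOPPED', 'FAILED', 'TIMEOUT']
): bool {
    // Strict comparison avoids accidental matches through type juggling.
    return in_array($state, $terminalStates, true);
}

// Stand-in values for $jobRun['JobRun']['JobRunState'].
foreach (['RUNNING', 'SUCCEEDED', 'FAILED'] as $state) {
    $label = isTerminalJobRunState($state) ? 'finished' : 'still in progress';
    echo $state . ': ' . $label . "\n";
}
```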

The following code example shows how to use GetJobRuns.

SDK for PHP
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.

    $jobName = 'test-job-' . $uniqid;

    $jobRuns = $glueService->getJobRuns($jobName);

    public function getJobRuns($jobName, $maxResults = 0, $nextToken = ''): Result
    {
        // Only send the optional parameters when the caller supplied them.
        $arguments = ['JobName' => $jobName];
        if ($maxResults) {
            $arguments['MaxResults'] = $maxResults;
        }
        if ($nextToken) {
            $arguments['NextToken'] = $nextToken;
        }
        return $this->glueClient->getJobRuns($arguments);
    }

  • For API details, see GetJobRuns in AWS SDK for PHP API Reference.

The following code example shows how to use GetTables.

SDK for PHP
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.

    $databaseName = "doc-example-database-$uniqid";

    $tables = $glueService->getTables($databaseName);

    public function getTables($databaseName): Result
    {
        return $this->glueClient->getTables([
            'DatabaseName' => $databaseName,
        ]);
    }

  • For API details, see GetTables in AWS SDK for PHP API Reference.

The following code example shows how to use ListJobs.

SDK for PHP
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.

    $jobs = $glueService->listJobs();
    echo "Current jobs:\n";
    foreach ($jobs['JobNames'] as $jobsName) {
        echo "{$jobsName}\n";
    }

    public function listJobs($maxResults = null, $nextToken = null, $tags = []): Result
    {
        // Only send the optional parameters when the caller supplied them.
        $arguments = [];
        if ($maxResults) {
            $arguments['MaxResults'] = $maxResults;
        }
        if ($nextToken) {
            $arguments['NextToken'] = $nextToken;
        }
        if (!empty($tags)) {
            $arguments['Tags'] = $tags;
        }
        return $this->glueClient->listJobs($arguments);
    }

  • For API details, see ListJobs in AWS SDK for PHP API Reference.
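
The `listJobs` wrapper above accepts a `NextToken`, but the calling code reads only the first page of results. The following is a minimal sketch of how every page could be collected by following the token. The helper name `collectAllJobNames` and the stubbed pages are assumptions for illustration. The callable stands in for a call to the wrapper so that the sketch runs without AWS credentials.

```php
<?php

/**
 * Follows NextToken until no further page is reported and returns all names.
 * $listPage receives the current token (null for the first page) and returns
 * an array-like page with 'JobNames' and, optionally, 'NextToken'.
 * (Hypothetical helper, not part of the AWS SDK for PHP.)
 */
function collectAllJobNames(callable $listPage): array
{
    $names = [];
    $nextToken = null;
    do {
        $page = $listPage($nextToken);
        $names = array_merge($names, $page['JobNames'] ?? []);
        $nextToken = $page['NextToken'] ?? null;
    } while ($nextToken);
    return $names;
}

// Stubbed pages keyed by the token that requests them ('' is the first page).
$pages = [
    '' => ['JobNames' => ['job-a', 'job-b'], 'NextToken' => 'page-2'],
    'page-2' => ['JobNames' => ['job-c']],
];

$allNames = collectAllJobNames(function (?string $token) use ($pages) {
    return $pages[$token ?? ''];
});

print_r($allNames);
```

With the wrapper class shown above, the callable could instead return `$glueService->listJobs(null, $token)`.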

The following code example shows how to use StartCrawler.

SDK for PHP
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.

    $crawlerName = "example-crawler-test-" . $uniqid;

    $databaseName = "doc-example-database-$uniqid";

    $glueService->startCrawler($crawlerName);

    public function startCrawler($crawlerName): Result
    {
        return $this->glueClient->startCrawler([
            'Name' => $crawlerName,
        ]);
    }

  • For API details, see StartCrawler in AWS SDK for PHP API Reference.

The following code example shows how to use StartJobRun.

SDK for PHP
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.

    $jobName = 'test-job-' . $uniqid;

    $databaseName = "doc-example-database-$uniqid";

    $tables = $glueService->getTables($databaseName);

    $outputBucketUrl = "s3://$bucketName";
    $runId = $glueService->startJobRun($jobName, $databaseName, $tables, $outputBucketUrl)['JobRunId'];

    public function startJobRun($jobName, $databaseName, $tables, $outputBucketUrl): Result
    {
        // The first table returned by getTables is used as the job's input table.
        return $this->glueClient->startJobRun([
            'JobName' => $jobName,
            'Arguments' => [
                'input_database' => $databaseName,
                'input_table' => $tables['TableList'][0]['Name'],
                'output_bucket_url' => $outputBucketUrl,
                '--input_database' => $databaseName,
                '--input_table' => $tables['TableList'][0]['Name'],
                '--output_bucket_url' => $outputBucketUrl,
            ],
        ]);
    }

  • For API details, see StartJobRun in AWS SDK for PHP API Reference.
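
The `Arguments` map above lists each job parameter twice, once as a plain name and once with a `--` prefix. Glue job scripts conventionally read arguments whose names begin with `--`, so one way to avoid writing each name by hand is to derive the prefixed keys from a plain array. The sketch below does only that and does not reproduce the unprefixed duplicates from the example. The helper name `buildJobArguments` and the sample values are assumptions for illustration.

```php
<?php

/**
 * Returns a copy of $params in which every key starts with '--' and every
 * value is a string, the shape expected for a job run's 'Arguments' map.
 * (Hypothetical helper, not part of the AWS SDK for PHP.)
 */
function buildJobArguments(array $params): array
{
    $arguments = [];
    foreach ($params as $name => $value) {
        $name = (string) $name;
        // Leave keys that already carry the prefix untouched.
        $key = (strpos($name, '--') === 0) ? $name : '--' . $name;
        $arguments[$key] = (string) $value;
    }
    return $arguments;
}

// Sample values standing in for $databaseName, the first table name,
// and $outputBucketUrl from the example above.
$arguments = buildJobArguments([
    'input_database' => 'doc-example-database',
    'input_table' => 'csv',
    '--output_bucket_url' => 's3://example-bucket',
]);

print_r($arguments);
```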

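Scenarios

The following code example shows how to:

  • Create a crawler that crawls a public Amazon S3 bucket and generates a database of CSV-formatted metadata.

  • List information about databases and tables in your AWS Glue Data Catalog.

  • Create a job to extract CSV data from the S3 bucket, transform the data, and load JSON-formatted output into another S3 bucket.

  • List information about job runs, view transformed data, and clean up resources.

For more information, see Tutorial: Getting started with AWS Glue Studio.

SDK for PHP
Note

There's more on GitHub. Find the complete example and learn how to set up and run in the AWS Code Examples Repository.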

    namespace Glue;

    use Aws\Glue\GlueClient;
    use Aws\S3\S3Client;
    use AwsUtilities\AWSServiceClass;
    use GuzzleHttp\Psr7\Stream;
    use Iam\IAMService;

    class GettingStartedWithGlue
    {
        public function run()
        {
            echo("\n");
            echo("--------------------------------------\n");
            print("Welcome to the AWS Glue getting started demo using PHP!\n");
            echo("--------------------------------------\n");

            $clientArgs = [
                'region' => 'us-west-2',
                'version' => 'latest',
                'profile' => 'default',
            ];
            $uniqid = uniqid();

            $glueClient = new GlueClient($clientArgs);
            $glueService = new GlueService($glueClient);
            $iamService = new IAMService();
            $crawlerName = "example-crawler-test-" . $uniqid;

            AWSServiceClass::$waitTime = 5;
            AWSServiceClass::$maxWaitAttempts = 20;

            // Create and start the crawler, then wait until it is READY again.
            $role = $iamService->getRole("AWSGlueServiceRole-DocExample");

            $databaseName = "doc-example-database-$uniqid";
            $path = 's3://crawler-public-us-east-1/flight/2016/csv';
            $glueService->createCrawler($crawlerName, $role['Role']['Arn'], $databaseName, $path);
            $glueService->startCrawler($crawlerName);

            echo "Waiting for crawler";
            do {
                $crawler = $glueService->getCrawler($crawlerName);
                echo ".";
                sleep(10);
            } while ($crawler['Crawler']['State'] != "READY");
            echo "\n";

            $database = $glueService->getDatabase($databaseName);
            echo "Found a database named " . $database['Database']['Name'] . "\n";

            // Upload the job script.
            $s3client = new S3Client($clientArgs);
            $bucketName = "test-glue-bucket-" . $uniqid;
            $s3client->createBucket([
                'Bucket' => $bucketName,
                'CreateBucketConfiguration' => ['LocationConstraint' => 'us-west-2'],
            ]);

            $s3client->putObject([
                'Bucket' => $bucketName,
                'Key' => 'run_job.py',
                'SourceFile' => __DIR__ . '/flight_etl_job_script.py'
            ]);
            $s3client->putObject([
                'Bucket' => $bucketName,
                'Key' => 'setup_scenario_getting_started.yaml',
                'SourceFile' => __DIR__ . '/setup_scenario_getting_started.yaml'
            ]);

            $tables = $glueService->getTables($databaseName);

            // Create the job, start a run, and wait for the run to end.
            $jobName = 'test-job-' . $uniqid;
            $scriptLocation = "s3://$bucketName/run_job.py";
            $job = $glueService->createJob($jobName, $role['Role']['Arn'], $scriptLocation);

            $outputBucketUrl = "s3://$bucketName";
            $runId = $glueService->startJobRun($jobName, $databaseName, $tables, $outputBucketUrl)['JobRunId'];

            echo "waiting for job";
            do {
                $jobRun = $glueService->getJobRun($jobName, $runId);
                echo ".";
                sleep(10);
            } while (!array_intersect([$jobRun['JobRun']['JobRunState']], ['SUCCEEDED', 'STOPPED', 'FAILED', 'TIMEOUT']));
            echo "\n";

            $jobRuns = $glueService->getJobRuns($jobName);

            // List the bucket contents and preview the second object.
            $objects = $s3client->listObjects([
                'Bucket' => $bucketName,
            ])['Contents'];

            foreach ($objects as $object) {
                echo $object['Key'] . "\n";
            }

            echo "Downloading " . $objects[1]['Key'] . "\n";
            /** @var Stream $downloadObject */
            $downloadObject = $s3client->getObject([
                'Bucket' => $bucketName,
                'Key' => $objects[1]['Key'],
            ])['Body']->getContents();
            echo "Here is the first 1000 characters in the object.";
            echo substr($downloadObject, 0, 1000);

            $jobs = $glueService->listJobs();
            echo "Current jobs:\n";
            foreach ($jobs['JobNames'] as $jobsName) {
                echo "{$jobsName}\n";
            }

            // Clean up the resources created by this scenario.
            echo "Delete the job.\n";
            $glueClient->deleteJob([
                'JobName' => $job['Name'],
            ]);

            echo "Delete the tables.\n";
            foreach ($tables['TableList'] as $table) {
                $glueService->deleteTable($table['Name'], $databaseName);
            }

            echo "Delete the databases.\n";
            $glueClient->deleteDatabase([
                'Name' => $databaseName,
            ]);

            echo "Delete the crawler.\n";
            $glueClient->deleteCrawler([
                'Name' => $crawlerName,
            ]);

            $deleteObjects = $s3client->listObjectsV2([
                'Bucket' => $bucketName,
            ]);
            echo "Delete all objects in the bucket.\n";
            $deleteObjects = $s3client->deleteObjects([
                'Bucket' => $bucketName,
                'Delete' => [
                    'Objects' => $deleteObjects['Contents'],
                ]
            ]);

            echo "Delete the bucket.\n";
            $s3client->deleteBucket(['Bucket' => $bucketName]);

            echo "This job was brought to you by the number $uniqid\n";
        }
    }

    // The GlueService wrapper class used by the scenario above.
    namespace Glue;

    use Aws\Glue\GlueClient;
    use Aws\Result;

    class GlueService extends \AwsUtilities\AWSServiceClass
    {
        protected GlueClient $glueClient;

        public function __construct($glueClient)
        {
            $this->glueClient = $glueClient;
        }

        public function getCrawler($crawlerName)
        {
            return $this->customWaiter(function () use ($crawlerName) {
                return $this->glueClient->getCrawler([
                    'Name' => $crawlerName,
                ]);
            });
        }

        public function createCrawler($crawlerName, $role, $databaseName, $path): Result
        {
            return $this->customWaiter(function () use ($crawlerName, $role, $databaseName, $path) {
                return $this->glueClient->createCrawler([
                    'Name' => $crawlerName,
                    'Role' => $role,
                    'DatabaseName' => $databaseName,
                    'Targets' => [
                        'S3Targets' => [[
                            'Path' => $path,
                        ]]
                    ],
                ]);
            });
        }

        public function startCrawler($crawlerName): Result
        {
            return $this->glueClient->startCrawler([
                'Name' => $crawlerName,
            ]);
        }

        public function getDatabase(string $databaseName): Result
        {
            return $this->customWaiter(function () use ($databaseName) {
                return $this->glueClient->getDatabase([
                    'Name' => $databaseName,
                ]);
            });
        }

        public function getTables($databaseName): Result
        {
            return $this->glueClient->getTables([
                'DatabaseName' => $databaseName,
            ]);
        }

        public function createJob($jobName, $role, $scriptLocation, $pythonVersion = '3', $glueVersion = '3.0'): Result
        {
            return $this->glueClient->createJob([
                'Name' => $jobName,
                'Role' => $role,
                'Command' => [
                    'Name' => 'glueetl',
                    'ScriptLocation' => $scriptLocation,
                    'PythonVersion' => $pythonVersion,
                ],
                'GlueVersion' => $glueVersion,
            ]);
        }

        public function startJobRun($jobName, $databaseName, $tables, $outputBucketUrl): Result
        {
            return $this->glueClient->startJobRun([
                'JobName' => $jobName,
                'Arguments' => [
                    'input_database' => $databaseName,
                    'input_table' => $tables['TableList'][0]['Name'],
                    'output_bucket_url' => $outputBucketUrl,
                    '--input_database' => $databaseName,
                    '--input_table' => $tables['TableList'][0]['Name'],
                    '--output_bucket_url' => $outputBucketUrl,
                ],
            ]);
        }

        public function listJobs($maxResults = null, $nextToken = null, $tags = []): Result
        {
            $arguments = [];
            if ($maxResults) {
                $arguments['MaxResults'] = $maxResults;
            }
            if ($nextToken) {
                $arguments['NextToken'] = $nextToken;
            }
            if (!empty($tags)) {
                $arguments['Tags'] = $tags;
            }
            return $this->glueClient->listJobs($arguments);
        }

        public function getJobRuns($jobName, $maxResults = 0, $nextToken = ''): Result
        {
            $arguments = ['JobName' => $jobName];
            if ($maxResults) {
                $arguments['MaxResults'] = $maxResults;
            }
            if ($nextToken) {
                $arguments['NextToken'] = $nextToken;
            }
            return $this->glueClient->getJobRuns($arguments);
        }

        public function getJobRun($jobName, $runId, $predecessorsIncluded = false): Result
        {
            return $this->glueClient->getJobRun([
                'JobName' => $jobName,
                'RunId' => $runId,
                'PredecessorsIncluded' => $predecessorsIncluded,
            ]);
        }

        public function deleteJob($jobName)
        {
            return $this->glueClient->deleteJob([
                'JobName' => $jobName,
            ]);
        }

        public function deleteTable($tableName, $databaseName)
        {
            return $this->glueClient->deleteTable([
                'DatabaseName' => $databaseName,
                'Name' => $tableName,
            ]);
        }

        public function deleteDatabase($databaseName)
        {
            return $this->glueClient->deleteDatabase([
                'Name' => $databaseName,
            ]);
        }

        public function deleteCrawler($crawlerName)
        {
            return $this->glueClient->deleteCrawler([
                'Name' => $crawlerName,
            ]);
        }
    }
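
The clean-up step in the scenario passes the `Contents` entries from the object listing straight into `deleteObjects`. A delete request identifies each object by its `Key` (and optionally a version ID), so a more explicit variant could reduce every listing entry to just that field first. The sketch below shows the mapping on stubbed data. The helper name `toDeleteIdentifiers` and the sample listing are assumptions for illustration.

```php
<?php

/**
 * Reduces object listing entries to the ['Key' => ...] shape used in the
 * 'Objects' list of a delete request.
 * (Hypothetical helper, not part of the AWS SDK for PHP.)
 */
function toDeleteIdentifiers(array $contents): array
{
    return array_map(
        function (array $object): array {
            return ['Key' => $object['Key']];
        },
        $contents
    );
}

// Stubbed listing entries standing in for the 'Contents' of a listObjectsV2 result.
$contents = [
    ['Key' => 'run_job.py', 'Size' => 1024, 'StorageClass' => 'STANDARD'],
    ['Key' => 'setup_scenario_getting_started.yaml', 'Size' => 512, 'StorageClass' => 'STANDARD'],
];

$identifiers = toDeleteIdentifiers($contents);
print_r($identifiers);
```

The result could then be supplied as `'Delete' => ['Objects' => $identifiers]` in the `deleteObjects` call.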