在範例資料上使用 AWS SDK - Amazon Comprehend

本文為英文版的機器翻譯版本,如內容有任何歧義或不一致之處,概以英文版為準。

在範例資料上使用 AWS SDK

以下程式碼範例顯示做法:

  • 針對範例資料執行 Amazon Comprehend 主題建模任務。

  • 取得有關工作的資訊。

  • 從 Amazon S3 擷取任務輸出資料。

Python
SDK對於 Python(肉毒桿菌 3)
注意

還有更多關於 GitHub。尋找完整範例,並瞭解如何在 AWS 代碼示例存儲庫

建立包裝類別以呼叫 Amazon Comprehend 主題建模動作。

class ComprehendTopicModeler: """Encapsulates a Comprehend topic modeler.""" def __init__(self, comprehend_client): """ :param comprehend_client: A Boto3 Comprehend client. """ self.comprehend_client = comprehend_client def start_job( self, job_name, input_bucket, input_key, input_format, output_bucket, output_key, data_access_role_arn, ): """ Starts a topic modeling job. Input is read from the specified Amazon S3 input bucket and written to the specified output bucket. Output data is stored in a tar archive compressed in gzip format. The job runs asynchronously, so you can call `describe_topics_detection_job` to get job status until it returns a status of SUCCEEDED. :param job_name: The name of the job. :param input_bucket: An Amazon S3 bucket that contains job input. :param input_key: The prefix used to find input data in the input bucket. If multiple objects have the same prefix, all of them are used. :param input_format: The format of the input data, either one document per file or one document per line. :param output_bucket: The Amazon S3 bucket where output data is written. :param output_key: The prefix prepended to the output data. :param data_access_role_arn: The Amazon Resource Name (ARN) of a role that grants Comprehend permission to read from the input bucket and write to the output bucket. :return: Information about the job, including the job ID. """ try: response = self.comprehend_client.start_topics_detection_job( JobName=job_name, DataAccessRoleArn=data_access_role_arn, InputDataConfig={ "S3Uri": f"s3://{input_bucket}/{input_key}", "InputFormat": input_format.value, }, OutputDataConfig={"S3Uri": f"s3://{output_bucket}/{output_key}"}, ) logger.info("Started topic modeling job %s.", response["JobId"]) except ClientError: logger.exception("Couldn't start topic modeling job.") raise else: return response def describe_job(self, job_id): """ Gets metadata about a topic modeling job. :param job_id: The ID of the job to look up. :return: Metadata about the job. """ try: response = self.comprehend_client.describe_topics_detection_job( JobId=job_id ) job = response["TopicsDetectionJobProperties"] logger.info("Got topic detection job %s.", job_id) except ClientError: logger.exception("Couldn't get topic detection job %s.", job_id) raise else: return job def list_jobs(self): """ Lists topic modeling jobs for the current account. :return: The list of jobs. """ try: response = self.comprehend_client.list_topics_detection_jobs() jobs = response["TopicsDetectionJobPropertiesList"] logger.info("Got %s topic detection jobs.", len(jobs)) except ClientError: logger.exception("Couldn't get topic detection jobs.") raise else: return jobs

使用包裝函式類別來執行主題模型工作並取得工作資料。

def usage_demo(): print("-" * 88) print("Welcome to the Amazon Comprehend topic modeling demo!") print("-" * 88) logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s") input_prefix = "input/" output_prefix = "output/" demo_resources = ComprehendDemoResources( boto3.resource("s3"), boto3.resource("iam") ) topic_modeler = ComprehendTopicModeler(boto3.client("comprehend")) print("Setting up storage and security resources needed for the demo.") demo_resources.setup("comprehend-topic-modeler-demo") print("Copying sample data from public bucket into input bucket.") demo_resources.bucket.copy( {"Bucket": "public-sample-us-west-2", "Key": "TopicModeling/Sample.txt"}, f"{input_prefix}sample.txt", ) print("Starting topic modeling job on sample data.") job_info = topic_modeler.start_job( "demo-topic-modeling-job", demo_resources.bucket.name, input_prefix, JobInputFormat.per_line, demo_resources.bucket.name, output_prefix, demo_resources.data_access_role.arn, ) print( f"Waiting for job {job_info['JobId']} to complete. This typically takes " f"20 - 30 minutes." ) job_waiter = JobCompleteWaiter(topic_modeler.comprehend_client) job_waiter.wait(job_info["JobId"]) job = topic_modeler.describe_job(job_info["JobId"]) print(f"Job {job['JobId']} complete:") pprint(job) print( f"Getting job output data from the output Amazon S3 bucket: " f"{job['OutputDataConfig']['S3Uri']}." ) job_output = demo_resources.extract_job_output(job) lines = 10 print(f"First {lines} lines of document topics output:") pprint(job_output["doc-topics.csv"]["data"][:lines]) print(f"First {lines} lines of terms output:") pprint(job_output["topic-terms.csv"]["data"][:lines]) print("Cleaning up resources created for the demo.") demo_resources.cleanup() print("Thanks for watching!") print("-" * 88)

有關的完整列表 AWS SDK開發人員指南和代碼示例,請參閱使用 Amazon Comprehend 與 SDK AWS。本主題也包含有關入門的資訊以及舊SDK版的詳細資訊。