Extract and query AWS IoT SiteWise metadata attributes in a data lake

Created by Ambarish Dongaonkar (AWS)

Environment: Production

Technologies: IoT; Analytics; Big data

AWS services: AWS IoT SiteWise; AWS Lambda; AWS Glue

Summary

AWS IoT SiteWise uses asset models and hierarchies to represent your industrial equipment, processes, and facilities. Each model or asset can have multiple attributes that are specific to your environment. Example metadata attributes include the site or physical location of the asset, plant details, and equipment identifiers. These attribute values complement asset measurement data to maximize the business value. Machine learning (ML) can provide additional insights into this metadata and streamline engineering tasks.

However, metadata attributes can’t be queried directly from the AWS IoT SiteWise service. To make the attributes queryable, you must extract and ingest them into a data lake. This pattern uses a Python script to extract the attributes for all AWS IoT SiteWise assets and ingest them into a data lake in an Amazon Simple Storage Service (Amazon S3) bucket. When you have completed this process, you can use SQL queries in Amazon Athena to access the AWS IoT SiteWise metadata attributes along with other datasets, such as measurement datasets. The metadata attribute information is also useful when working with AWS IoT SiteWise Monitor portals or dashboards. You can also build an Amazon QuickSight dashboard by using the extracted attributes in the S3 bucket.

The pattern includes reference code, and you can implement the code by using the compute service that best fits your use case, such as AWS Lambda or AWS Glue.

Prerequisites and limitations

Prerequisites 

  • An active AWS account.

  • Permissions to set up AWS Lambda functions or AWS Glue jobs.

  • An Amazon S3 bucket.

  • Asset models and hierarchies set up in AWS IoT SiteWise. For more information, see Creating asset models (AWS IoT SiteWise documentation).

Architecture

You can use a Lambda function or an AWS Glue job to complete this process. We recommend using Lambda if you have fewer than 100 models and each model has an average of 15 or fewer attributes. For all other use cases, we recommend using AWS Glue.

The solution architecture and workflow are shown in the following diagram.

Architecture diagram showing the extraction and query process described.
  1. The scheduled AWS Glue job or Lambda function runs. It extracts the asset metadata attributes from AWS IoT SiteWise and ingests them into an S3 bucket.

  2. An AWS Glue crawler crawls the extracted data in the S3 bucket and creates tables in the AWS Glue Data Catalog.

  3. Using standard SQL, Amazon Athena queries the tables in the AWS Glue Data Catalog.

Automation and scale

You can schedule the Lambda function or AWS Glue job to run daily or weekly, according to the update frequency of your AWS IoT SiteWise asset configurations.
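
For example, the following is a minimal Boto3 sketch that schedules a Lambda deployment with an EventBridge rule. The rule name, function name, and ARN are hypothetical placeholders. For an AWS Glue job, you would instead create a scheduled trigger (create_trigger with a Schedule expression) or set a schedule in the AWS Glue console.

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Create (or update) a rule that fires once a day.
rule_arn = events.put_rule(
    Name='sitewise-attribute-extract-daily',   # hypothetical rule name
    ScheduleExpression='rate(1 day)',          # use 'rate(7 days)' for weekly runs
)['RuleArn']

# Point the rule at the extraction function (hypothetical ARN).
events.put_targets(
    Rule='sitewise-attribute-extract-daily',
    Targets=[{'Id': 'extract-fn',
              'Arn': 'arn:aws:lambda:us-east-1:111122223333:function:sitewise-attribute-extract'}],
)

# Allow EventBridge to invoke the function.
lambda_client.add_permission(
    FunctionName='sitewise-attribute-extract',
    StatementId='allow-eventbridge-daily',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_arn,
)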

There is no limit to the number of AWS IoT SiteWise assets that the sample code can process, but a large number of assets can increase the amount of time required to complete the process.

Tools

  • Amazon Athena is an interactive query service that helps you analyze data directly in Amazon Simple Storage Service (Amazon S3) by using standard SQL.

  • AWS Glue is a fully managed extract, transform, and load (ETL) service. It helps you reliably categorize, clean, enrich, and move data between data stores and data streams.

  • AWS Identity and Access Management (IAM) helps you securely manage access to your AWS resources by controlling who is authenticated and authorized to use them.

  • AWS IoT SiteWise helps you collect, model, analyze, and visualize data from industrial equipment at scale.

  • AWS Lambda is a compute service that helps you run code without needing to provision or manage servers. It runs your code only when needed and scales automatically, so you pay only for the compute time that you use.

  • Amazon Simple Storage Service (Amazon S3) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.

  • AWS SDK for Python (Boto3) is a software development kit that helps you integrate your Python application, library, or script with AWS services.

Epics

Task | Description | Skills required

Configure permissions in IAM.

In the IAM console, grant permissions to the IAM role assumed by the Lambda function or AWS Glue job to do the following:

  • Read from the AWS IoT SiteWise service

  • Write to the S3 bucket

For more information, see Creating a role for an AWS service (IAM documentation). A minimal policy sketch follows this task.

General AWS
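
The following is a minimal sketch, assuming the role already exists. The role name, policy name, and bucket are hypothetical, and you should scope the resources to your environment.

import json

import boto3

iam = boto3.client('iam')

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Read asset model and asset metadata from AWS IoT SiteWise.
            "Effect": "Allow",
            "Action": [
                "iotsitewise:ListAssetModels",
                "iotsitewise:DescribeAssetModel",
                "iotsitewise:ListAssets",
                "iotsitewise:DescribeAsset",
                "iotsitewise:GetAssetPropertyValue",
            ],
            "Resource": "*",
        },
        {
            # Write the extracted CSV file to the data lake bucket (hypothetical name).
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*",
        },
    ],
}

iam.put_role_policy(
    RoleName='sitewise-extract-role',        # hypothetical role name
    PolicyName='sitewise-metadata-extract',
    PolicyDocument=json.dumps(policy),
)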

Create the Lambda function or AWS Glue job.

If you are using Lambda, create a new Lambda function. For Runtime, choose Python. For more information, see Building Lambda functions with Python (Lambda documentation).

If you are using AWS Glue, create a new Python shell job in the AWS Glue console. For more information, see Adding Python shell jobs (AWS Glue documentation). A Boto3 sketch for creating the job follows this task.

General AWS
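
If you prefer to script the AWS Glue option, the following Boto3 sketch creates a Python shell job. The job name, role, and script location are hypothetical.

import boto3

glue = boto3.client('glue')

glue.create_job(
    Name='sitewise-attribute-extract',      # hypothetical job name
    Role='sitewise-extract-role',           # IAM role from the first task
    Command={
        'Name': 'pythonshell',              # Python shell job, not a Spark job
        'ScriptLocation': 's3://amzn-s3-demo-bucket/scripts/extract_attributes.py',
        'PythonVersion': '3.9',
    },
    MaxCapacity=0.0625,                     # smallest Python shell capacity
)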

Update the Lambda function or AWS Glue job.

Open the new Lambda function or AWS Glue job, and enter the code sample from the Additional information section. Modify the code as needed for your use case. For more information, see Edit code using the console editor (Lambda documentation) and Working with scripts (AWS Glue documentation).

General AWS

Task | Description | Skills required

Run the Lambda function or AWS Glue job.

Run the Lambda function or AWS Glue job. The job extracts the metadata attributes for the assets and models in the AWS IoT SiteWise hierarchy and stores them in the specified S3 bucket. For more information, see Invoke the Lambda function (Lambda documentation) or Starting jobs using triggers (AWS Glue documentation). A sketch for starting either option follows this task.

General AWS
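
The following sketch starts either compute option on demand; the function and job names are the hypothetical ones from the earlier tasks.

import boto3

# Option 1: invoke the Lambda function.
boto3.client('lambda').invoke(FunctionName='sitewise-attribute-extract')

# Option 2: start the AWS Glue Python shell job.
boto3.client('glue').start_job_run(JobName='sitewise-attribute-extract')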

Set up an AWS Glue crawler.

Set up an AWS Glue crawler with the necessary format classifier for CSV files. Use the same S3 bucket and prefix that the Lambda function or AWS Glue job writes to. For more information, see Defining crawlers (AWS Glue documentation). A Boto3 sketch follows this task.

General AWS
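
The following Boto3 sketch creates the crawler. The crawler name, role, database name, and S3 path are hypothetical; the path must match the bucket and prefix that the extraction code writes to, and the built-in CSV classifier is assumed to be sufficient.

import boto3

glue = boto3.client('glue')

glue.create_crawler(
    Name='sitewise-attributes-crawler',      # hypothetical crawler name
    Role='sitewise-extract-role',            # also needs s3:GetObject on the prefix
    DatabaseName='sitewise_metadata',        # hypothetical Data Catalog database
    Targets={'S3Targets': [{'Path': 's3://amzn-s3-demo-bucket/sitewise-attributes/'}]},
)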

Run the AWS Glue crawler.

Run the crawler to process the data file created by the Lambda function or AWS Glue job. The crawler creates a table in the specified AWS Glue Data Catalog database. For more information, see Starting crawlers using triggers (AWS Glue documentation). A sketch follows this task.

General AWS
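
A minimal sketch that starts the crawler and polls until it returns to the READY state; the crawler name is the hypothetical one from the previous task.

import time

import boto3

glue = boto3.client('glue')
glue.start_crawler(Name='sitewise-attributes-crawler')

# Poll until the crawl finishes; a small dataset typically takes a few minutes.
while glue.get_crawler(Name='sitewise-attributes-crawler')['Crawler']['State'] != 'READY':
    time.sleep(30)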

Query the metadata attributes.

In Amazon Athena, use standard SQL to query the tables in the AWS Glue Data Catalog as needed for your use case. You can join the metadata attribute table with other databases and tables. For more information, see Getting started (Amazon Athena documentation). A sketch that runs a query through the Athena API follows this task.

General AWS
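
For example, the following sketch runs a query through the Athena API. The database, table, and result location are hypothetical and depend on the names that the crawler created; the columns match the CSV header written by the extraction code.

import boto3

athena = boto3.client('athena')

athena.start_query_execution(
    QueryString="""
        SELECT asset_name, attribute_name, attribute_value
        FROM sitewise_attributes          -- hypothetical table created by the crawler
        WHERE attribute_name = 'site'
    """,
    QueryExecutionContext={'Database': 'sitewise_metadata'},   # hypothetical database
    ResultConfiguration={'OutputLocation': 's3://amzn-s3-demo-bucket/athena-results/'},
)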

Additional information

Code

The sample code provided is for reference, and you can customize this code as needed for your use case.

# The following code can be used in an AWS Lambda function or in an AWS Glue Python shell job.
# The IAM role used for the job needs read access to the AWS IoT SiteWise service
# and write access to the S3 bucket.
import io

import boto3

sw_client = boto3.client('iotsitewise')
s3_client = boto3.client('s3')

bucket = '{s3_bucket_name}'
prefix = '{s3_bucket_prefix}'

output = io.StringIO()
output.write("model_id,model_name,asset_id,asset_name,attribute_id,attribute_name,attribute_value\n")

# Paginate so that accounts with many models and assets are processed completely.
model_paginator = sw_client.get_paginator('list_asset_models')
for model_page in model_paginator.paginate():
    for m_rec in model_page['assetModelSummaries']:
        model_id = m_rec['id']
        model_name = m_rec['name']

        # Collect the names of the attribute-type properties defined on this model.
        attribute_list = []
        dam_response = sw_client.describe_asset_model(assetModelId=model_id)
        for rec in dam_response['assetModelProperties']:
            if 'attribute' in rec['type']:
                attribute_list.append(rec['name'])

        # Visit every asset created from this model and read its attribute values.
        asset_paginator = sw_client.get_paginator('list_assets')
        for asset_page in asset_paginator.paginate(assetModelId=model_id, filter='ALL'):
            for asset in asset_page['assetSummaries']:
                asset_id = asset['id']
                asset_name = asset['name']
                resp = sw_client.describe_asset(assetId=asset_id)
                for rec in resp['assetProperties']:
                    if rec['name'] in attribute_list:
                        p_resp = sw_client.get_asset_property_value(assetId=asset_id, propertyId=rec['id'])
                        value = p_resp.get('propertyValue', {}).get('value', {})
                        # The value is returned as exactly one of these variant keys.
                        for value_key in ('stringValue', 'doubleValue', 'integerValue', 'booleanValue'):
                            if value_key in value:
                                # Simple CSV output; assumes names and values contain no commas.
                                output.write(','.join([model_id, model_name, asset_id, asset_name,
                                                       rec['id'], rec['name'], str(value[value_key])]) + "\n")

output.seek(0)
s3_client.put_object(Bucket=bucket, Key=prefix + '/data.csv', Body=output.getvalue())
output.close()