Streaming data to tables with Amazon Data Firehose

Amazon Data Firehose is a fully managed service for delivering real-time streaming data to destinations such as Amazon S3, Amazon Redshift, Amazon OpenSearch Service, Splunk, Apache Iceberg tables, and custom HTTP endpoints or HTTP endpoints owned by supported third-party service providers. With Amazon Data Firehose, you don't need to write applications or manage resources. You configure your data producers to send data to Firehose, and it automatically delivers the data to the destination that you specified. You can also configure Firehose to transform your data before delivering it. To learn more about Amazon Data Firehose, see What is Amazon Data Firehose?

After you integrate your table buckets with AWS analytics services, you do the following:

  1. Configure Firehose to deliver data into your S3 tables. To do so, you create an AWS Identity and Access Management (IAM) service role that allows Firehose to access your tables.

  2. Create a resource link to your table or table's namespace.

  3. Grant the Firehose service role explicit permissions to your table or table's namespace by granting permissions on the resource link.

  4. Create a Firehose stream that routes data to your table.

Creating a role for Firehose to use S3 tables as a destination

Firehose needs an IAM service role with specific permissions to access AWS Glue tables and write data to S3 tables. You provide this IAM role when you create a Firehose stream.

  1. Open the IAM console at https://console.aws.amazon.com/iam/.

  2. In the left navigation pane, choose Policies.

  3. Choose Create policy, and then choose JSON in the policy editor.

  4. Add the following policy, which grants permissions to all databases and tables in your Data Catalog. If you want, you can grant permissions only to specific tables and databases. To use this policy, replace the user input placeholders with your own information.

    { "Version": "2012-10-17", "Statement": [ { "Sid": "S3TableAccessViaGlueFederation", "Effect": "Allow", "Action": [ "glue:GetTable", "glue:GetDatabase", "glue:UpdateTable" ], "Resource": [ "arn:aws:glue:region:account-id:catalog/s3tablescatalog/*", "arn:aws:glue:region:account-id:catalog/s3tablescatalog", "arn:aws:glue:region:account-id:catalog", "arn:aws:glue:region:account-id:database/*", "arn:aws:glue:region:account-id:table/*/*" ] }, { "Sid": "S3DeliveryErrorBucketPermission", "Effect": "Allow", "Action": [ "s3:AbortMultipartUpload", "s3:GetBucketLocation", "s3:GetObject", "s3:ListBucket", "s3:ListBucketMultipartUploads", "s3:PutObject" ], "Resource": [ "arn:aws:s3:::error delivery bucket", "arn:aws:s3:::error delivery bucket/*" ] }, { "Sid": "RequiredWhenUsingKinesisDataStreamsAsSource", "Effect": "Allow", "Action": [ "kinesis:DescribeStream", "kinesis:GetShardIterator", "kinesis:GetRecords", "kinesis:ListShards" ], "Resource": "arn:aws:kinesis:region:account-id:stream/stream-name" }, { "Sid": "RequiredWhenDoingMetadataReadsANDDataAndMetadataWriteViaLakeformation", "Effect": "Allow", "Action": [ "lakeformation:GetDataAccess" ], "Resource": "*" }, { "Sid": "RequiredWhenUsingKMSEncryptionForS3ErrorBucketDelivery", "Effect": "Allow", "Action": [ "kms:Decrypt", "kms:GenerateDataKey" ], "Resource": [ "arn:aws:kms:region:account-id:key/KMS-key-id" ], "Condition": { "StringEquals": { "kms:ViaService": "s3.region.amazonaws.com" }, "StringLike": { "kms:EncryptionContext:aws:s3:arn": "arn:aws:s3:::error delivery bucket/prefix*" } } }, { "Sid": "LoggingInCloudWatch", "Effect": "Allow", "Action": [ "logs:PutLogEvents" ], "Resource": [ "arn:aws:logs:region:account-id:log-group:log-group-name:log-stream:log-stream-name" ] }, { "Sid": "RequiredWhenAttachingLambdaToFirehose", "Effect": "Allow", "Action": [ "lambda:InvokeFunction", "lambda:GetFunctionConfiguration" ], "Resource": [ "arn:aws:lambda:region:account-id:function:function-name:function-version" ] } ] }

    This policy includes statements that allow access to Kinesis Data Streams, invocation of Lambda functions, and access to AWS KMS keys. If you don't use any of these resources, you can remove the corresponding statements.

    If error logging is enabled, Firehose also sends data delivery errors to your CloudWatch log group and log streams. To use error logging, you must configure log group and log stream names. For the log group and log stream names to use, see Monitor Amazon Data Firehose Using CloudWatch Logs.

  5. After you create the policy, create an IAM role with AWS service as the Trusted entity type.

  6. For Service or use case, choose Kinesis. For Use case, choose Kinesis Firehose.

  7. Choose Next, and then select the policy you created earlier.

  8. Give your role a name. Review your role details, and choose Create role. The role will have the following trust policy.

    { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "sts:AssumeRole" ], "Principal": { "Service": [ "firehose.amazonaws.com" ] } } ] }

Creating a resource link to your table namespace

To access your tables, Amazon Data Firehose needs a resource link that targets your table's namespace. A resource link is a Data Catalog object that acts as an alias or pointer to another Data Catalog resource, such as a database or table. The link is stored in the Data Catalog of the account or Region where it's created. For more information, see How resource links work in the AWS Lake Formation Developer Guide.

After you've integrated your table buckets with the AWS analytics services, you can create resource links to work with your tables in Firehose.

You create resource links to your table namespaces, and then provide the name of the link to Firehose so that Firehose can work with the linked tables.

The following AWS CLI command creates a resource link that you can use to connect your S3 tables to Firehose. To use this example command, replace the user input placeholders with your own information.

aws glue create-database --region us-east-1 \
    --catalog-id "111122223333" \
    --database-input \
    '{
        "Name": "resource-link-name",
        "TargetDatabase": {
            "CatalogId": "111122223333:s3tablescatalog/amzn-s3-demo-table-bucket",
            "DatabaseName": "my_namespace"
        },
        "CreateTableDefaultPermissions": []
    }'
Note

You must separately grant permissions to both the resource link and the target (linked) namespace. For more information, see Granting permission on a resource link.
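To confirm that the resource link was created and points to the intended namespace, you can describe it with the AWS CLI. This check is optional; the catalog ID and resource link name below are the same placeholders used in the example above.

aws glue get-database \
    --catalog-id "111122223333" \
    --name "resource-link-name"

The response includes a TargetDatabase element that identifies the namespace that the link points to.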

Granting the Firehose service role permissions on the resource link

When you use a resource link to access your tables, you must separately grant permissions to both the resource link and the target (linked) namespace or table. You can grant an IAM principal Lake Formation permissions on a resource link that targets your table namespace through either the Lake Formation console or the AWS CLI.

Console
  1. Open the AWS Lake Formation console at https://console.aws.amazon.com/lakeformation/, and sign in as a data lake administrator. For more information on how to create a data lake administrator, see Create a data lake administrator in the AWS Lake Formation Developer Guide.

  2. In the navigation pane, choose Data permissions, and then choose Grant.

  3. On the Grant Permissions page, under Principals, choose IAM users and roles, and select the service role that you created to stream to tables.

  4. Under LF-Tags or catalog resources, choose Named Data Catalog resources.

  5. For Catalogs, choose your account ID, which is the Default catalog.

  6. For Databases, choose the resource link that you created for your table namespace.

  7. For Resource link permissions, choose Describe.

  8. Choose Grant.

CLI
  1. Make sure that you're running AWS CLI commands as a data lake administrator. For more information, see Create a data lake administrator in the AWS Lake Formation Developer Guide.

  2. Run the following command to grant Lake Formation DESCRIBE permission on the resource link to an IAM principal so that the principal can access the linked tables. To use this example, replace the user input placeholders with your own information. For database-name, use the name of the resource link that you created for your table namespace. The DataLakePrincipalIdentifier value can be either an IAM user or role ARN.

    aws lakeformation grant-permissions \
        --principal DataLakePrincipalIdentifier=arn:aws:iam::account-id:role/role-name \
        --resource Database='{CatalogId=account-id, Name=database-name}' \
        --permissions DESCRIBE
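Because permissions on a resource link don't extend to its target, you typically also grant the service role permissions on the underlying namespace in your table bucket's catalog. The following sketch shows one way to do that with the same command; the CatalogId format for identifying an S3 Tables catalog (account-id:s3tablescatalog/bucket-name) is an assumption here, so verify it against the S3 Tables integration documentation, and adjust the permissions (for example, DESCRIBE plus any data permissions that your stream needs) to your use case.

aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=arn:aws:iam::account-id:role/role-name \
    --resource Database='{CatalogId=account-id:s3tablescatalog/amzn-s3-demo-table-bucket, Name=my_namespace}' \
    --permissions DESCRIBE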

Setting up a Firehose stream to S3 tables

The following procedure shows how to set up a Firehose stream that delivers data to S3 tables by using the console. Setting up a Firehose stream to S3 tables has the following prerequisites.

Prerequisites

To provide routing information to Firehose when you configure a stream, you use the name of the resource link that you created for your namespace as the database name, and the name of a table in that namespace as the table name. You can use these values in the Unique key section of a Firehose stream configuration to route data to a single table. You can also use these values to route data to a table by using JSONQuery expressions. For more information, see Route incoming records to a single Iceberg table.

To set up a Firehose stream to S3 tables (Console)
  1. Open the Firehose console at https://console.aws.amazon.com/firehose/.

  2. Choose Create Firehose stream.

  3. For Source, choose one of the following sources:

    • Amazon Kinesis Data Streams

    • Amazon MSK

    • Direct PUT

  4. For Destination, choose Apache Iceberg Tables.

  5. Enter a Firehose stream name.

  6. Configure your Source settings.

  7. For Destination settings, select Current Account and the AWS Region of the tables that you want to stream to.

  8. Configure the database and table names by using the Unique key configuration, JSONQuery expressions, or a Lambda function. For more information, see Route incoming records to a single Iceberg table and Route incoming records to different Iceberg tables in the Amazon Data Firehose Developer Guide.

  9. Under Backup settings, specify an S3 backup bucket.

  10. For Existing IAM roles under Advanced settings, select the IAM role you created for Firehose.

  11. Choose Create Firehose stream.

For more information about the other settings that you can configure for a stream, see Set up the Firehose stream in the Amazon Data Firehose Developer Guide.
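If you script your setup, you can also create the stream with the AWS CLI. The following is a minimal sketch of a create-delivery-stream call for the Apache Iceberg Tables destination with Direct PUT as the source. The configuration field names (CatalogConfiguration, DestinationTableConfigurationList, S3Configuration, and so on) are written here as an assumption about the Firehose CreateDeliveryStream API, so verify them against the current API reference, and replace the ARNs, names, and Region with your own values. Note that the database name is the resource link name, as described in the prerequisites above.

aws firehose create-delivery-stream \
    --delivery-stream-name my-s3-tables-stream \
    --delivery-stream-type DirectPut \
    --iceberg-destination-configuration '{
        "RoleARN": "arn:aws:iam::account-id:role/FirehoseS3TablesRole",
        "CatalogConfiguration": {
            "CatalogARN": "arn:aws:glue:region:account-id:catalog"
        },
        "DestinationTableConfigurationList": [
            {
                "DestinationDatabaseName": "resource-link-name",
                "DestinationTableName": "my_table"
            }
        ],
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::account-id:role/FirehoseS3TablesRole",
            "BucketARN": "arn:aws:s3:::error-delivery-bucket"
        }
    }'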