CloudWatch metric stream output in JSON format

In a CloudWatch metric stream that uses the JSON format, each Firehose record contains multiple JSON objects separated by a newline character (\n). Each object includes a single data point of a single metric.
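As a minimal sketch of how a consumer might read this layout, the following Python splits one record payload on newline characters and parses each line as its own JSON object. The function name `split_record` and the two-line sample payload are hypothetical and exist only for illustration; the sketch also assumes the payload has already been base64-decoded into bytes.

```python
import json


def split_record(record_data: bytes) -> list:
    """Decode one Firehose record payload into a list of data-point dicts."""
    text = record_data.decode("utf-8")
    # Each non-empty line is one JSON object holding a single data point.
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Hypothetical payload with two data points, for illustration only.
sample = (
    b'{"metric_name": "DiskWriteOps", "value": {"count": 3.0, "sum": 20.0}}\n'
    b'{"metric_name": "CPUUtilization", "value": {"count": 1.0, "sum": 42.5}}\n'
)

points = split_record(sample)
print([p["metric_name"] for p in points])
```

Blank lines are skipped so that a trailing newline at the end of the payload does not cause a parse error.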

The JSON format is fully compatible with AWS Glue and Amazon Athena. If you have a Firehose delivery stream and a correctly formatted AWS Glue table, Firehose can automatically convert the data to Parquet format or Optimized Row Columnar (ORC) format before storing it in Amazon S3. For more information about transforming the format, see Converting Your Input Record Format in Firehose. For more information about the correct format for AWS Glue, see the section Which AWS Glue schema should I use for JSON output format?

In the JSON format, the valid values for unit are the same as for the value of unit in the MetricDatum API structure. For more information, see MetricDatum. The value for the timestamp field is in epoch milliseconds, such as 1616004674229.
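Because the timestamp is expressed in epoch milliseconds rather than seconds, it needs scaling before most date libraries can use it. The following Python sketch converts the sample value from the paragraph above into a timezone-aware UTC datetime. The helper name `to_utc_datetime` is hypothetical; integer millisecond arithmetic is used here to avoid floating-point rounding.

```python
from datetime import datetime, timedelta, timezone

# The Unix epoch as a timezone-aware UTC datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_utc_datetime(epoch_ms: int) -> datetime:
    """Convert an epoch-milliseconds timestamp to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=epoch_ms)


ts = to_utc_datetime(1616004674229)
print(ts.isoformat())
```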

The following is an example of the format. In this example, the JSON is formatted for easy reading, but in practice each JSON object is on a single line.


{
    "metric_stream_name": "MyMetricStream",
    "account_id": "1234567890",
    "region": "us-east-1",
    "namespace": "AWS/EC2",
    "metric_name": "DiskWriteOps",
    "dimensions": {
        "InstanceId": "i-123456789012"
    },
    "timestamp": 1611929698000,
    "value": {
        "count": 3.0,
        "sum": 20.0,
        "max": 18.0,
        "min": 0.0,
        "p99": 17.56,
        "p99.9": 17.8764,
        "TM(25%:75%)": 16.43
    },
    "unit": "Seconds"
}
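As a small illustration of working with the value object, the following Python sketch reads the statistics from a single-line version of the example above. It derives an average as sum divided by count, and it reads the additional statistics by their exact key strings, because keys such as "p99.9" contain characters that rule out attribute-style access. Treating a missing or zero count as "no average" is an assumption of this sketch.

```python
import json

# The example data point from above, collapsed onto one line as it
# would appear in practice (some fields omitted for brevity).
line = (
    '{"metric_name": "DiskWriteOps", "timestamp": 1611929698000, '
    '"value": {"count": 3.0, "sum": 20.0, "max": 18.0, "min": 0.0, '
    '"p99": 17.56, "p99.9": 17.8764, "TM(25%:75%)": 16.43}, '
    '"unit": "Seconds"}'
)

point = json.loads(line)
stats = point["value"]

# Average is derived, not stored: sum / count (guarding against count == 0).
average = stats["sum"] / stats["count"] if stats.get("count") else None

# Additional statistics are looked up by their literal key strings.
p999 = stats.get("p99.9")
trimmed_mean = stats.get("TM(25%:75%)")

print(average, p999, trimmed_mean)
```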

Which AWS Glue schema should I use for JSON output format?

The following is an example of a JSON representation of the StorageDescriptor for an AWS Glue table, which would then be used by Firehose. For more information about StorageDescriptor, see StorageDescriptor.


{
  "Columns": [
    {
      "Name": "metric_stream_name",
      "Type": "string"
    },
    {
      "Name": "account_id",
      "Type": "string"
    },
    {
      "Name": "region",
      "Type": "string"
    },
    {
      "Name": "namespace",
      "Type": "string"
    },
    {
      "Name": "metric_name",
      "Type": "string"
    },
    {
      "Name": "timestamp",
      "Type": "timestamp"
    },
    {
      "Name": "dimensions",
      "Type": "map<string,string>"
    },
    {
      "Name": "value",
      "Type": "struct<min:double,max:double,count:double,sum:double,p99:double,p99.9:double>"
    },
    {
      "Name": "unit",
      "Type": "string"
    }
  ],
  "Location": "s3://amzn-s3-demo-bucket/",
  "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
  "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
  "SerdeInfo": {
    "SerializationLibrary": "org.apache.hive.hcatalog.data.JsonSerDe"
  },
  "Parameters": {
    "classification": "json"
  }
}
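If you prefer to generate this descriptor programmatically, the following Python sketch builds the same structure as a dictionary. The function name `build_storage_descriptor` and the `COLUMNS` constant are hypothetical. The sketch only constructs and prints the dictionary; passing it to AWS Glue (for example, as the StorageDescriptor of a table definition in an SDK call) is left out and would be an additional step under your own configuration.

```python
import json

# Column names and types copied from the StorageDescriptor example above.
COLUMNS = [
    ("metric_stream_name", "string"),
    ("account_id", "string"),
    ("region", "string"),
    ("namespace", "string"),
    ("metric_name", "string"),
    ("timestamp", "timestamp"),
    ("dimensions", "map<string,string>"),
    ("value", "struct<min:double,max:double,count:double,"
              "sum:double,p99:double,p99.9:double>"),
    ("unit", "string"),
]


def build_storage_descriptor(location: str) -> dict:
    """Return a JSON-format StorageDescriptor dict for the given S3 location."""
    return {
        "Columns": [{"Name": name, "Type": col_type} for name, col_type in COLUMNS],
        "Location": location,
        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hive.hcatalog.data.JsonSerDe"
        },
        "Parameters": {"classification": "json"},
    }


descriptor = build_storage_descriptor("s3://amzn-s3-demo-bucket/")
print(json.dumps(descriptor, indent=2))
```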

The preceding example is for data written to Amazon S3 in JSON format. To store the data in Parquet format or Optimized Row Columnar (ORC) format instead, replace the values of the following fields as indicated.

Parquet:
- InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
- OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
- SerdeInfo.SerializationLibrary: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
- Parameters.classification: parquet

ORC:
- InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
- OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
- SerdeInfo.SerializationLibrary: org.apache.hadoop.hive.ql.io.orc.OrcSerde
- Parameters.classification: orc
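The substitutions listed above can be sketched as a small Python helper that takes a JSON-format descriptor and returns a copy with the four fields swapped for the chosen format. The names `FORMAT_OVERRIDES` and `with_output_format` are hypothetical, and the base descriptor is abbreviated to the fields that change. The key names follow the capitalization used in the StorageDescriptor JSON example.

```python
import copy

# Replacement values for each columnar format, taken from the lists above.
FORMAT_OVERRIDES = {
    "parquet": {
        "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
        "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
    },
    "orc": {
        "InputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
        "SerializationLibrary": "org.apache.hadoop.hive.ql.io.orc.OrcSerde",
    },
}


def with_output_format(descriptor: dict, fmt: str) -> dict:
    """Return a copy of the descriptor adjusted for 'parquet' or 'orc'."""
    overrides = FORMAT_OVERRIDES[fmt]  # raises KeyError for unknown formats
    updated = copy.deepcopy(descriptor)  # leave the caller's dict untouched
    updated["InputFormat"] = overrides["InputFormat"]
    updated["OutputFormat"] = overrides["OutputFormat"]
    updated.setdefault("SerdeInfo", {})["SerializationLibrary"] = (
        overrides["SerializationLibrary"]
    )
    updated.setdefault("Parameters", {})["classification"] = fmt
    return updated


# Abbreviated JSON-format descriptor (columns omitted for brevity).
json_descriptor = {
    "Location": "s3://amzn-s3-demo-bucket/",
    "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
    "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    "SerdeInfo": {"SerializationLibrary": "org.apache.hive.hcatalog.data.JsonSerDe"},
    "Parameters": {"classification": "json"},
}

parquet_descriptor = with_output_format(json_descriptor, "parquet")
orc_descriptor = with_output_format(json_descriptor, "orc")
print(parquet_descriptor["Parameters"], orc_descriptor["Parameters"])
```

Working on a deep copy keeps the original JSON-format descriptor available, which is convenient if you maintain tables in more than one format.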
