AWS Glue Streaming

AWS Glue Streaming, a component of AWS Glue, lets you process streaming data in near real time for tasks such as data ingestion, processing, and machine learning. Built on the Apache Spark Streaming framework, it is a serverless service that handles streaming data at scale. AWS Glue adds several optimizations on top of Apache Spark, including serverless infrastructure, autoscaling, visual job development, instant-on notebooks for streaming jobs, and other performance improvements.

Use cases for streaming

Some common use cases for AWS Glue Streaming include:

Near-real-time data processing: AWS Glue Streaming allows organizations to process streaming data in near real-time, enabling them to derive insights and make timely decisions based on the latest information.

Fraud detection: You can utilize AWS Glue Streaming for real-time analysis of streaming data, making it valuable for detecting fraudulent activities, such as credit card fraud, network intrusion, or online scams. By continuously processing and analyzing incoming data, you can swiftly identify suspicious patterns or anomalies.

Social media analytics: AWS Glue Streaming can process real-time social media data, such as tweets, posts, or comments, enabling organizations to monitor trends, analyze sentiment, and manage brand reputation in real time.

Internet of Things (IoT) analytics: AWS Glue Streaming is suitable for handling and analyzing high-velocity streams of data generated by IoT devices, sensors, and connected machinery. It allows for real-time monitoring, anomaly detection, predictive maintenance, and other IoT analytics use cases.

Clickstream analysis: AWS Glue Streaming can process and analyze real-time clickstream data from websites or mobile applications. This enables businesses to gain insights into user behavior, personalize user experiences, and optimize marketing campaigns as the data arrives.

Log monitoring and analysis: AWS Glue Streaming can continuously process and analyze log data from servers, applications, or network devices in real-time. This helps in detecting anomalies, troubleshooting issues, and monitoring system health and performance.

Recommendation systems: AWS Glue Streaming can process user activity data in real-time and update recommendation models dynamically. This allows for personalized and real-time recommendations based on user behavior and preferences.

These are some examples of the diverse range of use cases where AWS Glue Streaming can be applied. Its integration with the AWS ecosystem and with managed services makes it a convenient choice for real-time stream processing and analytics in the cloud.

What are the benefits of using AWS Glue Streaming?

The benefits of using AWS Glue Streaming are as follows:

  • Serverless: AWS Glue Streaming is serverless, eliminating the need to manage infrastructure. This reduces the operational overhead and allows users to focus on data processing and analytics tasks rather than infrastructure management.

  • Autoscaling: AWS Glue Streaming provides autoscaling capabilities, dynamically adjusting the processing capacity based on the workload. It automatically scales out or in to handle fluctuations in data volume, ensuring optimal performance and resource utilization.

  • Visual development: Streaming job development can be complex. AWS Glue Streaming addresses this challenge by offering AWS Glue Studio, a visual authoring tool. AWS Glue Studio simplifies the process of creating streaming workflows and enables developers to design and manage streaming applications visually, reducing the learning curve and increasing productivity.

  • Cost-effective: As a serverless service, AWS Glue Streaming offers cost efficiency by eliminating the need for provisioning and maintaining infrastructure. Users are billed based on the resources consumed during the execution of streaming jobs, allowing for cost optimization and scaling based on actual usage.

  • Handles complex workloads: AWS Glue Streaming is designed to handle complex streaming workloads. It can process and analyze large volumes of real-time data, support advanced transformations, and integrate with other AWS services, enabling sophisticated streaming data pipelines and analytics workflows.

  • No lock-in: AWS Glue Streaming provides flexibility and avoids vendor lock-in. Users can leverage AWS Glue Streaming as part of the broader AWS ecosystem, integrating it with other AWS services seamlessly. This allows for easy integration with existing data sources, applications, and services without being tied to a specific technology or platform.

When to use AWS Glue Streaming?

There are many options when it comes to streaming use cases. We recommend AWS Glue Streaming in the following scenarios:

  1. If you are already using AWS Glue or Spark for batch processing, AWS Glue Streaming is the ideal choice for you. It provides a seamless transition to building streaming jobs without the need to learn a new language or framework. Leveraging your existing knowledge and infrastructure, AWS Glue Streaming simplifies the job development process and allows you to easily extend your data processing capabilities to real-time streaming scenarios.

  2. If you require a unified service or product to handle batch, streaming, and event-driven workloads, AWS Glue Streaming is the solution for you. With AWS Glue Streaming, you can consolidate your data processing needs into a single framework, eliminating the complexity of managing multiple systems. This enables efficient development and maintenance of diverse data workflows while ensuring consistency and compatibility across different workload types.

  3. AWS Glue Streaming is well-suited for scenarios involving extremely large streaming data volumes and complex transformations, such as joins between streams or relational databases. It can efficiently process and analyze massive streams of data, enabling you to tackle demanding workloads with ease. Whether it is high-velocity data ingestion or intricate data manipulations, AWS Glue Streaming's scalability and advanced processing capabilities ensure optimal performance and accurate results.

  4. If you prefer a visual approach to building streaming jobs, AWS Glue offers AWS Glue Studio, with which you can design and manage your streaming applications visually. Its interface lets developers create, configure, and monitor streaming workflows, reducing the learning curve and increasing productivity.

  5. AWS Glue Streaming is an excellent choice for near-real-time use cases where the latency SLA (service level agreement) is greater than 10 seconds.

  6. If you are building a transactional data lake using Apache Iceberg, Apache Hudi, or Delta Lake, AWS Glue Streaming provides native support for these open table formats. This seamless integration enables you to process streaming data directly from these transactional data lakes, ensuring data consistency, integrity, and compatibility.

  7. If you need to deliver streaming data to a variety of data targets, AWS Glue Streaming provides native support for targets such as Amazon Redshift, Amazon RDS, Amazon Aurora, Oracle, SQL Server, and others.
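
The latency guidance in item 5 can be expressed as a small pre-flight check. The sketch below is an illustration only, not part of the AWS Glue library: the helper name `micro_batch_options`, the constant `MIN_LATENCY_SECONDS`, and the example bucket are assumptions introduced here. It assumes the job processes data in micro-batches, so the batch window should not be longer than the latency target, and it returns a dictionary shaped like the `windowSize` and `checkpointLocation` options that a Glue streaming script commonly passes to `GlueContext.forEachBatch`.

```python
# Hypothetical helper: not part of the AWS Glue library.
MIN_LATENCY_SECONDS = 10  # floor taken from the guidance in item 5


def micro_batch_options(latency_target_s, window_s, checkpoint_location):
    """Validate a latency target and build micro-batch options."""
    if latency_target_s <= MIN_LATENCY_SECONDS:
        raise ValueError(
            f"latency target must be greater than {MIN_LATENCY_SECONDS} seconds"
        )
    if window_s <= 0 or window_s > latency_target_s:
        raise ValueError("window must be positive and no longer than the target")
    return {
        "windowSize": f"{window_s} seconds",
        "checkpointLocation": checkpoint_location,  # assumed S3 path
    }


# Example: a 60-second latency target served by 30-second micro-batches.
options = micro_batch_options(60, 30, "s3://example-bucket/checkpoints/")
print(options["windowSize"])
```

A target of 10 seconds or less raises `ValueError`, which signals that the workload falls outside the range this guidance covers.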

Supported data sources

AWS Glue Streaming supports the following data sources:

  • Amazon Kinesis

  • Amazon MSK (Managed Streaming for Apache Kafka)

  • Self-managed Apache Kafka
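
As a rough illustration of how these sources differ in configuration, the sketch below builds the connection type and options for each one. The helper names `kinesis_source` and `kafka_source`, the stream ARN, and the JSON record format are assumptions made for this example; the option keys follow the Kinesis and Kafka connection options described in the AWS Glue documentation and should be verified against your AWS Glue version.

```python
# Hypothetical helpers: not part of the AWS Glue library. The option keys are
# believed to match the documented Kinesis and Kafka connection options.


def kinesis_source(stream_arn, starting_position="TRIM_HORIZON"):
    """Connection settings for an Amazon Kinesis data stream."""
    return "kinesis", {
        "streamARN": stream_arn,
        "startingPosition": starting_position,
        "classification": "json",  # assumes JSON records
        "inferSchema": "true",
    }


def kafka_source(topic, connection_name, starting_offsets="earliest"):
    """Connection settings for Amazon MSK or self-managed Apache Kafka.

    Both use the "kafka" connection type; connection_name is assumed to be
    an existing AWS Glue connection that holds the broker details.
    """
    return "kafka", {
        "connectionName": connection_name,
        "topicName": topic,
        "startingOffsets": starting_offsets,
        "classification": "json",  # assumes JSON records
    }


conn_type, conn_options = kinesis_source(
    "arn:aws:kinesis:us-east-1:111122223333:stream/example-stream"
)
print(conn_type, conn_options["startingPosition"])
```

In a Glue streaming script, values like these would typically be passed as `connection_type` and `connection_options` to `glueContext.create_data_frame.from_options`.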

Supported data targets

AWS Glue Streaming supports a variety of data targets such as:

  • Data targets supported by AWS Glue Data Catalog

  • Amazon S3

  • Amazon Redshift

  • MySQL

  • PostgreSQL

  • Oracle

  • Microsoft SQL Server

  • Snowflake

  • Any database that can be connected using JDBC

  • Apache Iceberg, Delta Lake, and Apache Hudi

  • AWS Glue Marketplace connectors
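
For the relational targets above, a JDBC connection is usually identified by a URL whose layout depends on the database engine. The sketch below shows one way to build such URLs. The helper `jdbc_url`, the template table, and the host and database names are hypothetical; the URL layouts and default ports follow common conventions for each vendor's JDBC driver and may need adjusting for a particular driver version.

```python
# Hypothetical helper: not part of the AWS Glue library.
# Each entry maps an engine to (URL template, conventional default port).
JDBC_TEMPLATES = {
    "mysql": ("jdbc:mysql://{host}:{port}/{database}", 3306),
    "postgresql": ("jdbc:postgresql://{host}:{port}/{database}", 5432),
    # For Oracle, "database" is the service name.
    "oracle": ("jdbc:oracle:thin:@//{host}:{port}/{database}", 1521),
    "sqlserver": ("jdbc:sqlserver://{host}:{port};databaseName={database}", 1433),
}


def jdbc_url(engine, host, database, port=None):
    """Build a JDBC URL for one of the engines in JDBC_TEMPLATES."""
    try:
        template, default_port = JDBC_TEMPLATES[engine]
    except KeyError:
        raise ValueError(f"unsupported engine: {engine}") from None
    return template.format(host=host, port=port or default_port, database=database)


print(jdbc_url("postgresql", "db.example.com", "analytics"))
```

Keeping the templates in a single table makes it straightforward to add another JDBC-compatible engine without changing the function.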