Watermarking In Apache Spark Structured Streaming
Watermarking In Apache Spark Structured Streaming
Published: None
Watermarking In Apache Spark Structured Streaming
The watermark concept in Spark Structured Streaming helps manage the timing of aggregate results in the presence of late-arriving data. Here’s a breakdown of how it works in the context you've described:
Key Concepts:
- Windowed Aggregation:
Spark Structured Streaming allows you to perform aggregations over specified time windows. For example, you might want to calculate the sum of events that occurred every 10 minutes.
- Watermarking:
Watermarking helps Spark determine when it has seen enough data to produce accurate aggregates for a specific time window, accounting for late data. It specifies how long to wait for late data before closing the window and emitting the results.
- Example Scenario:
- Watermark Definition:
Suppose you set a watermark of 10 minutes. This means Spark will wait up to 10 minutes after the end of a window to ensure that all data for that window has been processed.
- Aggregation Window:
Let’s say you are aggregating data over 10-minute windows (e.g., from 10:50 AM to 11:00 AM).
Data Arrival and Processing:
- At 11:00 AM: Spark hasn't seen data that’s late enough to ensure all data for the 10:50 AM to 11:00 AM window has been received. So, it doesn’t emit results yet. Instead, it updates its internal state with new data points.
- At 11:10 AM: The watermark is still not satisfied because the latest event time seen (say 10:53 AM) minus the watermark (10 minutes) is still not greater than the end of the window (11:00 AM). Thus, it updates the state but doesn’t emit results.
- At 11:20 AM: Spark has now seen a data point with an event time of 11:15 AM. With the 10-minute watermark, this means that 11:15 AM minus 10 minutes (11:05 AM) is now later than 11:00 AM.
- Therefore, Spark can now safely emit the aggregate result for the 10:50 AM to 11:00 AM window because it’s confident that all relevant data has been processed
Visual Representation:
- Initial State (11:00 AM): Window 10:50 AM to 11:00 AM is not yet emitted because the watermark condition isn't met.
- Updated State (11:10 AM): The window’s state is updated with new data, but still not emitted.
- Emit Results (11:20 AM): The watermark condition is met, so Spark emits the aggregate result for the 10:50 AM to 11:00 AM window and clears the state for that window.
Comments
Post a Comment