Watermarking In Apache Spark Structured Streaming


Source: https://www.linkedin.com/pulse/watermarking-apache-spark-structured-streaming-arabinda-mohapatra-xc7sc?trackingId=XFuzrPv8SFqHvSecoJC58A%3D%3D



Watermarking in Spark Structured Streaming controls when aggregate results for a time window are emitted in the presence of late-arriving data. Here’s a breakdown of how it works:

Key Concepts:

  • Windowed Aggregation:

Spark Structured Streaming allows you to perform aggregations over specified time windows. For example, you might want to calculate the sum of events that occurred every 10 minutes.

  • Watermarking:

Watermarking helps Spark determine when it has seen enough data to produce accurate aggregates for a specific time window, accounting for late data. It specifies how long to wait for late data before closing the window and emitting the results.
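As a rough PySpark sketch of how these two pieces fit together (the rate source, application name, and renamed eventTime column are just stand-ins for a real event stream, such as one read from Kafka):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("WatermarkDemo").getOrCreate()

# Stand-in input stream: the built-in "rate" source only provides a timestamp
# column, which we rename to play the role of a real event-time field.
events = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load()
    .withColumnRenamed("timestamp", "eventTime")
)

# Tolerate events arriving up to 10 minutes late, then count per 10-minute window.
windowed_counts = (
    events
    .withWatermark("eventTime", "10 minutes")
    .groupBy(window(col("eventTime"), "10 minutes"))
    .count()
)

# In append output mode, each window's count is emitted exactly once, only after
# the watermark has passed that window's end (walked through in the scenario below).
query = (
    windowed_counts.writeStream
    .outputMode("append")
    .format("console")
    .option("truncate", "false")
    .start()
)
query.awaitTermination()
```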

Example Scenario:

  • Watermark Definition:

Suppose you set a watermark delay of 10 minutes. Spark will then keep a window open until the maximum event time it has seen moves more than 10 minutes past the end of that window, giving late data that long (in event time) to arrive before the window is finalized.

  • Aggregation Window:

Let’s say you are aggregating data over 10-minute windows (e.g., from 10:50 AM to 11:00 AM).
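To make the window boundaries concrete, here is a small plain-Python sketch (not Spark code) of the tumbling-window arithmetic: an event falls into the window that starts at the last 10-minute boundary at or before its event time, which matches Spark's default window alignment. The timestamps are the ones from the scenario.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)

def window_bounds(event_time: datetime, window: timedelta = WINDOW):
    """Return the (start, end) of the tumbling window this event falls into."""
    epoch = datetime(1970, 1, 1)
    offset = (event_time - epoch) % window   # time elapsed since the last window start
    start = event_time - offset
    return start, start + window

# An event at 10:53 AM belongs to the 10:50 AM - 11:00 AM window.
start, end = window_bounds(datetime(2024, 1, 1, 10, 53))
print(start, end)   # 2024-01-01 10:50:00 2024-01-01 11:00:00
```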

Data Arrival and Processing:

  • At 11:00 AM: Spark has not yet seen an event time far enough past 11:00 AM to be confident that all data for the 10:50 AM to 11:00 AM window has arrived. So it doesn’t emit results yet; it only updates its internal state with the new data points.
  • At 11:10 AM: The watermark condition is still not satisfied, because the latest event time seen (say 10:53 AM) minus the watermark delay (10 minutes) is still not greater than the end of the window (11:00 AM). Spark again updates the state but doesn’t emit results.
  • At 11:20 AM: Spark has now seen a data point with an event time of 11:15 AM. With the 10-minute watermark delay, the watermark advances to 11:05 AM, which is later than the window end of 11:00 AM.
  • Therefore, Spark can now safely emit the aggregate result for the 10:50 AM to 11:00 AM window, confident that all relevant data has been processed.
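The decision in that timeline reduces to one comparison: the watermark is the maximum event time seen so far minus the configured delay, and a window’s result can be emitted once the watermark moves past the window’s end. Here is a plain-Python illustration of that arithmetic (the 10:52 AM value for the first trigger is an assumed maximum event time; the other timestamps come from the scenario above):

```python
from datetime import datetime, timedelta

DELAY = timedelta(minutes=10)                  # configured watermark delay
WINDOW_END = datetime(2024, 1, 1, 11, 0)       # end of the 10:50 AM - 11:00 AM window

def can_emit(max_event_time_seen: datetime) -> bool:
    """A window is finalized once the watermark (max event time - delay) passes its end."""
    watermark = max_event_time_seen - DELAY
    return watermark > WINDOW_END

# Trigger at 11:00 AM: assume the latest event time seen is 10:52 AM.
print(can_emit(datetime(2024, 1, 1, 10, 52)))  # False: watermark 10:42 AM, window stays open
# Trigger at 11:10 AM: latest event time seen is 10:53 AM.
print(can_emit(datetime(2024, 1, 1, 10, 53)))  # False: watermark 10:43 AM, still open
# Trigger at 11:20 AM: an event with event time 11:15 AM has arrived.
print(can_emit(datetime(2024, 1, 1, 11, 15)))  # True: watermark 11:05 AM > 11:00 AM, emit and clear state
```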



Visual Representation:

  • Initial State (11:00 AM): The result for the 10:50 AM to 11:00 AM window is not yet emitted because the watermark condition isn’t met.
  • Updated State (11:10 AM): The window’s state is updated with new data, but still not emitted.
  • Emit Results (11:20 AM): The watermark condition is met, so Spark emits the aggregate result for the 10:50 AM to 11:00 AM window and clears the state for that window.
