← Back to all products

Real-Time Streaming Toolkit

$69

Production streaming patterns for Structured Streaming and Delta Live Tables with Kafka/Event Hub integration and monitoring dashboards.

📁 15 files 🏷 v1.0.0

Python · SQL · Markdown · JSON · Azure · Databricks · PySpark · Spark · Delta Lake

📁 File Structure 15 files

```
real-time-streaming-toolkit/
├── README.md
├── config/
│   ├── autoscaling_config.md
│   └── checkpoint_management.md
├── dlt/
│   ├── cdc_processing.py
│   ├── expectations_library.py
│   └── streaming_medallion.py
├── guides/
│   ├── failure_recovery.md
│   └── performance_tuning.md
├── monitoring/
│   ├── streaming_alerts.sql
│   └── streaming_dashboard.sql
└── structured-streaming/
    ├── auto_loader_streaming.py
    ├── deduplication.py
    ├── event_hub_source.py
    └── kafka_source.py
```

📖 Documentation Preview README excerpt

Real-Time Streaming Toolkit

Product ID: real-time-streaming-toolkit

Version: 1.0.0

Price: $69

Category: Data Engineering

Author: [Datanest Digital](https://datanest.dev)

---

Overview

The Real-Time Streaming Toolkit is a production-grade collection of PySpark notebooks, Delta Live Tables pipelines, monitoring dashboards, and operational guides for building robust real-time data pipelines on Databricks. Every component has been battle-tested against high-throughput workloads and designed for exactly-once processing guarantees.

Whether you are ingesting from Kafka, Azure Event Hubs, or cloud object storage via Auto Loader, this toolkit gives you a proven starting point that eliminates weeks of trial-and-error engineering.

What's Included

Structured Streaming Notebooks

| File | Description |
|------|-------------|
| `structured-streaming/kafka_source.py` | Kafka source with schema registry integration, checkpoint management, and consumer group orchestration |
| `structured-streaming/event_hub_source.py` | Azure Event Hub source with native checkpoint store, partition-aware processing, and backpressure handling |
| `structured-streaming/auto_loader_streaming.py` | Auto Loader (cloudFiles) patterns for file-based streaming from S3, ADLS, and GCS |
| `structured-streaming/deduplication.py` | Exactly-once processing strategies including watermark-based deduplication, idempotent writes, and state management |
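The core pattern behind watermark-based deduplication can be sketched in a few lines. The column names, key, and 10-minute delay below are illustrative defaults, not the notebook's actual parameters:

```python
def deduplicate(events_df, watermark_col="event_time", delay="10 minutes",
                keys=("event_id",)):
    """Watermark-based deduplication for a streaming DataFrame.

    Including the event-time column among the dedup keys lets Spark expire
    per-key state once it falls behind the watermark, keeping state bounded.
    """
    return (
        events_df
        .withWatermark(watermark_col, delay)
        .dropDuplicates([watermark_col, *keys])
    )
```

Downstream, idempotent writes (for example a Delta `MERGE` keyed on the same identifiers) complete the exactly-once story the notebook describes.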

Delta Live Tables Pipelines

| File | Description |
|------|-------------|
| `dlt/streaming_medallion.py` | Full medallion architecture (Bronze/Silver/Gold) as a streaming DLT pipeline |
| `dlt/cdc_processing.py` | Change Data Capture processing with APPLY CHANGES INTO for SCD Type 1 and Type 2 |
| `dlt/expectations_library.py` | Reusable DLT data quality expectations library with severity levels and alerting hooks |
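One way the expectations-library idea is commonly organized (the expectation names and severity bands here are hypothetical, not the file's actual contents): keep expectations as plain data tagged with a severity, then feed each band to the matching DLT decorator.

```python
# name -> (SQL constraint, severity); "fail" aborts the update,
# "drop" discards violating rows, "warn" only records the violation
EXPECTATIONS = {
    "id_not_null":     ("id IS NOT NULL",              "fail"),
    "valid_timestamp": ("event_time IS NOT NULL",      "drop"),
    "known_country":   ("country IN ('US','GB','DE')", "warn"),
}

def by_severity(severity):
    """Return {name: constraint} for one severity band."""
    return {n: sql for n, (sql, sev) in EXPECTATIONS.items() if sev == severity}

# Inside a DLT pipeline this plugs into the standard decorators, e.g.:
#   @dlt.expect_all_or_fail(by_severity("fail"))
#   @dlt.expect_all_or_drop(by_severity("drop"))
#   @dlt.expect_all(by_severity("warn"))
#   def silver_events(): ...
```

Keeping the constraints as data (rather than inline decorator literals) is what makes the library reusable across pipelines.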

Monitoring & Alerting

| File | Description |
|------|-------------|
| `monitoring/streaming_dashboard.sql` | SQL dashboard queries for streaming lag, throughput, error rates, and checkpoint health |
| `monitoring/streaming_alerts.sql` | Alert queries for detecting pipeline failures, excessive lag, and data quality regressions |
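The same signals these queries surface are also available in code from a query's `lastProgress` JSON. A minimal sketch of a lag check over that structure (the thresholds are illustrative, not the toolkit's):

```python
def falling_behind(progress, max_batch_ms=60_000):
    """Heuristic lag check over a StreamingQueryProgress dict.

    Flags a micro-batch when processing throughput trails input throughput,
    or when the batch exceeded its execution-time budget.
    """
    input_rate = progress.get("inputRowsPerSecond") or 0.0
    processed_rate = progress.get("processedRowsPerSecond") or 0.0
    batch_ms = progress.get("durationMs", {}).get("triggerExecution", 0)
    return processed_rate < input_rate or batch_ms > max_batch_ms
```

A check like this can run in a monitoring job against `query.lastProgress`, complementing the SQL alerts with an in-process signal.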

Configuration Guides

| File | Description |
|------|-------------|
| `config/autoscaling_config.md` | Auto-scaling configuration for streaming clusters with recommended instance types and scaling policies |
| `config/checkpoint_management.md` | Checkpoint repair, migration, and disaster recovery procedures |
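Checkpoint management starts with a disciplined location scheme. One common convention, sketched below (the layout is illustrative, not the guide's prescribed one): one checkpoint per sink, with a version segment so a query gets a fresh checkpoint after an incompatible logic change.

```python
def checkpoint_path(base, pipeline, table, version=1):
    """Deterministic checkpoint location, one per streaming sink.

    Checkpoints encode source offsets and sink state, so they must never be
    shared between queries; bumping `version` starts a query over with a
    clean checkpoint after a breaking change to its logic.
    """
    return f"{base.rstrip('/')}/{pipeline}/{table}/v{version}/_checkpoint"
```

The resulting path is what gets passed to `.option("checkpointLocation", ...)` on the stream writer.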

Operational Guides

| File | Description |
|------|-------------|
| `guides/performance_tuning.md` | Trigger intervals, partition sizing, shuffle optimization, and state store tuning |
| `guides/failure_recovery.md` | Failure recovery playbook covering checkpoint corruption, schema evolution, and cluster failures |
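Trigger choice is the first tuning lever. A hedged sketch of the trade-off (the mode names are hypothetical; the `trigger` options are Spark's):

```python
def with_trigger(writer, mode="continuous_ingest", interval="30 seconds"):
    """Apply a trigger policy to a DataStreamWriter.

    - "continuous_ingest": fixed micro-batch cadence; shorter intervals cut
      latency but add per-batch scheduling overhead
    - "backfill": availableNow drains all pending data in rate-limited
      batches and then stops -- useful for scheduled catch-up runs
    """
    if mode == "continuous_ingest":
        return writer.trigger(processingTime=interval)
    if mode == "backfill":
        return writer.trigger(availableNow=True)
    raise ValueError(f"unknown trigger mode: {mode}")
```

The backfill mode pairs well with job clusters: the query terminates when caught up, so the cluster can shut down instead of idling between micro-batches.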

Requirements

  • Databricks Runtime 13.3 LTS or later (14.x+ recommended)
  • Unity Catalog enabled workspace (recommended)

... continues with setup instructions, usage examples, and more.

📄 Code Sample .py preview

`dlt/cdc_processing.py`

```python
# Databricks notebook source
# MAGIC %md
# MAGIC # CDC Processing — Delta Live Tables
# MAGIC
# MAGIC Change Data Capture (CDC) processing using DLT's `APPLY CHANGES INTO` for
# MAGIC SCD Type 1 and Type 2 patterns. Supports Debezium, Maxwell, and custom CDC formats.
# MAGIC
# MAGIC **Datanest Digital** — https://datanest.dev
# MAGIC
# MAGIC ---

# COMMAND ----------

# MAGIC %md
# MAGIC ## Pipeline Configuration
# MAGIC
# MAGIC Set these values in the DLT pipeline settings under "Configuration":
# MAGIC
# MAGIC | Key | Example | Description |
# MAGIC |-----|---------|-------------|
# MAGIC | `cdc_source_type` | `kafka` | Source: kafka or autoloader |
# MAGIC | `kafka_brokers` | `broker1:9092` | Kafka bootstrap servers |
# MAGIC | `cdc_topic` | `cdc.customers` | Kafka topic for CDC events |
# MAGIC | `cdc_format` | `debezium` | CDC format: debezium, maxwell, custom |
# MAGIC | `cdc_source_path` | `abfss://...` | Auto Loader path for CDC files |

# COMMAND ----------

import dlt
from pyspark.sql import DataFrame
from pyspark.sql.functions import (
    col, from_json, current_timestamp, lit, expr, to_timestamp, when,
    coalesce, struct, sha2, concat_ws, date_format, from_unixtime, lower, trim
)
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, LongType,
    TimestampType, DoubleType, BooleanType, MapType
)

# COMMAND ----------

# MAGIC %md
# MAGIC ## CDC Schemas

# COMMAND ----------

# Debezium CDC envelope schema
DEBEZIUM_SCHEMA = StructType([
    StructField("before", MapType(StringType(), StringType()), True),
    # ... 284 more lines ...
```
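The preview cuts off before the envelope parsing, but the mapping any Debezium consumer needs can be sketched independently. The `op` codes are Debezium's documented ones; the action names and helper are hypothetical, not the notebook's code:

```python
# Debezium 'op' codes -> CDC action ("r" rows appear during initial snapshots)
_DEBEZIUM_OPS = {"c": "upsert", "u": "upsert", "r": "upsert", "d": "delete"}

def cdc_action(envelope):
    """Classify one Debezium change event and pick the row image to apply.

    Deletes carry the row in 'before'; creates, updates, and snapshot
    reads carry it in 'after'.
    """
    op = envelope["op"]
    action = _DEBEZIUM_OPS.get(op)
    if action is None:
        raise ValueError(f"unknown Debezium op: {op}")
    row = envelope["before"] if action == "delete" else envelope["after"]
    return action, row

# In a DLT pipeline this classification is typically expressed declaratively,
# e.g. dlt.apply_changes(..., apply_as_deletes=expr("op = 'd'"), ...),
# rather than as manual merges.
```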