← Back to all products

Databricks Monitoring & Alerting Suite

$59

Complete monitoring and observability setup with 8 SQL dashboards, alert definitions, webhook templates, and 50+ system table queries.

📁 15 files · 🏷 v1.0.0

Tags: SQL · Markdown · JSON · Databricks · Redis

📁 File Structure (15 files)

```
databricks-monitoring-suite/
├── README.md
├── alerts/
│   ├── alert_definitions.sql
│   └── webhook_templates.json
├── dashboards/
│   ├── capacity_planning.sql
│   ├── cluster_utilization.sql
│   ├── cost_trends.sql
│   ├── data_freshness.sql
│   ├── job_failure_analysis.sql
│   ├── pipeline_health.sql
│   ├── query_performance.sql
│   └── user_activity.sql
├── queries/
│   └── system_table_library.sql
├── runbooks/
│   └── common_operations.md
└── templates/
    └── monthly_health_report.md
```

📖 Documentation Preview README excerpt

Databricks Monitoring & Alerting Suite

Version: 1.0.0

Author: [Datanest Digital](https://datanest.dev)

Price: $59

Category: Databricks

---

Overview

A comprehensive, production-ready monitoring and alerting toolkit for Databricks workspaces. This suite provides 8 SQL dashboards, configurable alert definitions, 50+ pre-built system table queries, webhook integrations, report templates, and operational runbooks -- everything you need to achieve full observability over your Databricks environment.

What's Included

Dashboards (8)

| Dashboard | Description |
|-----------|-------------|
| Pipeline Health | Success rates, failure trends, duration tracking across all pipelines |
| Cluster Utilization | CPU, memory, idle time, and cost attribution per cluster |
| Job Failure Analysis | Error categorization, root cause patterns, failure heatmaps |
| Cost Trends | Daily/weekly/monthly DBU spend broken down by team, workspace, SKU |
| User Activity | Active users, notebook execution patterns, query frequency analysis |
| Data Freshness | Table update timestamps vs SLA targets with breach detection |
| Query Performance | SQL warehouse query latency, throughput, and optimization signals |
| Capacity Planning | Growth projections, resource demand forecasting, headroom analysis |
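To give a flavor of the dashboard queries, here is a minimal sketch of a Pipeline Health-style success-rate query. It assumes the `system.lakeflow.job_run_timeline` table used in the bundled alert definitions; the actual dashboard queries in `dashboards/pipeline_health.sql` may differ.

```sql
-- Sketch: daily job success rate over the last 7 days.
-- Assumes period_start_time and result_state columns, as in
-- alerts/alert_definitions.sql.
SELECT
  DATE(period_start_time)                                  AS run_date,
  COUNT(*)                                                 AS total_runs,
  SUM(CASE WHEN result_state = 'FAILED' THEN 1 ELSE 0 END) AS failed_runs,
  ROUND(100.0 * SUM(CASE WHEN result_state = 'SUCCEEDED' THEN 1 ELSE 0 END)
        / COUNT(*), 2)                                     AS success_rate_pct
FROM system.lakeflow.job_run_timeline
WHERE period_start_time >= CURRENT_TIMESTAMP() - INTERVAL 7 DAYS
GROUP BY DATE(period_start_time)
ORDER BY run_date;
```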

Alerts

  • Alert Definitions -- SQL-based alert rules for job failures, cost spikes, idle clusters, SLA breaches, and more
  • Webhook Templates -- Ready-to-use payload templates for Slack, Microsoft Teams, PagerDuty, and email
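For reference, a Slack incoming-webhook payload for a job-failure alert might look like the sketch below. This is an illustration only; the field names and `{{...}}` placeholder syntax are assumptions, and the actual templates shipped in `alerts/webhook_templates.json` may differ.

```json
{
  "channel": "#data-alerts",
  "username": "Databricks Monitor",
  "text": ":rotating_light: Job Failure Spike detected",
  "blocks": [
    {
      "type": "section",
      "text": {
        "type": "mrkdwn",
        "text": "*Alert:* job_failure_spike\n*Failures (last hour):* {{failures_last_hour}}\n*Threshold:* {{threshold}}\n*Time:* {{alert_time}}"
      }
    }
  ]
}
```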

Queries

  • System Table Library -- 50+ pre-built queries against system.billing, system.access, system.compute, and related tables covering billing analysis, access auditing, compute profiling, and operational diagnostics
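As an illustration of the kind of query the library contains, a daily DBU-spend breakdown from `system.billing.usage` might be sketched as follows; this is not necessarily one of the 50+ bundled queries.

```sql
-- Sketch: daily DBU consumption by SKU over the last 30 days.
-- system.billing.usage exposes usage_date, sku_name, and usage_quantity.
SELECT
  usage_date,
  sku_name,
  SUM(usage_quantity) AS total_dbus
FROM system.billing.usage
WHERE usage_date >= DATE_SUB(CURRENT_DATE(), 30)
GROUP BY usage_date, sku_name
ORDER BY usage_date DESC, total_dbus DESC;
```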

Templates & Runbooks

  • Monthly Health Report -- Markdown template with embedded query references for generating executive-level monthly reports
  • Common Operations Runbook -- Step-by-step procedures for incident response, cost optimization, capacity management, and access review

Prerequisites

  • Databricks workspace with Unity Catalog enabled
  • Access to [system tables](https://docs.databricks.com/en/administration-guide/system-tables/index.html) (system.billing, system.access, system.compute)
  • SQL warehouse (Serverless or Pro recommended)
  • Databricks SQL Alerts & Dashboards feature enabled
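Before importing anything, a quick way to confirm the prerequisites are in place is a smoke-test query against the required system schemas. The sketch below is not part of the suite; run it from any SQL warehouse, and any permission error identifies the schema that still needs a grant.

```sql
-- Returns one row per schema if system table access is granted.
SELECT 'billing' AS source, COUNT(*) > 0 AS readable FROM system.billing.usage
UNION ALL
SELECT 'compute', COUNT(*) > 0 FROM system.compute.clusters
UNION ALL
SELECT 'access',  COUNT(*) > 0 FROM system.access.audit;
```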

Quick Start

1. Import Dashboards -- Open each .sql file in dashboards/ and create a new Databricks SQL dashboard with the queries

2. Configure Alerts -- Run alerts/alert_definitions.sql to create alert rules, then configure destinations using the webhook templates in alerts/webhook_templates.json

3. Explore Queries -- Use queries/system_table_library.sql as a reference library; copy individual queries into notebooks or dashboards as needed

4. Schedule Reports -- Adapt templates/monthly_health_report.md to your organization and schedule the underlying queries

5. Adopt Runbooks -- Customize runbooks/common_operations.md with your team's escalation paths and thresholds
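As an example of how these pieces compose, an idle-cluster alert in the style of `alerts/alert_definitions.sql` might be sketched like this. The 10% threshold is hypothetical, and the `system.compute.node_timeline` column usage is an assumption to adapt to your workspace.

```sql
-- Alert sketch: idle_cluster
-- Flags clusters averaging under 10% CPU over the last 2 hours.
-- Trigger condition: Rows > 0
SELECT
  cluster_id,
  ROUND(AVG(cpu_user_percent + cpu_system_percent), 1) AS avg_cpu_pct
FROM system.compute.node_timeline
WHERE start_time >= CURRENT_TIMESTAMP() - INTERVAL 2 HOURS
GROUP BY cluster_id
HAVING AVG(cpu_user_percent + cpu_system_percent) < 10;
```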

File Structure



*... continues with setup instructions, usage examples, and more.*

📄 Code Sample (.sql preview)

alerts/alert_definitions.sql

```sql
-- ============================================================================
-- Alert Definitions
-- Databricks Monitoring & Alerting Suite
-- Author: Datanest Digital (https://datanest.dev)
-- ============================================================================
-- SQL alert queries for Databricks SQL Alerts. Each query is designed to
-- return rows only when the alert condition is met. Configure each as a
-- Databricks SQL Alert with the specified trigger condition.
-- ============================================================================

-- =========================================================================
-- ALERT 1: Job Failure Spike
-- Trigger: When failure count in last hour exceeds threshold.
-- Suggested schedule: Every 15 minutes
-- Trigger condition: Rows > 0
-- =========================================================================
-- Alert: job_failure_spike
SELECT
  COUNT(*)            AS failures_last_hour,
  5                   AS threshold,
  CURRENT_TIMESTAMP() AS alert_time
FROM system.lakeflow.job_run_timeline
WHERE period_start_time >= CURRENT_TIMESTAMP() - INTERVAL 1 HOUR
  AND result_state = 'FAILED'
HAVING COUNT(*) > 5;

-- =========================================================================
-- ALERT 2: Critical Job Failure
-- Trigger: When a specific critical job fails.
-- Suggested schedule: Every 5 minutes
-- Trigger condition: Rows > 0
-- Customize the job_id list for your critical pipelines.
-- =========================================================================
-- Alert: critical_job_failure
SELECT
  job_id,
  run_id,
  result_state,
  error_message,
  period_start_time AS failure_time
FROM system.lakeflow.job_run_timeline
WHERE period_start_time >= CURRENT_TIMESTAMP() - INTERVAL 10 MINUTES
  AND result_state = 'FAILED'
  AND job_id IN (
    -- Replace with your critical job IDs
    0 -- placeholder
  );
```

*... 219 more lines ...*