Datadog helps detect ‘problem’ Spark and Databricks jobs

Cloud monitoring and security firm Datadog has introduced Data Jobs Monitoring, which allows teams to detect problematic Spark and Databricks jobs anywhere in their data pipelines. It also allows them to remediate failed and long-running jobs faster, and optimize over-provisioned compute resources to reduce costs, promises the provider.

Data Jobs Monitoring is said to immediately surface specific jobs that need optimization and reliability improvements, while enabling teams to drill down into job execution traces so that they can correlate their job telemetry with their cloud infrastructure for “fast debugging.”

On the technology, Matt Camilli, head of engineering at Rhythm Energy, said: “My team is able to resolve our Databricks job failures 20 percent faster, because of how easy it is to set up real-time alerting and find the root cause of the failing job.”

“When data pipelines fail, data quality is impacted, which can hurt stakeholder trust and slow down decision making,” added Michael Whetten, VP of product at Datadog. “Data Jobs Monitoring gives data platform engineers full visibility into their largest, most expensive jobs, to help them improve data quality, optimize their pipelines and prioritize cost savings.”

Out-of-the-box alerts immediately notify teams when jobs have failed or are running beyond automatically detected baselines, so that problems can be addressed before they affect the end-user experience. Recommended filters in Data Jobs Monitoring also surface the most important issues affecting job and cluster health, so that they can be prioritized.
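Datadog has not published how those baselines are calculated. As a rough illustration only of what "running beyond a baseline" can mean, the sketch below flags a run whose duration exceeds the median of past runs by a margin based on the median absolute deviation. The function names, the `k` multiplier and the `min_slack` floor are all hypothetical and are not part of any Datadog API.

```python
from statistics import median


def duration_baseline(history_seconds):
    """Return (median, median absolute deviation) of past run durations."""
    med = median(history_seconds)
    mad = median(abs(d - med) for d in history_seconds)
    return med, mad


def is_running_long(history_seconds, current_seconds, k=3.0, min_slack=60.0):
    """Hypothetical check: is the current run well beyond its usual duration?

    The threshold is median + max(k * MAD, min_slack). The min_slack floor
    (an assumption) keeps very steady jobs from alerting on tiny variations.
    """
    med, mad = duration_baseline(history_seconds)
    threshold = med + max(k * mad, min_slack)
    return current_seconds > threshold


# Past runs of an illustrative job, in seconds (made-up numbers).
history = [600, 620, 610, 590, 605]
print(is_running_long(history, 650))  # within the usual range
print(is_running_long(history, 700))  # beyond the baseline
```

A median-based baseline is used here because a single unusually slow run in the history shifts it far less than it would shift a mean.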

In addition, detailed trace views show teams exactly where a job failed in its execution flow, so they have the full context for faster troubleshooting. Multiple job runs can also be compared with one another to expedite root cause analysis and to identify trends and changes in run duration, Spark performance metrics, cluster utilization and configuration.

Finally, resource utilization and Spark application metrics help teams identify ways to lower compute costs for over-provisioned clusters and optimize inefficient job runs.

A 2023 Gartner Magic Quadrant named the leading observability and APM vendors as Dynatrace, Datadog, New Relic, Splunk, and Honeycomb. There were 14 other vendors mentioned in the MQ.