Chronosphere takes on Datadog with AI that explains itself, not just outages

Chronosphere, a New York-based startup specializing in observability solutions, unveiled new AI-powered features aimed at streamlining the diagnosis and resolution of production software failures. This development addresses a growing challenge as AI accelerates software development, increasing system complexity and making traditional debugging methods less effective.

The new capabilities combine AI-driven analysis with Chronosphere’s proprietary Temporal Knowledge Graph, a dynamic, continuously updated representation of an organization’s services, infrastructure dependencies, and system changes over time. The approach targets a gap enterprises face today: while AI expedites code creation, troubleshooting remains a largely manual, time-consuming process that slows incident resolution.

Accelerated Code Generation Meets Persistent Debugging Challenges

Although AI enables developers to write code approximately 13% faster, according to recent industry analyses, debugging production issues remains labor-intensive. When critical failures occur, such as a slowdown during an e-commerce checkout or transaction errors in a banking app, engineers must manually analyze vast amounts of telemetry data, including server logs, application traces, infrastructure metrics, and recent deployment records, to pinpoint root causes.

Chronosphere’s solution rests on four foundational elements: automated “Suggestions” that recommend data-backed investigative paths; the Temporal Knowledge Graph, a continuously evolving system map that captures service relationships and changes over time; Investigation Notebooks that chronicle troubleshooting steps for future reference; and natural language query capabilities that simplify data exploration.

Martin Mao, CEO and co-founder of Chronosphere, describes the Temporal Knowledge Graph as “a living, time-sensitive model of your entire system. It integrates telemetry data (metrics, traces, logs) with infrastructure context, deployment events, feature flag changes, and even human annotations like runbooks into a single, queryable framework that evolves alongside your environment.”

Temporal Context: A Distinctive Edge Over Traditional Dependency Maps

Unlike conventional service dependency maps offered by competitors such as Datadog, Dynatrace, and Splunk, Chronosphere’s Temporal Knowledge Graph incorporates the dimension of time, tracking how services and their interdependencies evolve. This temporal awareness enables the system to correlate changes directly with incidents, revealing not just what changed but why it impacted system behavior.
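
The idea of a time-aware dependency map can be illustrated with a minimal sketch. Everything here is hypothetical: the class names, fields, and sample services are invented for illustration and do not describe Chronosphere’s actual data model. Each dependency edge carries a validity interval, so a query can ask which services were connected at the moment of an incident and which changes landed on them shortly before it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class DependencyEdge:
    """Hypothetical edge: `caller` depended on `callee` during an interval."""
    caller: str
    callee: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means the edge still exists

    def active_at(self, when: datetime) -> bool:
        return self.valid_from <= when and (self.valid_to is None or when < self.valid_to)


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical change record, e.g. a deploy or a feature flag flip."""
    service: str
    kind: str
    at: datetime


def dependencies_at(edges, service, when):
    """Services that `service` depended on, directly or transitively, at `when`."""
    seen = set()
    stack = [service]
    while stack:
        current = stack.pop()
        for edge in edges:
            if edge.caller == current and edge.active_at(when) and edge.callee not in seen:
                seen.add(edge.callee)
                stack.append(edge.callee)
    return seen


def changes_before_incident(edges, changes, service, incident_at, lookback):
    """Changes to `service` or its dependencies inside the lookback window, newest first."""
    scope = dependencies_at(edges, service, incident_at) | {service}
    window_start = incident_at - lookback
    hits = [c for c in changes if c.service in scope and window_start <= c.at <= incident_at]
    return sorted(hits, key=lambda c: c.at, reverse=True)


# Sample data (invented): checkout stopped calling legacy-cart a week before the incident.
incident_at = datetime(2025, 1, 1, 12, 0)
edges = [
    DependencyEdge("checkout", "payment", incident_at - timedelta(days=30)),
    DependencyEdge("payment", "ledger", incident_at - timedelta(days=30)),
    DependencyEdge("checkout", "legacy-cart", incident_at - timedelta(days=30),
                   incident_at - timedelta(days=7)),
]
changes = [
    ChangeEvent("ledger", "feature_flag", incident_at - timedelta(minutes=20)),
    ChangeEvent("legacy-cart", "deploy", incident_at - timedelta(minutes=10)),
    ChangeEvent("payment", "deploy", incident_at - timedelta(days=3)),
]
related = changes_before_incident(edges, changes, "checkout", incident_at, timedelta(hours=1))
```

In this sample, only the ledger feature flag change is returned. The legacy-cart deploy is more recent, but that dependency no longer existed at incident time, which is the distinction a static dependency map cannot make.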

Moreover, Chronosphere’s technology normalizes custom, non-standard telemetry data, so that application-specific signals, which are often overlooked by platforms that rely on standardized integrations, are fully incorporated. This broader visibility reduces blind spots and improves the accuracy of AI-driven diagnostics.

Empowering Engineers: Transparency Over Automation

Chronosphere deliberately designed its AI features to augment, not replace, human decision-making. This approach counters the “confident-but-wrong” guidance problem seen in early AI observability tools, where automated systems make decisions without sufficient transparency.

“Our AI doesn’t act autonomously behind the scenes,” Mao explains. “Instead, it presents its reasoning, offers evidence such as timing, dependencies, and error patterns, and allows engineers to verify or override suggestions. Each recommendation includes a ‘Why was this suggested?’ section, enabling users to understand the checks performed before taking action.”

For example, if a service level objective (SLO) alert triggers on a checkout process, Chronosphere might suggest investigating the Payment service, highlighting error onset and related metrics. Engineers can then explore detailed charts and reasoning within the same interface, with the system dynamically updating suggestions as the investigation narrows. This workflow eliminates the need to switch between multiple tools or tabs.
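
A suggestion that carries its own evidence might look like the following sketch. The scoring rules, thresholds, and field names are assumptions made up for illustration, not Chronosphere’s algorithm; the point is only that each ranked candidate keeps a list of the checks that put it there, in the spirit of the “Why was this suggested?” section.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Suggestion:
    service: str
    score: int = 0
    reasons: list = field(default_factory=list)  # human-readable evidence


def suggest(alert_at, candidates, dependencies, max_lead=timedelta(minutes=15)):
    """Rank candidate services and record why each one was ranked.

    candidates:   dict of service -> (error_onset, error_rate), hypothetical inputs
    dependencies: set of services the alerting service calls directly
    """
    suggestions = []
    for service, (onset, error_rate) in candidates.items():
        s = Suggestion(service)
        lead = alert_at - onset
        # Timing: errors that started shortly before the alert are strong evidence.
        if timedelta(0) <= lead <= max_lead:
            s.score += 2
            s.reasons.append(f"errors began {int(lead.total_seconds() // 60)} min before the alert")
        # Topology: a direct dependency can plausibly affect the alerting service.
        if service in dependencies:
            s.score += 1
            s.reasons.append("is a direct dependency of the alerting service")
        # Error pattern: an arbitrary 5% threshold, chosen only for this sketch.
        if error_rate >= 0.05:
            s.score += 1
            s.reasons.append(f"error rate is {error_rate:.0%}")
        if s.reasons:
            suggestions.append(s)
    return sorted(suggestions, key=lambda s: s.score, reverse=True)


# Sample data (invented) for an SLO alert on checkout.
alert_at = datetime(2025, 1, 1, 12, 0)
candidates = {
    "payment": (alert_at - timedelta(minutes=4), 0.12),
    "inventory": (alert_at - timedelta(minutes=40), 0.02),
    "search": (alert_at - timedelta(hours=2), 0.01),
}
ranked = suggest(alert_at, candidates, dependencies={"payment", "inventory"})
for s in ranked:
    print(s.service, s.score, "; ".join(s.reasons))
```

Because the reasons travel with the score, an engineer can check each claim against the underlying charts and override a suggestion whose evidence does not hold up.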

Investigation Notebooks document the entire troubleshooting journey, capturing causal chains such as a feature flag update leading to pod memory exhaustion, which in turn caused downstream service degradation. This recorded knowledge feeds back into the Temporal Knowledge Graph, accelerating resolution of similar future incidents.
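
As a rough sketch of how a recorded causal chain could be stored and reused, consider the following. The `Notebook` class and the `similar_notebooks` helper are hypothetical names invented for this example and assume nothing about how Chronosphere stores investigations.

```python
from dataclasses import dataclass, field


@dataclass
class Notebook:
    """Hypothetical record of one investigation as ordered cause -> effect links."""
    incident: str
    steps: list = field(default_factory=list)

    def record(self, cause, effect):
        # Each new link must start where the previous one ended.
        if self.steps and self.steps[-1][1] != cause:
            raise ValueError(f"{cause!r} does not follow {self.steps[-1][1]!r}")
        self.steps.append((cause, effect))

    def causal_chain(self):
        """Flatten the links into a single ordered chain of events."""
        if not self.steps:
            return []
        return [self.steps[0][0]] + [effect for _, effect in self.steps]


def similar_notebooks(archive, symptom):
    """Past investigations whose causal chain included the given symptom."""
    return [nb for nb in archive if symptom in nb.causal_chain()]


# The causal chain described above, recorded step by step.
nb = Notebook("checkout-slo-breach")
nb.record("feature flag update", "pod memory exhaustion")
nb.record("pod memory exhaustion", "downstream service degradation")

archive = [nb]
matches = similar_notebooks(archive, "pod memory exhaustion")
```

When a later incident shows pod memory exhaustion, the lookup surfaces the earlier notebook, and with it the feature flag update that started the chain.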

Competing with Industry Giants: A Technical and Strategic Differentiator

Chronosphere operates in a competitive landscape dominated by established players like Datadog, Dynatrace, and Splunk, all of which have introduced AI-enhanced troubleshooting capabilities within their comprehensive observability suites. However, Mao emphasizes that many of these solutions rely heavily on pattern recognition and anomaly summarization, which often fall short during complex incidents.

“These platforms tend to correlate signals without deep causal analysis, resulting in impressive demos but limited real-world effectiveness,” Mao notes. “They also focus on standardized telemetry sources, missing critical insights from custom application data. This gap leads to AI models generating misleading guidance that wastes engineering time.”

Chronosphere’s focus on integrating custom telemetry and providing transparent, causally grounded AI recommendations has earned it recognition as a Leader in Gartner’s Magic Quadrant for Observability Platforms for two consecutive years. It also received a 4.7 out of 5 rating in Gartner Peer Insights’ “Voice of the Customer” report, reflecting strong user satisfaction.

Driving Cost Efficiency Amid Rising Observability Expenses

As enterprises grapple with soaring observability costs driven by exponential growth in telemetry data, Chronosphere claims its platform can reduce data volumes and associated expenses by an average of 84%, while decreasing critical incidents by up to 75%. This cost-effectiveness is a key differentiator in a market where over 70% of observability budgets are spent storing rarely queried logs.

Real-world customer successes underscore these claims: Robinhood reported a fivefold increase in reliability and a fourfold improvement in mean time to detection; DoorDash enhanced monitoring governance; Astronomer achieved over 85% cost savings through data shaping; and Affirm successfully scaled its load tenfold during Black Friday without incident.

Mao advises CIOs to evaluate AI observability tools based on transparency, coverage of custom telemetry, and the extent to which they reduce manual toil and tool-switching, rather than relying solely on marketing claims.

Strategic Partnerships Over Monolithic Platforms

In parallel with its AI troubleshooting launch, Chronosphere introduced a Partner Program integrating five specialized vendors: Arize for large language model monitoring, Embrace for real user monitoring, Polar Signals for continuous profiling, Checkly for synthetic monitoring, and Rootly for incident management.

This composable approach contrasts with the all-in-one platforms favored by many competitors. “Global enterprises require best-in-class capabilities across observability domains,” Mao explains. “Our partnerships enable customers to leverage specialized tools seamlessly, ensuring depth and flexibility at scale.”

Partners echo this sentiment. Noah Smolen of Arize highlights the importance of integrated AI observability suites for Fortune 500 companies adopting AI rapidly, while Rootly’s CEO JJ Tang emphasizes how the integration reduces incident resolution times by over 78%, improving reliability and innovation velocity.

While customers currently manage separate contracts with these vendors, Chronosphere plans to simplify procurement by moving toward unified agreements, enhancing value and streamlining operations.

From Uber’s Halloween Outages to a Billion-Dollar Vision

Chronosphere’s founders, Martin Mao and Rob Skillington, previously led Uber’s internal observability platform, tackling critical outages during peak demand nights like Halloween and New Year’s Eve. Their solution, built on open-source technologies, enabled Uber to maintain visibility and reliability during these high-stress periods.

Recognizing that Kubernetes adoption would make Uber’s challenges common across industries, they launched Chronosphere in 2019 to address observability at scale for cloud-native environments. Since then, the company has raised over $300 million from top-tier investors and grown to nearly 300 employees across multiple U.S. offices.

Its clientele includes leading technology firms such as Robinhood, DoorDash, Affirm, and others operating large-scale Kubernetes infrastructures.

Current Availability and Future Roadmap

Chronosphere’s AI troubleshooting features, including Suggestions and Investigation Notebooks, are currently in limited release with select customers, with general availability planned for 2026. Meanwhile, the company’s AI Query API, which enables integration of Chronosphere’s observability data into internal AI workflows and development environments, is available immediately to all customers.

This phased rollout reflects a cautious strategy to ensure AI-driven guidance is reliable and genuinely accelerates incident resolution in production settings. By collecting user feedback during early adoption, Chronosphere aims to refine its algorithms and build trust among engineering teams.

Ultimately, Chronosphere’s vision for the future of observability centers on transparent AI that explains its reasoning and a modular ecosystem of best-in-class tools. In an era of increasing system complexity and data overload, the company bets that empowering engineers with clear insights and control, rather than opaque automation, will define success in the AI-driven observability landscape.
