Senior AI Engineer - APM Experiences
About the Role
<h2><strong>The opportunity</strong></h2> <p>Datadog’s APM Experiences team owns the core product experience for <a href="https://docs.datadoghq.com/tracing/">Application Performance Monitoring</a> — including distributed tracing, service representation, and more. We’re building a new wave of AI-powered capabilities that help customers <em>detect, resolve, and prevent</em> performance issues faster. In this role, you will lead end‑to‑end development of LLM- and Agent‑based features that can:</p> <ul> <li>Debug and investigate application performance issues down to the root cause, as both a developer assistant and a fully autonomous agent</li> <li>Proactively recommend performance and reliability-based optimizations to prevent the next incident</li> <li>Automatically create intelligent monitors and SLOs for the most important business flows and critical paths</li> </ul> <p>This is a highly product‑minded engineering role: you’ll work from problem discovery and UX all the way to reliable, scalable production systems.</p> <h2><strong>What you’ll do</strong></h2> <ul> <li><strong>Shape AI experiences for APM.</strong> Design and ship LLM/agentic workflows that analyze traces, metrics, logs, and other telemetry to generate diagnoses, explanations, and guided fixes.</li> <li><strong>Own the full loop.</strong> Prototype quickly, define success metrics and evals, run experiments, iterate, and ultimately productionize for scale and reliability.</li> <li><strong>Build robust agent systems.</strong> Develop tools, retrieval and planning strategies, and guardrails; manage prompts/evals; design fallbacks and human‑in‑the‑loop paths.</li> <li><strong>Integrate with Datadog’s platform.</strong> Leverage surfaces like Trace Explorer, Service Catalog, monitors, and workflows to deliver end‑to‑end value in the APM UI.</li> <li><strong>Partner deeply.</strong> Collaborate with PM, Design, and partner teams to build cohesive experiences.</li> <li><strong>Raise the bar on 
engineering.</strong> Write performant, maintainable backend code, own services in production, and improve reliability for high‑throughput, low‑latency data systems.</li> </ul> <h2><strong>Who you are</strong></h2> <p><strong>Product‑minded engineer who ships AI to production</strong></p> <ul> <li>4+ years building backend or real-time ML systems; you value simplicity, correctness, and performance</li> <li>Proven experience delivering LLM/agent features to production (prompting, tooling, evals, safety/guardrails)</li> <li>Comfortable owning user journeys, iterating from prototype → alpha → GA, and measuring impact with clear product metrics</li> <li>You have demonstrated ability to use AI coding tools in day-to-day workflows and validate, critique, and refine AI-generated output</li> <li>You’re motivated to push the boundaries of how AI can improve software engineering best practices and contribute to building AI-enabled products</li> </ul> <p><strong>Strong ML / applied science fundamentals</strong></p> <ul> <li>Solid grasp of the ML lifecycle (task definition, dataset collection, modeling, evaluation, deployment, iteration) and statistics (experiment design, confidence intervals)</li> <li>Experience choosing/modeling the right technique for the job (e.g., anomaly detection, ranking/recommendation, NLP), and knowing when a heuristic beats a model</li> <li>Fluency with offline/online evals for AI systems; can build reliable golden sets and automatic regressions</li> </ul> <p><strong>Distributed systems & observability savvy</strong></p> <ul> <li>Experience with microservices performance: tracing, latency breakdowns, concurrency, and resiliency patterns</li> <li>Proficient in Go, Java, or Python; strong API/service design; production ops (monitoring, alerting, on‑call rotation)</li> </ul> <p><strong>Nice to have</strong></p> <ul> <li>Hands‑on with distributed tracing stacks (OpenTelemetry/Datadog APM), profilers, and logs/metrics pipelines</li> <li>Exposure to 
planning/agent frameworks, tool‑use orchestration, RAG, and retrieval/indexing for observability data</li> <li>Familiarity with SLO/SLA practices and incident response</li> </ul> <h4>Benefits and Growth:</h4> <ul> <li>Build tools for software engineers just like yourself, and use the tools we build to accelerate our development.</li> </ul>