AI Agent Observability: The Tools You Actually Need in Production

Most teams deploy AI agents and hope they work. Here are the observability tools, metrics, and alerting strategies you actually need to run agents in production.

Chase Dillingham

Founder & CEO, TrainMyAgent

11 min read · 15 sources cited

Your AI agent has been in production for three weeks. Everything looks fine.

Is it?

You don’t know. Because you’re not measuring anything. You shipped it, checked it once, and moved on to the next project. Meanwhile, latency is creeping up, your token costs doubled because someone’s prompt grew by 3,000 tokens, and 12% of responses are confidently wrong.

That’s not a hypothetical. That’s the default state of most production AI agents.

Why Traditional Monitoring Doesn’t Cut It

You already have Datadog. Or New Relic. Or Grafana. Great. Those tools tell you if your server is up, if response times are normal, and if error rates are spiking.

They tell you nothing about whether your agent’s outputs are correct.

Traditional APM monitors infrastructure. AI agent observability monitors intelligence. Different problem. Different tools.

Here’s what traditional monitoring catches:

  • Server is down (HTTP 500)
  • API is slow (latency spike)
  • Request failed (timeout, rate limit)

Here’s what it misses:

  • Agent hallucinated a company policy that doesn’t exist
  • Output quality dropped 15% after a model update
  • Token usage per request grew 40% over two months
  • The agent is routing 30% of queries to the wrong tool
  • Cost per successful interaction doubled

Google’s SRE team calls this the difference between “is it working?” and “is it working well?” (source). For AI agents, “working” isn’t binary. It’s a spectrum, and you need to know where on that spectrum you are at all times.

The Five Metrics That Actually Matter

Before picking tools, know what to measure. Everything else is dashboard decoration.

1. Latency (p50, p95, p99)

Not just “average response time.” You need percentile breakdowns.

Why: A p50 of 2 seconds and a p99 of 45 seconds means 1 in 100 users waits 45 seconds. That’s unacceptable for most use cases, but the average looks fine.

Target benchmarks:

  • Conversational agents: p95 < 3 seconds
  • Document processing: p95 < 10 seconds
  • Multi-step workflows: p95 < 30 seconds
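To see why averages lie, here is a minimal percentile sketch in pure Python. The nearest-rank method and the sample latencies are illustrative; in production these numbers come from your tracing tool, not an in-memory list.

```python
# Nearest-rank percentile sketch. Pure Python, no dependencies.
def percentile(samples: list[float], p: float) -> float:
    """Return the value at the p-th percentile (nearest-rank), p in [0, 100]."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]

# Nine fast responses and one 45-second outlier.
latencies_s = [2.1, 1.8, 2.4, 2.0, 45.0, 1.9, 2.2, 2.3, 1.7, 2.5]

p50 = percentile(latencies_s, 50)
p95 = percentile(latencies_s, 95)
p99 = percentile(latencies_s, 99)
mean = sum(latencies_s) / len(latencies_s)

# The mean hides the tail; the p95/p99 expose it.
print(f"mean={mean:.1f}s p50={p50}s p95={p95}s p99={p99}s")
```

With these samples the p50 sits near 2 seconds while the p95 and p99 both land on the 45-second outlier, which is exactly the failure mode the “average looks fine” trap hides.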

2. Token Usage Per Interaction

This is your cost proxy. Track it per agent, per tool call, and per conversation turn.

A 2025 analysis from Helicone across 500M+ LLM requests showed that token usage variance across “identical” tasks can be 3-5x depending on prompt design and model behavior (source). If you’re not tracking this, you’re blind to your single largest variable cost.

What to track:

  • Input tokens per request
  • Output tokens per request
  • Total tokens per conversation
  • Tokens per tool call
  • Cost per interaction (tokens x model pricing)
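The cost calculation itself is trivial; the value is in tracking it continuously. Here is a sketch using placeholder per-1K-token prices (not any real provider’s rates) that shows how a quietly growing prompt turns into real money at scale.

```python
# Token cost sketch. Prices are illustrative placeholders, not real
# model pricing; substitute your provider's current rates.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}  # USD per 1K tokens (hypothetical)

def interaction_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one interaction: tokens x model pricing."""
    return (input_tokens / 1000) * PRICE_PER_1K["input"] \
         + (output_tokens / 1000) * PRICE_PER_1K["output"]

# A prompt that quietly grows by 3,000 input tokens:
base = interaction_cost(1_000, 500)
bloated = interaction_cost(4_000, 500)
monthly_delta = (bloated - base) * 10_000  # at 10K interactions/month

print(f"base=${base:.4f} bloated=${bloated:.4f} extra/month=${monthly_delta:.2f}")
```

At these assumed rates the bloated prompt nearly doubles per-interaction cost, and the delta compounds across every request you serve.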

3. Error Rate (Typed)

Not just “errors.” Categorized errors.

  • Hard errors: API failures, timeouts, rate limits. Your infrastructure is broken.
  • Soft errors: Agent couldn’t complete the task, escalated to a human, or returned “I don’t know.” Your agent is broken.
  • Silent errors: Agent returned a confident but wrong answer. Your trust is broken.

Silent errors are the most dangerous. The only way to catch them is automated evaluation or human review sampling. Arize’s 2025 production ML report found that silent failures account for 60% of all production AI quality issues (source).
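The three-way taxonomy above can be encoded directly in your logging pipeline. This is a hedged sketch: the field names (`exception`, `escalated`, `declined`, `judge_score`) are assumptions about what your trace records contain, and the 0.5 judge threshold is arbitrary.

```python
# Typed error classification sketch. Field names and thresholds are
# illustrative assumptions, not a standard schema.
from enum import Enum

class ErrorType(Enum):
    NONE = "none"
    HARD = "hard"      # infrastructure: timeouts, rate limits, 5xx
    SOFT = "soft"      # agent gave up, escalated, or said "I don't know"
    SILENT = "silent"  # confident answer that the evaluator scored wrong

def classify(record: dict) -> ErrorType:
    if record.get("exception"):                  # API failure, timeout, rate limit
        return ErrorType.HARD
    if record.get("escalated") or record.get("declined"):
        return ErrorType.SOFT
    if record.get("judge_score", 1.0) < 0.5:     # automated eval flagged the answer
        return ErrorType.SILENT
    return ErrorType.NONE

print(classify({"exception": "TimeoutError"}).value)  # hard
print(classify({"escalated": True}).value)            # soft
print(classify({"judge_score": 0.2}).value)           # silent
```

Note the ordering: silent errors can only be detected at all because a `judge_score` exists, which is why the next section matters.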

4. Output Quality Score

You need a way to score every response. Options:

  • LLM-as-judge: Use a separate model to evaluate outputs against criteria. Fast, scalable, 80-85% correlation with human judgment (source).
  • Human evaluation sampling: Review 5-10% of interactions manually. Gold standard but doesn’t scale.
  • Task completion rate: Did the agent accomplish what the user asked? Binary but useful.
  • User feedback signals: Thumbs up/down, escalation rate, repeat queries on the same topic.

The best systems combine all four. Automated scoring catches trends. Human review catches what automation misses. Task completion measures outcomes. User feedback measures satisfaction.

5. Cost Per Successful Interaction

Not cost per request. Cost per SUCCESSFUL request.

If your agent costs $0.03 per interaction but only succeeds 70% of the time, your effective cost is $0.043 per success. That 30% failure rate doesn’t just hurt quality — it hurts economics.

Formula: Total LLM spend / Number of successfully completed tasks

Track this weekly. If it’s trending up, something is degrading.
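The formula above, as code. This reproduces the worked example: a $0.03-per-interaction agent at a 70% success rate.

```python
# Cost per SUCCESSFUL interaction: total spend / completed tasks, not requests.
def cost_per_success(total_spend: float, total_requests: int,
                     success_rate: float) -> float:
    successes = total_requests * success_rate
    return total_spend / successes

# 1,000 requests at $0.03 each, 70% of which succeed:
effective = cost_per_success(total_spend=0.03 * 1_000,
                             total_requests=1_000,
                             success_rate=0.70)
print(f"${effective:.3f} per successful interaction")  # -> $0.043
```

Divide by successes, not requests, and the 30% failure rate shows up in the unit economics where it belongs.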

The Tool Landscape (Honest Assessments)

LangSmith

What it is: LangChain’s observability platform. Tracing, evaluation, dataset management, and prompt playground.

Best for: Teams already using LangChain/LangGraph. Deep integration with the LangChain ecosystem.

Strengths:

  • Best-in-class trace visualization for multi-step agent workflows
  • Built-in evaluation framework with custom evaluators
  • Dataset management for regression testing
  • Prompt versioning and A/B testing

Weaknesses:

  • Tightly coupled to LangChain. If you’re using raw API calls or another framework, the integration is clunkier.
  • Pricing scales with trace volume. Gets expensive at high throughput ($39/seat plus usage-based tracing fees) (source).
  • Self-hosted option requires significant infrastructure.

Verdict: The default choice if you’re in the LangChain ecosystem. Worth the investment for teams running complex multi-step agents.

Langfuse

What it is: Open-source LLM observability platform. Tracing, scoring, prompt management.

Best for: Teams that want full control. Self-hosted option is genuinely good.

Strengths:

  • Open source with a solid self-hosted deployment path
  • Framework agnostic — works with LangChain, LlamaIndex, raw OpenAI, Anthropic, whatever
  • Clean tracing UI
  • Built-in evaluation scoring
  • Generous free tier for the cloud version

Weaknesses:

  • Smaller team than LangSmith. Feature development is slower.
  • Enterprise features (SSO, advanced RBAC) only on paid tiers
  • Less mature evaluation framework than LangSmith

Pricing: Free tier up to 50K observations/month. Pro at $59/month. Self-hosted is free forever (source).

Verdict: Best option for teams that want open-source, framework-agnostic observability without vendor lock-in. The self-hosted path is a real differentiator.

Helicone

What it is: LLM proxy and observability platform. Sits between your code and the LLM API.

Best for: Teams that want cost monitoring and don’t want to change their code.

Strengths:

  • One-line integration. Change your base URL, done.
  • Excellent cost dashboards. Best-in-class spend visibility.
  • Request caching to reduce redundant API calls (saves 10-30% for many use cases)
  • Rate limiting and key management

Weaknesses:

  • Proxy architecture adds ~50ms latency per request
  • Less sophisticated tracing than LangSmith/Langfuse for multi-step workflows
  • Evaluation features are basic compared to dedicated eval tools

Pricing: Free up to 100K requests/month. Growth at $80/month (source).

Verdict: Best cost monitoring in the space. Use it alongside a tracing tool, not instead of one.

Arize Phoenix

What it is: ML observability platform with strong LLM support. Drift detection, evaluation, and tracing.

Best for: Teams with ML engineering backgrounds who want statistical rigor.

Strengths:

  • Best drift detection for LLM applications
  • Strong embedding visualization (useful for RAG quality monitoring)
  • Open-source Phoenix library for local evaluation
  • Integrates with existing ML monitoring workflows

Weaknesses:

  • Steeper learning curve than LangSmith or Langfuse
  • Enterprise pricing is not transparent (sales call required)
  • More complex setup for simple use cases

Verdict: Best choice for teams that need serious drift detection and have ML engineering capacity. Overkill for simple chatbot monitoring.

Custom Dashboards

What it is: Build your own with Grafana, Prometheus, and custom logging.

Best for: Teams with strong DevOps and specific requirements.

Strengths:

  • Total control
  • No vendor lock-in
  • Can integrate with existing monitoring stack
  • No per-seat or per-request fees

Weaknesses:

  • 80-200 engineering hours to build something comparable to the above tools
  • You maintain it forever
  • You miss features you didn’t know you needed

Verdict: Build custom dashboards for metrics specific to your business. Don’t build a general-purpose LLM observability platform. That’s reinventing the wheel.

The Stack We Recommend

For most production agent deployments, here’s what works:

Tier 1 (Minimum viable):

  • Langfuse OR LangSmith for tracing and evaluation
  • Built-in cost tracking (most tracing tools include this)
  • PagerDuty or Opsgenie for alerting

Tier 2 (Production-grade):

  • LangSmith or Langfuse for tracing
  • Helicone for cost monitoring and caching
  • Custom Grafana dashboards for business metrics
  • Automated eval pipeline (LLM-as-judge + human review sampling)

Tier 3 (Enterprise-scale):

  • LangSmith for tracing and eval
  • Helicone for cost and caching
  • Arize for drift detection
  • Custom dashboards for executive reporting
  • Dedicated alerting with escalation paths

You don’t need Tier 3 on day one. Start with Tier 1. Move up when your agent portfolio grows.

Setting Up Alerting That Works

Most teams set up alerting wrong. They either alert on everything (alert fatigue) or nothing (blind).

Alert on these. Nothing else.

Critical (page someone):

  • Agent availability drops below 99%
  • Error rate exceeds 5% over 15-minute window
  • Latency p95 exceeds 3x baseline
  • Cost per hour exceeds 2x daily average

Warning (Slack notification):

  • Output quality score drops below threshold for 24 hours
  • Token usage per interaction increases 20%+ week-over-week
  • New error type appears that wasn’t in the last 30 days
  • Model provider announces deprecation of your active model

Weekly review (dashboard):

  • Quality trends
  • Cost trends
  • Usage patterns
  • Drift indicators

The critical alerts should fire rarely. If they’re firing daily, your thresholds are wrong or your agent has systemic issues that alerting won’t fix.
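The four critical rules above can be sketched as a simple evaluation function. In practice you would express these in Grafana, PagerDuty, or your tracing tool’s alerting layer; the field names here are assumptions about what your metrics window contains.

```python
# Critical alert rules sketch, mirroring the thresholds above.
# Field names are illustrative; real rules live in your alerting platform.
def critical_alerts(window: dict) -> list[str]:
    fired = []
    if window["availability"] < 0.99:
        fired.append("availability below 99%")
    if window["error_rate"] > 0.05:                          # 15-minute window
        fired.append("error rate above 5%")
    if window["p95_latency"] > 3 * window["baseline_p95"]:
        fired.append("p95 latency above 3x baseline")
    if window["cost_per_hour"] > 2 * window["avg_cost_per_hour"]:
        fired.append("hourly cost above 2x daily average")
    return fired

healthy = {"availability": 0.999, "error_rate": 0.01,
           "p95_latency": 2.8, "baseline_p95": 2.5,
           "cost_per_hour": 1.2, "avg_cost_per_hour": 1.0}
print(critical_alerts(healthy))  # [] -- critical alerts should fire rarely
```

Note that every rule compares against a baseline or budget, not an absolute number. Absolute thresholds rot as your traffic changes; relative ones survive.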

What TMA Monitors for Every Agent

Every agent we deploy gets the same monitoring baseline:

  • Real-time: Availability, latency percentiles, error rates, cost per interaction
  • Daily: Output quality scores (LLM-as-judge), task completion rates, token usage trends
  • Weekly: Cost optimization review, drift indicators, prompt performance regression
  • Monthly: Full quality audit with human evaluation sampling, security review, cost forecast

We use a combination of Langfuse (self-hosted in client infrastructure) and custom dashboards built on Grafana. Client data never leaves their environment. Monitoring data stays where the agent runs.

That’s not a premium add-on. That’s the baseline. Because an unmonitored agent is a liability, not an asset.

The Cost of Not Monitoring

Let me put a number on it.

An agent processing 10,000 interactions/month with a 5% silent error rate that goes undetected for 3 months:

  • 500 wrong answers per month x 3 months = 1,500 incorrect interactions
  • If each interaction represents a $50 customer touchpoint: $75,000 in potential customer impact
  • If 10% of those result in support escalations at $25 each: $3,750 in direct support costs
  • Reputation damage: incalculable

Compare that to $200-$500/month for a proper observability stack.

The math isn’t close.
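The back-of-envelope math above, made explicit. Every input is the article’s illustrative assumption, not measured data.

```python
# Cost-of-not-monitoring arithmetic, using the article's assumed figures.
interactions_per_month = 10_000
silent_error_rate = 0.05       # 5% confidently-wrong answers
months_undetected = 3
touchpoint_value = 50          # $50 per customer interaction (assumed)
escalation_rate = 0.10         # 10% of wrong answers escalate (assumed)
escalation_cost_each = 25      # $25 per support escalation (assumed)

wrong_answers = int(interactions_per_month * silent_error_rate * months_undetected)
customer_impact = wrong_answers * touchpoint_value
support_cost = wrong_answers * escalation_rate * escalation_cost_each

print(wrong_answers)     # 1500
print(customer_impact)   # 75000
print(support_cost)      # 3750.0
```

Set against a $200–$500/month observability stack, the undetected-failure scenario costs two orders of magnitude more.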

Getting Started

If your agent is in production right now with no observability:

  1. Today: Add Langfuse (free tier, one-line SDK integration). Start collecting traces.
  2. This week: Set up three critical alerts: availability, error rate, latency.
  3. This month: Build an automated eval pipeline. Even a simple LLM-as-judge that scores 100% of responses on basic criteria.
  4. Next month: Add cost monitoring. Set budgets. Review weekly.

Four steps. Four weeks. You’ll go from blind to informed. That’s the difference between running an agent and operating one.


Three Ways to Work With TMA

Need an agent built? We deploy production AI agents in your infrastructure. Working pilot. Real data. Measurable ROI. → Schedule Demo

Want to co-build a product? We’re not a dev agency. We’re co-builders. Shared cost. Shared upside. → Partner with Us

Want to join the Guild? Ship pilots, earn bounties, share profit. Community + equity + path to exit. → Become an AI Architect

Need this implemented?

We design and deploy enterprise AI agents in your environment with measurable ROI and production guardrails.

About the Author

Chase Dillingham

Founder & CEO, TrainMyAgent

Chase Dillingham builds AI agent platforms that deliver measurable ROI. Former enterprise architect with 15+ years deploying production systems.