AI Agent Observability: The Tools You Actually Need in Production

Most teams deploy AI agents and hope they work. Here are the observability tools, metrics, and alerting strategies you actually need to run agents in production.

Chase Dillingham

Founder & CEO, TrainMyAgent

11 min read · 15 sources cited

Your AI agent has been in production for three weeks. Everything looks fine.

Is it?

You don’t know. Because you’re not measuring anything. You shipped it, checked it once, and moved on to the next project. Meanwhile, latency is creeping up, your token costs doubled because someone’s prompt grew by 3,000 tokens, and 12% of responses are confidently wrong.

That’s not a hypothetical. That’s the default state of most production AI agents.

Why Traditional Monitoring Doesn’t Cut It

You already have Datadog. Or New Relic. Or Grafana. Great. Those tools tell you if your server is up, if response times are normal, and if error rates are spiking.

They tell you nothing about whether your agent’s outputs are correct.

Traditional APM monitors infrastructure. AI agent observability monitors intelligence. Different problem. Different tools.

Here’s what traditional monitoring catches:

  • Server is down (HTTP 500)
  • API is slow (latency spike)
  • Request failed (timeout, rate limit)

Here’s what it misses:

  • Agent hallucinated a company policy that doesn’t exist
  • Output quality dropped 15% after a model update
  • Token usage per request grew 40% over two months
  • The agent is routing 30% of queries to the wrong tool
  • Cost per successful interaction doubled

Google’s SRE team calls this the difference between “is it working?” and “is it working well?” (source). For AI agents, “working” isn’t binary. It’s a spectrum, and you need to know where on that spectrum you are at all times.

The Five Metrics That Actually Matter

Before picking tools, know what to measure. Everything else is dashboard decoration.

1. Latency (p50, p95, p99)

Not just “average response time.” You need percentile breakdowns.

Why: A p50 of 2 seconds and a p99 of 45 seconds means 1 in 100 users waits 45 seconds. That’s unacceptable for most use cases, but the average looks fine.

Target benchmarks:

  • Conversational agents: p95 < 3 seconds
  • Document processing: p95 < 10 seconds
  • Multi-step workflows: p95 < 30 seconds
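To see why averages lie, here is a minimal percentile sketch in pure Python. The nearest-rank method and the sample latencies are illustrative; in production these numbers come from your tracing tool, not an in-memory list.

```python
# Nearest-rank percentile sketch. Pure Python, no dependencies.
def percentile(samples: list[float], p: float) -> float:
    """Return the value at the p-th percentile (nearest-rank), p in [0, 100]."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]

# Nine fast responses and one 45-second outlier.
latencies_s = [2.1, 1.8, 2.4, 2.0, 45.0, 1.9, 2.2, 2.3, 1.7, 2.5]

p50 = percentile(latencies_s, 50)
p95 = percentile(latencies_s, 95)
p99 = percentile(latencies_s, 99)
mean = sum(latencies_s) / len(latencies_s)

# The mean hides the tail; the p95/p99 expose it.
print(f"mean={mean:.1f}s p50={p50}s p95={p95}s p99={p99}s")
```

With these samples the p50 sits near 2 seconds while the p95 and p99 both land on the 45-second outlier, which is exactly the failure mode the “average looks fine” trap hides.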

2. Token Usage Per Interaction

This is your cost proxy. Track it per agent, per tool call, and per conversation turn.

A 2025 analysis from Helicone across 500M+ LLM requests showed that token usage variance across “identical” tasks can be 3-5x depending on prompt design and model behavior (source). If you’re not tracking this, you’re blind to your single largest variable cost.

What to track:

  • Input tokens per request
  • Output tokens per request
  • Total tokens per conversation
  • Tokens per tool call
  • Cost per interaction (tokens x model pricing)
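The cost calculation itself is trivial; the value is in tracking it continuously. Here is a sketch using placeholder per-1K-token prices (not any real provider’s rates) that shows how a quietly growing prompt turns into real money at scale.

```python
# Token cost sketch. Prices are illustrative placeholders, not real
# model pricing; substitute your provider's current rates.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}  # USD per 1K tokens (hypothetical)

def interaction_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one interaction: tokens x model pricing."""
    return (input_tokens / 1000) * PRICE_PER_1K["input"] \
         + (output_tokens / 1000) * PRICE_PER_1K["output"]

# A prompt that quietly grows by 3,000 input tokens:
base = interaction_cost(1_000, 500)
bloated = interaction_cost(4_000, 500)
monthly_delta = (bloated - base) * 10_000  # at 10K interactions/month

print(f"base=${base:.4f} bloated=${bloated:.4f} extra/month=${monthly_delta:.2f}")
```

At these assumed rates the bloated prompt nearly doubles per-interaction cost, and the delta compounds across every request you serve.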

3. Error Rate (Typed)

Not just “errors.” Categorized errors.

  • Hard errors: API failures, timeouts, rate limits. Your infrastructure is broken.
  • Soft errors: Agent couldn’t complete the task, escalated to a human, or returned “I don’t know.” Your agent is broken.
  • Silent errors: Agent returned a confident but wrong answer. Your trust is broken.

Silent errors are the most dangerous. The only way to catch them is automated evaluation or human review sampling. Arize’s 2025 production ML report found that silent failures account for 60% of all production AI quality issues (source).
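The three-way taxonomy above can be encoded directly in your logging pipeline. This is a hedged sketch: the field names (`exception`, `escalated`, `declined`, `judge_score`) are assumptions about what your trace records contain, and the 0.5 judge threshold is arbitrary.

```python
# Typed error classification sketch. Field names and thresholds are
# illustrative assumptions, not a standard schema.
from enum import Enum

class ErrorType(Enum):
    NONE = "none"
    HARD = "hard"      # infrastructure: timeouts, rate limits, 5xx
    SOFT = "soft"      # agent gave up, escalated, or said "I don't know"
    SILENT = "silent"  # confident answer that the evaluator scored wrong

def classify(record: dict) -> ErrorType:
    if record.get("exception"):                  # API failure, timeout, rate limit
        return ErrorType.HARD
    if record.get("escalated") or record.get("declined"):
        return ErrorType.SOFT
    if record.get("judge_score", 1.0) < 0.5:     # automated eval flagged the answer
        return ErrorType.SILENT
    return ErrorType.NONE

print(classify({"exception": "TimeoutError"}).value)  # hard
print(classify({"escalated": True}).value)            # soft
print(classify({"judge_score": 0.2}).value)           # silent
```

Note the ordering: silent errors can only be detected at all because a `judge_score` exists, which is why the next section matters.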

4. Output Quality Score

You need a way to score every response. Options:

  • LLM-as-judge: Use a separate model to evaluate outputs against criteria. Fast, scalable, 80-85% correlation with human judgment (source).
  • Human evaluation sampling: Review 5-10% of interactions manually. Gold standard but doesn’t scale.
  • Task completion rate: Did the agent accomplish what the user asked? Binary but useful.
  • User feedback signals: Thumbs up/down, escalation rate, repeat queries on the same topic.

The best systems combine all four. Automated scoring catches trends. Human review catches what automation misses. Task completion measures outcomes. User feedback measures satisfaction.

5. Cost Per Successful Interaction

Not cost per request. Cost per SUCCESSFUL request.

If your agent costs $0.03 per interaction but only succeeds 70% of the time, your effective cost is $0.043 per success. That 30% failure rate doesn’t just hurt quality — it hurts economics.

Formula: Total LLM spend / Number of successfully completed tasks

Track this weekly. If it’s trending up, something is degrading.
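The formula above, as code. This reproduces the worked example: a $0.03-per-interaction agent at a 70% success rate.

```python
# Cost per SUCCESSFUL interaction: total spend / completed tasks, not requests.
def cost_per_success(total_spend: float, total_requests: int,
                     success_rate: float) -> float:
    successes = total_requests * success_rate
    return total_spend / successes

# 1,000 requests at $0.03 each, 70% of which succeed:
effective = cost_per_success(total_spend=0.03 * 1_000,
                             total_requests=1_000,
                             success_rate=0.70)
print(f"${effective:.3f} per successful interaction")  # -> $0.043
```

Divide by successes, not requests, and the 30% failure rate shows up in the unit economics where it belongs.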

The Tool Landscape (Honest Assessments)

LangSmith

What it is: LangChain’s observability platform. Tracing, evaluation, dataset management, and prompt playground.

Best for: Teams already using LangChain/LangGraph. Deep integration with the LangChain ecosystem.

Strengths:

  • Best-in-class trace visualization for multi-step agent workflows
  • Built-in evaluation framework with custom evaluators
  • Dataset management for regression testing
  • Prompt versioning and A/B testing

Weaknesses:

  • Tightly coupled to LangChain. If you’re using raw API calls or another framework, the integration is clunkier.
  • Pricing scales with trace volume. Gets expensive at high throughput ($39/seat plus usage-based tracing fees) (source).
  • Self-hosted option requires significant infrastructure.

Verdict: The default choice if you’re in the LangChain ecosystem. Worth the investment for teams running complex multi-step agents.

Langfuse

What it is: Open-source LLM observability platform. Tracing, scoring, prompt management.

Best for: Teams that want full control. Self-hosted option is genuinely good.

Strengths:

  • Open source with a solid self-hosted deployment path
  • Framework agnostic — works with LangChain, LlamaIndex, raw OpenAI, Anthropic, whatever
  • Clean tracing UI
  • Built-in evaluation scoring
  • Generous free tier for the cloud version

Weaknesses:

  • Smaller team than LangSmith. Feature development is slower.
  • Enterprise features (SSO, advanced RBAC) only on paid tiers
  • Less mature evaluation framework than LangSmith

Pricing: Free tier up to 50K observations/month. Pro at $59/month. Self-hosted is free forever (source).

Verdict: Best option for teams that want open-source, framework-agnostic observability without vendor lock-in. The self-hosted path is a real differentiator.

Helicone

What it is: LLM proxy and observability platform. Sits between your code and the LLM API.

Best for: Teams that want cost monitoring and don’t want to change their code.

Strengths:

  • One-line integration. Change your base URL, done.
  • Excellent cost dashboards. Best-in-class spend visibility.
  • Request caching to reduce redundant API calls (saves 10-30% for many use cases)
  • Rate limiting and key management

Weaknesses:

  • Proxy architecture adds ~50ms latency per request
  • Less sophisticated tracing than LangSmith/Langfuse for multi-step workflows
  • Evaluation features are basic compared to dedicated eval tools

Pricing: Free up to 100K requests/month. Growth at $80/month (source).

Verdict: Best cost monitoring in the space. Use it alongside a tracing tool, not instead of one.

Arize Phoenix

What it is: ML observability platform with strong LLM support. Drift detection, evaluation, and tracing.

Best for: Teams with ML engineering backgrounds who want statistical rigor.

Strengths:

  • Best drift detection for LLM applications
  • Strong embedding visualization (useful for RAG quality monitoring)
  • Open-source Phoenix library for local evaluation
  • Integrates with existing ML monitoring workflows

Weaknesses:

  • Steeper learning curve than LangSmith or Langfuse
  • Enterprise pricing is not transparent (sales call required)
  • More complex setup for simple use cases

Verdict: Best choice for teams that need serious drift detection and have ML engineering capacity. Overkill for simple chatbot monitoring.

Custom Dashboards

What it is: Build your own with Grafana, Prometheus, and custom logging.

Best for: Teams with strong DevOps and specific requirements.

Strengths:

  • Total control
  • No vendor lock-in
  • Can integrate with existing monitoring stack
  • No per-seat or per-request fees

Weaknesses:

  • 80-200 engineering hours to build something comparable to the above tools
  • You maintain it forever
  • You miss features you didn’t know you needed

Verdict: Build custom dashboards for metrics specific to your business. Don’t build a general-purpose LLM observability platform. That’s reinventing the wheel.

The Stack We Recommend

For most production agent deployments, here’s what works:

Tier 1 (Minimum viable):

  • Langfuse OR LangSmith for tracing and evaluation
  • Built-in cost tracking (most tracing tools include this)
  • PagerDuty or Opsgenie for alerting

Tier 2 (Production-grade):

  • LangSmith or Langfuse for tracing
  • Helicone for cost monitoring and caching
  • Custom Grafana dashboards for business metrics
  • Automated eval pipeline (LLM-as-judge + human review sampling)

Tier 3 (Enterprise-scale):

  • LangSmith for tracing and eval
  • Helicone for cost and caching
  • Arize for drift detection
  • Custom dashboards for executive reporting
  • Dedicated alerting with escalation paths

You don’t need Tier 3 on day one. Start with Tier 1. Move up when your agent portfolio grows.

Setting Up Alerting That Works

Most teams set up alerting wrong. They either alert on everything (alert fatigue) or nothing (blind).

Alert on these. Nothing else.

Critical (page someone):

  • Agent availability drops below 99%
  • Error rate exceeds 5% over 15-minute window
  • Latency p95 exceeds 3x baseline
  • Cost per hour exceeds 2x daily average

Warning (Slack notification):

  • Output quality score drops below threshold for 24 hours
  • Token usage per interaction increases 20%+ week-over-week
  • New error type appears that wasn’t in the last 30 days
  • Model provider announces deprecation of your active model

Weekly review (dashboard):

  • Quality trends
  • Cost trends
  • Usage patterns
  • Drift indicators

The critical alerts should fire rarely. If they’re firing daily, your thresholds are wrong or your agent has systemic issues that alerting won’t fix.
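The four critical rules above can be sketched as a simple evaluation function. In practice you would express these in Grafana, PagerDuty, or your tracing tool’s alerting layer; the field names here are assumptions about what your metrics window contains.

```python
# Critical alert rules sketch, mirroring the thresholds above.
# Field names are illustrative; real rules live in your alerting platform.
def critical_alerts(window: dict) -> list[str]:
    fired = []
    if window["availability"] < 0.99:
        fired.append("availability below 99%")
    if window["error_rate"] > 0.05:                          # 15-minute window
        fired.append("error rate above 5%")
    if window["p95_latency"] > 3 * window["baseline_p95"]:
        fired.append("p95 latency above 3x baseline")
    if window["cost_per_hour"] > 2 * window["avg_cost_per_hour"]:
        fired.append("hourly cost above 2x daily average")
    return fired

healthy = {"availability": 0.999, "error_rate": 0.01,
           "p95_latency": 2.8, "baseline_p95": 2.5,
           "cost_per_hour": 1.2, "avg_cost_per_hour": 1.0}
print(critical_alerts(healthy))  # [] -- critical alerts should fire rarely
```

Note that every rule compares against a baseline or budget, not an absolute number. Absolute thresholds rot as your traffic changes; relative ones survive.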

What TMA Monitors for Every Agent

Every agent we deploy gets the same monitoring baseline:

  • Real-time: Availability, latency percentiles, error rates, cost per interaction
  • Daily: Output quality scores (LLM-as-judge), task completion rates, token usage trends
  • Weekly: Cost optimization review, drift indicators, prompt performance regression
  • Monthly: Full quality audit with human evaluation sampling, security review, cost forecast

We use a combination of Langfuse (self-hosted in client infrastructure) and custom dashboards built on Grafana. Client data never leaves their environment. Monitoring data stays where the agent runs.

That’s not a premium add-on. That’s the baseline. Because an unmonitored agent is a liability, not an asset.

The Cost of Not Monitoring

Let me put a number on it.

An agent processing 10,000 interactions/month with a 5% silent error rate that goes undetected for 3 months:

  • 500 wrong answers per month x 3 months = 1,500 incorrect interactions
  • If each interaction represents a $50 customer touchpoint: $75,000 in potential customer impact
  • If 10% of those result in support escalations at $25 each: $3,750 in direct support costs
  • Reputation damage: incalculable

Compare that to $200-$500/month for a proper observability stack.

The math isn’t close.
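The back-of-envelope math above, made explicit. Every input is the article’s illustrative assumption, not measured data.

```python
# Cost-of-not-monitoring arithmetic, using the article's assumed figures.
interactions_per_month = 10_000
silent_error_rate = 0.05       # 5% confidently-wrong answers
months_undetected = 3
touchpoint_value = 50          # $50 per customer interaction (assumed)
escalation_rate = 0.10         # 10% of wrong answers escalate (assumed)
escalation_cost_each = 25      # $25 per support escalation (assumed)

wrong_answers = int(interactions_per_month * silent_error_rate * months_undetected)
customer_impact = wrong_answers * touchpoint_value
support_cost = wrong_answers * escalation_rate * escalation_cost_each

print(wrong_answers)     # 1500
print(customer_impact)   # 75000
print(support_cost)      # 3750.0
```

Set against a $200–$500/month observability stack, the undetected-failure scenario costs two orders of magnitude more.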

Getting Started

If your agent is in production right now with no observability:

  1. Today: Add Langfuse (free tier, one-line SDK integration). Start collecting traces.
  2. This week: Set up three critical alerts: availability, error rate, latency.
  3. This month: Build an automated eval pipeline. Even a simple LLM-as-judge that scores 100% of responses on basic criteria.
  4. Next month: Add cost monitoring. Set budgets. Review weekly.

Four steps. Four weeks. You’ll go from blind to informed. That’s the difference between running an agent and operating one.


Three Ways to Work With TMA

Need an agent built? We deploy production AI agents in your infrastructure. Working pilot. Real data. Measurable ROI. → Schedule Demo

Want to co-build a product? We’re not a dev agency. We’re co-builders. Shared cost. Shared upside. → Partner with Us

Want to join the Guild? Ship pilots, earn bounties, share profit. Community + equity + path to exit. → Become an AI Architect

Need this implemented?

We design and deploy enterprise AI agents in your environment with measurable ROI and production guardrails.

About the Author

Chase Dillingham

Founder & CEO, TrainMyAgent

Chase Dillingham builds AI agent platforms that deliver measurable ROI. Former enterprise architect with 15+ years deploying production systems.