Trace
Real-time safety for knowledge agents
Three endpoints. One pipeline. Score groundedness, redact PII, and compress context — then turn the results into autonomous decisions and a continuous improvement loop.
Groundedness
/v1/grounding
Privacy
/v1/redact
Compression
/v1/compress
Install and authenticate
Install the SDK, grab an API key from the portal, and export it.
Every example below reads the key from the environment automatically.
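Before running the examples, it can help to confirm that the key is actually visible to the Python process. The check below is a minimal sketch that does not call the SDK; it assumes only the variable name `LATENCE_TRACE_API_KEY` mentioned in this guide, and the helper `api_key_is_configured` is a hypothetical name introduced for illustration.

```python
import os

# Variable name taken from this guide; everything else here is illustrative.
ENV_VAR = "LATENCE_TRACE_API_KEY"


def api_key_is_configured(environ=None) -> bool:
    """Return True when the key variable is present and not blank."""
    env = os.environ if environ is None else environ
    return bool(env.get(ENV_VAR, "").strip())


if __name__ == "__main__":
    if api_key_is_configured():
        print(f"{ENV_VAR} is set")
    else:
        print(f"{ENV_VAR} is missing; export it before running the examples")
```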
Score, redact, compress
Three stateless endpoints — call one, two, or all three. Each returns structured data you can act on immediately.
from latence import Latence
trace = Latence()  # reads LATENCE_TRACE_API_KEY from env
# --- 1. Score groundedness ---
ground = trace.grounding.rag(
    response_text="Paris is the capital of France.",
    raw_context="France's capital city is Paris.",
)
print(ground.score, ground.band)  # 0.92, "green"
print(ground.context_unused_ratio)  # 0.0: no wasted context
# --- 2. Redact PII ---
privacy = trace.privacy.redact(text="Email me at john@acme.com")
print(privacy.redacted_text)  # "Email me at [EMAIL]"
print(privacy.entity_count)  # 1
# --- 3. Compress context ---
compressed = trace.compression.text(
    text="Our refund policy allows returns within 30 days of purchase. "
         "All items must be in original condition with receipt.",
)
print(compressed.compressed_text)  # shorter, same meaning
print(compressed.tokens_saved)  # 9
Groundedness scoring
https://api.latence.ai/v1/grounding
Send a response and its retrieval context. Trace scores every claim against the evidence, assigns a risk band, and tells you exactly how much context went unused.
Key response fields
| Field | Type | Description |
|---|---|---|
| score | float | Groundedness score 0-1. Higher is better. |
| band | string | Risk band: green (> 0.7), amber (0.4-0.7), or red (< 0.4). |
| context_coverage_ratio | float | Fraction of the response grounded in context. |
| context_unused_ratio | float | Fraction of context left unused; the retrieval-quality signal. |
| context_uncertain_ratio | float | Fraction with uncertain grounding. |
| nli_aggregate | float | Aggregate NLI entailment score. |
| support_units | list | Per-unit verdicts: source_id, usage_state, coverage_score. |
| runtime_decision | object or null | Allow / review / block decision when runtime decisioning is enabled. |
| latency_ms | float | Pod-side scoring latency in milliseconds. |
| request_id | string | Unique request ID for correlation and logging. |
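The band thresholds in the table can be mirrored locally, for example to re-bucket logged scores offline. This is a sketch under stated assumptions, not the service's implementation: `band_for_score` is a hypothetical helper, and treating exactly 0.4 and 0.7 as amber is one reading of the inclusive "0.4-0.7" range.

```python
def band_for_score(score: float) -> str:
    """Map a groundedness score in [0, 1] to the documented risk bands.

    Assumption: the boundary values 0.4 and 0.7 fall in the amber band.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score > 0.7:
        return "green"
    if score >= 0.4:
        return "amber"
    return "red"


if __name__ == "__main__":
    for s in (0.92, 0.55, 0.12):
        print(s, band_for_score(s))
```

When the API response is available, prefer the returned `band` field; a local mapping like this is only useful for offline analysis of stored scores.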
Structured premises with SupportUnit
Pass per-unit provenance (source_id, speaker) and the scorer propagates it back onto every verdict — so you know exactly which document contributed.
from latence import Latence, SupportUnit
trace = Latence()
units = [
    SupportUnit(text="Paris is the capital of France.", source_id="doc-42"),
    SupportUnit(text="It sits on the Seine.", source_id="doc-42"),
    {"text": "Population: 2.1M.", "source_id": "wiki"},
]
r = trace.grounding.rag(
    response_text="Paris, France's capital, sits on the Seine.",
    support_units=units,
)
for u in (r.support_units or []):
    print(u.source_id, u.usage_state, u.coverage_score)
# doc-42 used 0.95
# doc-42 used 0.88
# wiki unused 0.12
Privacy and PII redaction
https://api.latence.ai/v1/redact
Detect and redact over 50 GDPR-defined entity types (emails, phone numbers, addresses, medical records, financial data) before the response reaches the user or your logs. No PII is ever stored by the gateway.
Key response fields
| Field | Type | Description |
|---|---|---|
| redacted_text | string | Input text with PII replaced by type labels like [EMAIL], [PHONE]. |
| entities | list | Each detected entity: label, text, start, end, score, redacted_value. |
| entity_count | int | Total entities detected and redacted. |
| unique_labels | list[str] | Distinct entity types found (email, phone, name, ...). |
| label_mode | string | Detection mode used: open (all types) or category (selected). |
| processing_time_ms | float | Server-side processing latency. |
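To make the relationship between `entities` and `redacted_text` concrete, the sketch below rebuilds a redacted string from entity spans. It illustrates the idea only and is not how the endpoint works internally. It assumes `start` and `end` are character offsets with `end` exclusive, that spans do not overlap, and that the placeholder is the upper-cased label in brackets; `apply_redactions` is a hypothetical helper operating on plain dicts.

```python
def apply_redactions(text: str, entities: list[dict]) -> str:
    """Replace each entity span with a bracketed type label.

    Assumptions: offsets are end-exclusive and spans do not overlap.
    """
    redacted = text
    # Work from the last span to the first so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['label'].upper()}]"
        redacted = redacted[: ent["start"]] + placeholder + redacted[ent["end"]:]
    return redacted


if __name__ == "__main__":
    sample = "Email me at john@acme.com"
    start = sample.index("john@acme.com")
    spans = [{"label": "email", "start": start, "end": start + len("john@acme.com")}]
    print(apply_redactions(sample, spans))
```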
Context compression
https://api.latence.ai/v1/compress
Reduce retrieved context by up to 60% without losing answer quality. Lower token cost, lower latency, same results. Critical terms like numbers, dates, and proper nouns are force-preserved.
Key response fields
| Field | Type | Description |
|---|---|---|
| compressed_text | string | Semantically equivalent text with reduced token count. |
| original_tokens | int | Token count before compression. |
| compressed_tokens | int | Token count after compression. |
| tokens_saved | int | Tokens removed. |
| compression_ratio | float | compressed_tokens / original_tokens. |
| compression_percentage | float | Percentage of tokens removed. |
| preserved_terms | list[str] | Critical terms that were force-preserved. |
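The three numeric fields are related by simple arithmetic, and the ratio and the percentage point in opposite directions: a lower `compression_ratio` means more was removed. The sketch below derives them from the two token counts using the definitions in the table; `compression_stats` is a hypothetical helper, not an SDK function.

```python
def compression_stats(original_tokens: int, compressed_tokens: int) -> dict:
    """Derive tokens_saved, compression_ratio and compression_percentage."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    if not 0 <= compressed_tokens <= original_tokens:
        raise ValueError("compressed_tokens must be between 0 and original_tokens")
    saved = original_tokens - compressed_tokens
    return {
        "tokens_saved": saved,
        # Fraction of tokens that remain after compression.
        "compression_ratio": compressed_tokens / original_tokens,
        # Percentage of tokens that were removed.
        "compression_percentage": 100 * saved / original_tokens,
    }


if __name__ == "__main__":
    # 100 tokens compressed to 40: ratio 0.4, 60% removed.
    print(compression_stats(100, 40))
```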
Run all three in parallel
Use AsyncLatence to fire groundedness, privacy, and compression concurrently. Total latency = slowest call, not the sum.
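The latency claim follows from how `asyncio.gather` schedules coroutines, and it can be checked without the SDK. The sketch below uses `asyncio.sleep` as a stand-in for the three network calls; the delays and the names `fake_call` and `run_concurrently` are made up for illustration.

```python
import asyncio
import time


async def fake_call(name: str, delay_s: float) -> str:
    """Stand-in for a network call that takes delay_s seconds."""
    await asyncio.sleep(delay_s)
    return name


async def run_concurrently(delays: dict[str, float]) -> tuple[list[str], float]:
    """Run all stand-in calls at once and report the wall-clock time."""
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_call(n, d) for n, d in delays.items()))
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    delays = {"grounding": 0.15, "redact": 0.05, "compress": 0.10}
    results, elapsed = asyncio.run(run_concurrently(delays))
    print(results)
    print(f"elapsed {elapsed:.2f}s, sum of delays {sum(delays.values()):.2f}s")
```

Elapsed time should track the longest delay rather than the sum, and `gather` returns results in the order the coroutines were passed, which is why the SDK example can unpack them positionally.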
import asyncio
from latence import AsyncLatence
async def trace_pipeline(answer: str, context: str):
    async with AsyncLatence() as trace:
        ground, privacy, compressed = await asyncio.gather(
            trace.grounding.rag(
                response_text=answer,
                raw_context=context,
            ),
            trace.privacy.redact(text=answer),
            trace.compression.text(text=context),
        )
    return ground, privacy, compressed
ground, privacy, compressed = asyncio.run(
    trace_pipeline(
        answer="The patient's refund was processed on 2024-01-15.",
        context="Refund policy: 30-day window. Patient John Doe, case #4412.",
    )
)
print(f"Grounded: {ground.band} ({ground.score:.2f})")
print(f"Redacted: {privacy.redacted_text}")
print(f"Compressed: {compressed.tokens_saved} tokens saved")
Turn results into autonomous decisions
The response fields are routing rules, not diagnostics. Use the band to decide what your agent does next — no human in the loop required.
| Band | Score range | Action | What your agent does |
|---|---|---|---|
| green | > 0.70 | Allow | Serve the answer. Redact PII first. |
| amber | 0.40 – 0.70 | Review | Flag for human review, or retry with better context. |
| red | < 0.40 | Block | Return a safe fallback. Log the failure for analysis. |
from latence import Latence
trace = Latence()
def safe_answer(query: str, context: str, llm_answer: str) -> dict:
    """Score, redact, decide: all before the user sees anything."""
    result = trace.grounding.rag(
        response_text=llm_answer,
        raw_context=context,
    )
    # Autonomous decision based on band
    if result.band == "red":
        return {
            "action": "block",
            "reason": f"Groundedness {result.score:.2f}: answer not supported by context",
            "fallback": "I don't have enough information to answer that reliably.",
        }
    if result.band == "amber":
        return {
            "action": "review",
            "reason": f"Groundedness {result.score:.2f}: partially supported",
            "answer": llm_answer,
            "flag": "human_review_required",
        }
    # Green band: safe to serve, redact PII first
    clean = trace.privacy.redact(text=llm_answer)
    return {
        "action": "allow",
        "answer": clean.redacted_text,
        "score": result.score,
        "context_used": f"{(1 - result.context_unused_ratio) * 100:.0f}%",
        "entities_redacted": clean.entity_count,
    }
Signals → next steps
context_coverage_ratio low → the answer isn't grounded. Upgrade data quality upstream.
context_unused_ratio > 30% → your retriever ships the wrong chunks. Upgrade retrieval.
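The two signals above can be turned into a small rule function. This is a sketch under stated assumptions: the 30% unused threshold comes from the text, but the text does not define "low" coverage, so the `min_coverage` default of 0.5 is a placeholder to tune. `next_steps` is a hypothetical helper.

```python
def next_steps(
    context_coverage_ratio: float,
    context_unused_ratio: float,
    min_coverage: float = 0.5,  # assumption: "low" is not defined in the guide
    max_unused: float = 0.30,   # threshold stated in the guide
) -> list[str]:
    """Return the recommended follow-ups for one grounding result."""
    steps = []
    if context_coverage_ratio < min_coverage:
        steps.append("upgrade data quality upstream")
    if context_unused_ratio > max_unused:
        steps.append("upgrade retrieval")
    return steps


if __name__ == "__main__":
    print(next_steps(context_coverage_ratio=0.35, context_unused_ratio=0.45))
```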
Every call is automatically logged
You don't need to build a logging pipeline. The gateway records every request — score, band, latency, context ratios, cost, and entity counts — to your Trace dashboard automatically. Just add a request_id to correlate with your own systems.
from latence import Latence
trace = Latence()
def traced_rag_call(query, context, answer):
    result = trace.grounding.rag(
        response_text=answer,
        raw_context=context,
        request_id=f"prod-{query[:20]}",  # your correlation ID
    )
    # Every call is automatically logged to your Trace dashboard.
    # The gateway records: score, band, latency, context ratios,
    # cost, all queryable in the Insights page.
    print(f"[Trace] {result.band} | score={result.score:.3f} "
          f"| unused={result.context_unused_ratio:.0%} "
          f"| latency={result.latency_ms:.0f}ms")
    return result
Analyze and improve
Open the Trace Dashboard to see your production traffic in real time. Use the insights to close the loop and improve your system.
Groundedness p50/p95
Track score distribution over time. Catch regressions before users notice.
Red band rate
The percentage of answers that failed grounding. Your north-star safety metric.
Context unused ratio
How much retrieved context was dead weight. Signals retriever quality.
Entities redacted
Total PII caught by the privacy endpoint. Proves compliance at audit time.
Tokens saved
Cumulative token savings from compression. Directly maps to cost reduction.
P95 latency
End-to-end scoring latency. Stays under 200ms for most payloads.
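If you also keep scores and bands in your own logs, the headline numbers can be reproduced locally. The sketch below uses the nearest-rank percentile method, which may differ from whatever the dashboard uses; `percentile` and `red_band_rate` are hypothetical helpers working on plain lists.

```python
def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


def red_band_rate(bands: list[str]) -> float:
    """Fraction of results whose band is "red"."""
    if not bands:
        raise ValueError("bands must not be empty")
    return bands.count("red") / len(bands)


if __name__ == "__main__":
    scores = [0.91, 0.35, 0.78, 0.66, 0.88]
    bands = ["green", "red", "green", "amber", "green"]
    print("p50:", percentile(scores, 50), "p95:", percentile(scores, 95))
    print("red band rate:", red_band_rate(bands))
```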
Keep going
Trace detected a data-quality bottleneck?
If context_coverage_ratio is low or bands trend amber/red, the fix is upstream: improve the retrieved context, document quality, and evidence coverage before the answer reaches users.