Building a Multi-Agent Log Analytics Platform From Scratch
How we designed a 4-agent event-driven pipeline that processes 10K+ daily logs with semantic error grouping, anomaly detection, and intelligent workflow routing.
Most log analytics tools do one thing: search. You type a keyword, you get matching lines. If you’re lucky, there’s a dashboard with some charts.
But what happens when you have 10,000 logs per day across multiple environments, and you need to know not just what errors are happening, but why they cluster, when they spike, and who should be notified - automatically?
That’s the system we built. Four autonomous agents, an event-driven pipeline, and a decision engine that routes each log to the right workflow in under 2 seconds. No company names here - just the architecture and the reasoning behind every decision.
The Problem With Traditional Log Analysis
Traditional log tools treat every log line the same way: ingest, index, search. The human does all the thinking.
But logs aren’t equal. A critical database failure in production needs immediate attention. A debug message from a test environment needs nothing. A slow API response that happens once is noise. The same slow response happening 50 times in an hour is a pattern.
The real work isn’t finding logs. It’s understanding what they mean - grouping similar errors, detecting anomalies, and routing alerts to the right people without drowning them in noise.
The Architecture: Four Agents, One Pipeline
Instead of one monolithic processor, the system uses four specialized agents connected by a Redis pub/sub event bus. Each agent does one job well and passes the result to the next.
When a log is ingested, it triggers log.ingested. The Triage agent picks it up, classifies it, and emits log.triaged. The Mapper enriches it and emits log.mapped. And so on. Each agent is independent - if one fails, the others continue processing.
Let me walk through each agent.
Agent 1: The Triage Agent
The Triage agent is the first responder. Its job: figure out what this log is before anyone else touches it.
What it detects:
- Severity - 50+ regex patterns covering critical, error, warning, info, debug. Handles syslog levels (RFC 3164), Java stacktraces, Python tracebacks, and .NET exceptions.
- Format - Is this JSON, XML, syslog, Apache combined, nginx, or key-value? 8 detection patterns.
- Timestamp - Parses 7 different formats including ISO 8601, epoch (seconds and milliseconds), RFC 3339, and Apache format.
- Priority - A 1-10 score. Base comes from severity (critical=10, error=8, warning=5, info=2, debug=1). Bonuses for security keywords (+2), database/performance indicators (+2), and production environment (+1).
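The priority score is simple enough to sketch directly. The flag names here are mine, and capping at 10 is an inference from the stated 1-10 range:

```python
SEVERITY_BASE = {"critical": 10, "error": 8, "warning": 5, "info": 2, "debug": 1}

def priority_score(severity: str,
                   has_security_keyword: bool = False,
                   has_db_or_perf_indicator: bool = False,
                   is_production: bool = False) -> int:
    """Base score from severity, plus additive bonuses, capped at 10."""
    score = SEVERITY_BASE.get(severity, 1)
    if has_security_keyword:
        score += 2
    if has_db_or_perf_indicator:
        score += 2
    if is_production:
        score += 1
    return min(score, 10)
```

So a production warning with a security keyword scores 5 + 2 + 1 = 8 - high enough to matter, without pretending it's a critical.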
Source extraction is hybrid: 6 regex patterns try to identify the source service first (covering .NET namespaces, bracketed service names, hostnames, Kubernetes pod names). If regex fails, it falls back to an LLM - but only after sanitizing the log through Presidio to strip PII before it touches the prompt.
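The regex-first, LLM-second shape looks roughly like this. The patterns below are an illustrative subset of the six, and the LLM fallback and sanitizer are injected as callables rather than modeled here:

```python
import re
from typing import Callable, Optional

# Illustrative subset: bracketed service names, .NET-style namespaces,
# and Kubernetes pod names.
SOURCE_PATTERNS = [
    re.compile(r"\[(?P<source>[A-Za-z][\w.-]+)\]"),                      # [auth-service]
    re.compile(r"(?P<source>[A-Z]\w+(?:\.[A-Z]\w+){2,})"),               # My.App.Service
    re.compile(r"(?P<source>[a-z0-9-]+)-[a-f0-9]{8,10}-[a-z0-9]{5}\b"),  # k8s pod
]

def extract_source(message: str,
                   llm_fallback: Optional[Callable[[str], str]] = None,
                   sanitize: Callable[[str], str] = lambda s: s) -> Optional[str]:
    """Cheap deterministic patterns first; the LLM only sees sanitized text."""
    for pattern in SOURCE_PATTERNS:
        match = pattern.search(message)
        if match:
            return match.group("source")
    if llm_fallback is not None:
        # In the real system, sanitize() is Presidio stripping PII
        # before the log ever reaches a prompt.
        return llm_fallback(sanitize(message))
    return None
```

The ordering matters for cost: regex handles the common cases for free, and the LLM is reserved for the long tail.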
Agent 2: The Mapper Agent
The Mapper takes the triaged log and enriches it with context. What project does this log belong to? What environment? What service?
This is where the system starts building a picture beyond individual log lines. The Mapper connects each log to its organizational context - project metadata, deployment environment, service ownership.
It also handles the decision about how this log should be processed.
The Decision Tree: Five Workflows
Not every log deserves the same treatment. A critical production failure needs the full pipeline. A debug message from a test environment needs almost nothing.
The decision tree routes each log to one of five workflows, ranging from the full analysis pipeline down to minimal processing. The rules are evaluated in priority order: critical severity always fast-tracks. Production errors with priority 8+ fast-track. Security log types route to the security workflow. And low-priority info/debug logs get the minimal treatment - no point running expensive LLM analysis on a debug print statement.
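As code, the priority-ordered rules collapse to a first-match-wins chain. Only three workflow names appear in this post, so "minimal" and "standard" below are placeholders for the rest:

```python
def route(severity: str, priority: int, log_type: str, environment: str) -> str:
    """Rules checked in priority order; the first match wins."""
    if severity == "critical":
        return "fast_track"
    if environment == "production" and priority >= 8:
        return "fast_track"
    if log_type == "security":
        return "security"
    if severity in ("info", "debug") and priority <= 2:
        return "minimal"  # placeholder name
    return "standard"     # placeholder name
```

Keeping the rules as a flat ordered chain rather than a nested tree makes the precedence auditable: to know why a log fast-tracked, you read top to bottom.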
Agent 3: The Analysis Agent
This is the brain of the system. Two core capabilities: semantic error grouping and anomaly detection.
Semantic Error Grouping
The same error shows up in different ways. Different timestamps, different user IDs, different stack trace depths - but fundamentally the same problem. Basic string matching misses these variations.
The Analysis agent uses vector embeddings to group errors by meaning:
- Generate an embedding for the log message
- Search the vector database for the top 10 most similar existing logs
- If any match exceeds 0.85 similarity - assign to that existing group
- If a match exceeds 0.97 similarity and belongs to the same project - force-assign (high confidence override)
- If no match - create a new error group
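The steps above can be sketched with cosine similarity over the batch-fetched candidates. Checking the force-assign override before the plain threshold is my reading of the rules; the embedding model and candidate shape are assumptions:

```python
import numpy as np

SIM_THRESHOLD = 0.85
FORCE_ASSIGN = 0.97

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_group(embedding: np.ndarray, candidates, project_id: str, top_k: int = 10):
    """candidates: (group_id, project_id, embedding) tuples fetched in ONE
    batch query - not one lookup per candidate. Returns (group_id, is_new)."""
    scored = sorted(((cosine(embedding, emb), gid, pid)
                     for gid, pid, emb in candidates), reverse=True)[:top_k]
    for sim, gid, pid in scored:
        if sim >= FORCE_ASSIGN and pid == project_id:
            return gid, False  # high-confidence override
    for sim, gid, pid in scored:
        if sim >= SIM_THRESHOLD:
            return gid, False  # ordinary semantic match
    return None, True          # no match: create a new error group
```

The two thresholds encode different trust levels: 0.85 says "probably the same problem", 0.97 plus a matching project says "definitely, skip any further checks".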
The vector search uses batch fetching for the top 10 candidates. Without this, each log would trigger 10 individual database lookups - a classic N+1 problem that would destroy performance at scale.
Anomaly Detection
The anomaly detector uses a four-factor additive scoring system:
- New error group detected: +30 points
- Spike above 3x hourly average: +40 points
- Critical severity: +20 points
- Production environment: +10 points
If the total score reaches 50 or above, it’s flagged as an anomaly. This means a new critical error in production (30 + 20 + 10 = 60) is always an anomaly. A spike of non-critical errors in staging (40 + 0 + 0 = 40) is not - it’s probably a deployment.
The system also tracks 24-hour trends. If the error rate grows by more than 20% when comparing the first and second halves of the day, it flags an increasing trend. This catches gradual degradation that spike detection would miss.
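Both checks fit in a few lines. The scoring weights and the 20% trend threshold are straight from the description above; the flag and function names are mine:

```python
ANOMALY_THRESHOLD = 50

def anomaly_score(is_new_group: bool, is_spike: bool,
                  is_critical: bool, is_production: bool) -> int:
    score = 0
    if is_new_group:      # never seen this error group before
        score += 30
    if is_spike:          # count above 3x the hourly average
        score += 40
    if is_critical:
        score += 20
    if is_production:
        score += 10
    return score

def is_anomaly(score: int) -> bool:
    return score >= ANOMALY_THRESHOLD

def increasing_trend(hourly_counts: list) -> bool:
    """Compare the first and second halves of the day; >20% growth flags a trend."""
    half = len(hourly_counts) // 2
    first, second = sum(hourly_counts[:half]), sum(hourly_counts[half:])
    return first > 0 and (second - first) / first > 0.20
```

Note how the weights encode policy: a spike alone (40) never pages anyone, but a spike plus any other factor crosses the 50-point line.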
Agent 4: The Notifier Agent
The final agent decides who gets told, how, and how often. This is where alert fatigue gets solved or created.
Three-tier routing:
- Tier 1 (Immediate email): Critical severity, critical DB/API errors, security events, anomaly score above 70, new error types, error spikes
- Tier 2 (Digest): High-priority warnings, batched into hourly or daily summaries
- Tier 3 (Silence): Low-priority info/debug - stored but never alerted
Rate limiting prevents alert storms, with all counters backed by Redis:
- Same exact alert: max 1 per 5 minutes
- Same error group: max 5 per hour
- Same severity level: max 20 per hour
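The production version uses Redis INCR with EXPIRE so counters are shared across workers and expire on their own; this is a stdlib fixed-window sketch of the same counting logic (scope names are mine):

```python
import time

# Limits from above: scope -> (window in seconds, max alerts per window)
LIMITS = {
    "exact_alert": (300, 1),     # same exact alert: 1 per 5 minutes
    "error_group": (3600, 5),    # same error group: 5 per hour
    "severity":    (3600, 20),   # same severity level: 20 per hour
}

_counters = {}  # in Redis: INCR on a key, EXPIRE set on first increment

def allow(scope, key, now=None):
    """Return True if an alert in this scope/key may be sent right now."""
    window, limit = LIMITS[scope]
    bucket = int((now if now is not None else time.time()) // window)
    counter_key = (scope, key, bucket)
    _counters[counter_key] = _counters.get(counter_key, 0) + 1
    return _counters[counter_key] <= limit
```

An alert only goes out if all three scopes allow it, which is what turns 500 identical errors into one email and a count.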
Without rate limiting, a cascading failure that produces 500 identical errors would send 500 emails. With it, you get one email and a count.
The Safety Layer
Every log passes through PII detection before touching an LLM. The system uses Presidio with spaCy NER to catch emails, phone numbers, credit card numbers, SSNs, names, and IP addresses. These are masked before the log is embedded in any prompt.
There’s also injection detection - logs can contain adversarial content designed to manipulate the LLM’s analysis. The system detects these patterns before processing.
Observability: Watching the Watchers
A log analytics system that you can’t monitor is ironic. The entire pipeline is instrumented with Langfuse - 24 trace points covering every agent execution, every LLM call, every embedding operation.
This serves two purposes: debugging (why did this log get classified wrong?) and cost tracking (which workflows are consuming the most LLM tokens?). When you’re processing 10K+ logs daily, even small per-log cost differences compound.
The Building Blocks
If you want to build something similar:
| Component | Role | Why |
|---|---|---|
| Event bus (Redis pub/sub) | Agent communication | Decouples agents, enables independent scaling |
| Vector database | Semantic error grouping | Groups errors by meaning, not string matching |
| LLM provider | Classification, source extraction | Handles edge cases regex can’t |
| PII detection | Pre-LLM sanitization | Logs contain sensitive data - always |
| Task queue | Background processing | Ingestion can’t block the dashboard |
| Observability platform | Tracing and cost tracking | You need to monitor the monitor |
The pattern is broadly applicable: any system processing high-volume events benefits from workflow-based routing, semantic grouping, and rate-limited notifications. The specific domain is logs, but the architecture works for support tickets, security events, or any streaming data where you need to separate signal from noise.
Questions about the architecture? Reach out via LinkedIn or email.