Building a Multi-Agent Log Analytics Platform From Scratch
How we designed a 4-agent event-driven pipeline that processes 10K+ daily logs with semantic error grouping, anomaly detection, and intelligent workflow routing.
Most log analytics tools do one thing: search. You type a keyword, you get matching lines. If you’re lucky, there’s a dashboard with some charts.
But what happens when you have 10,000 logs per day across multiple environments, and you need to know not just what errors are happening, but why they cluster, when they spike, and who should be notified - automatically?
That’s the system we built. Four autonomous agents, an event-driven pipeline, and a decision engine that routes each log to the right workflow in under 2 seconds. No company names here - just the architecture and the reasoning behind every decision.
The Problem With Traditional Log Analysis
Traditional log tools treat every log line the same way: ingest, index, search. The human does all the thinking.
But logs aren’t equal. A critical database failure in production needs immediate attention. A debug message from a test environment needs nothing. A slow API response that happens once is noise. The same slow response happening 50 times in an hour is a pattern.
The real work isn’t finding logs. It’s understanding what they mean - grouping similar errors, detecting anomalies, and routing alerts to the right people without drowning them in noise.
The Architecture: Four Agents, One Pipeline
Instead of one monolithic processor, the system uses four specialized agents connected by a Redis pub/sub event bus. Each agent does one job well and passes the result to the next.
When a log is ingested, it triggers log.ingested. The Triage agent picks it up, classifies it, and emits log.triaged. The Mapper enriches it and emits log.mapped. And so on. Each agent is independent - if one fails, the others continue processing.
Let me walk through each agent.
Agent 1: The Triage Agent
The Triage agent is the first responder. Its job: figure out what this log is before anyone else touches it.
What it detects:
- Severity - 50+ regex patterns covering critical, error, warning, info, debug. Handles syslog levels (RFC 3164), Java stacktraces, Python tracebacks, and .NET exceptions.
- Format - Is this JSON, XML, syslog, Apache combined, nginx, or key-value? 8 detection patterns.
- Timestamp - Parses 7 different formats including ISO 8601, epoch (seconds and milliseconds), RFC 3339, and Apache format.
- Priority - A 1-10 score. Base comes from severity (critical=10, error=8, warning=5, info=2, debug=1). Bonuses for security keywords (+2), database/performance indicators (+2), and production environment (+1).
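The priority score is simple enough to sketch directly. The flag names here are mine, and capping at 10 is an inference from the stated 1-10 range:

```python
SEVERITY_BASE = {"critical": 10, "error": 8, "warning": 5, "info": 2, "debug": 1}

def priority_score(severity: str,
                   has_security_keyword: bool = False,
                   has_db_or_perf_indicator: bool = False,
                   is_production: bool = False) -> int:
    """Base score from severity, plus additive bonuses, capped at 10."""
    score = SEVERITY_BASE.get(severity, 1)
    if has_security_keyword:
        score += 2
    if has_db_or_perf_indicator:
        score += 2
    if is_production:
        score += 1
    return min(score, 10)
```

So a production warning with a security keyword scores 5 + 2 + 1 = 8 - high enough to matter, without pretending it's a critical.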
Source extraction is hybrid: 6 regex patterns try to identify the source service first (covering .NET namespaces, bracketed service names, hostnames, Kubernetes pod names). If regex fails, it falls back to an LLM - but only after sanitizing the log through Presidio to strip PII before it touches the prompt.
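The regex-first, LLM-second shape looks roughly like this. The patterns below are an illustrative subset of the six, and the LLM fallback and sanitizer are injected as callables rather than modeled here:

```python
import re
from typing import Callable, Optional

# Illustrative subset: bracketed service names, .NET-style namespaces,
# and Kubernetes pod names.
SOURCE_PATTERNS = [
    re.compile(r"\[(?P<source>[A-Za-z][\w.-]+)\]"),                      # [auth-service]
    re.compile(r"(?P<source>[A-Z]\w+(?:\.[A-Z]\w+){2,})"),               # My.App.Service
    re.compile(r"(?P<source>[a-z0-9-]+)-[a-f0-9]{8,10}-[a-z0-9]{5}\b"),  # k8s pod
]

def extract_source(message: str,
                   llm_fallback: Optional[Callable[[str], str]] = None,
                   sanitize: Callable[[str], str] = lambda s: s) -> Optional[str]:
    """Cheap deterministic patterns first; the LLM only sees sanitized text."""
    for pattern in SOURCE_PATTERNS:
        match = pattern.search(message)
        if match:
            return match.group("source")
    if llm_fallback is not None:
        # In the real system, sanitize() is Presidio stripping PII
        # before the log ever reaches a prompt.
        return llm_fallback(sanitize(message))
    return None
```

The ordering matters for cost: regex handles the common cases for free, and the LLM is reserved for the long tail.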
Agent 2: The Mapper Agent
The Mapper takes the triaged log and enriches it with context. What project does this log belong to? What environment? What service?
This is where the system starts building a picture beyond individual log lines. The Mapper connects each log to its organizational context - project metadata, deployment environment, service ownership.
It also handles the decision about how this log should be processed.
The Decision Tree: Five Workflows
Not every log deserves the same treatment. A critical production failure needs the full pipeline. A debug message from a test environment needs almost nothing.
The decision tree routes each log to one of five workflows, ranging from the full analysis pipeline down to minimal processing. The rules are evaluated in priority order: critical severity always fast-tracks. Production errors with priority 8+ fast-track. Security log types route to the security workflow. And low-priority info/debug logs get the minimal treatment - no point running expensive LLM analysis on a debug print statement.
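As code, the priority-ordered rules collapse to a first-match-wins chain. Only three workflow names appear in this post, so "minimal" and "standard" below are placeholders for the rest:

```python
def route(severity: str, priority: int, log_type: str, environment: str) -> str:
    """Rules checked in priority order; the first match wins."""
    if severity == "critical":
        return "fast_track"
    if environment == "production" and priority >= 8:
        return "fast_track"
    if log_type == "security":
        return "security"
    if severity in ("info", "debug") and priority <= 2:
        return "minimal"  # placeholder name
    return "standard"     # placeholder name
```

Keeping the rules as a flat ordered chain rather than a nested tree makes the precedence auditable: to know why a log fast-tracked, you read top to bottom.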
Agent 3: The Analysis Agent
This is the brain of the system. Two core capabilities: semantic error grouping and anomaly detection.
Semantic Error Grouping
The same error shows up in different ways. Different timestamps, different user IDs, different stack trace depths - but fundamentally the same problem. Basic string matching misses these variations.
The Analysis agent uses vector embeddings to group errors by meaning:
- Generate an embedding for the log message
- Search the vector database for the top 10 most similar existing logs
- If any match exceeds 0.85 similarity - assign to that existing group
- If a match exceeds 0.97 similarity and belongs to the same project - force-assign (high confidence override)
- If no match - create a new error group
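The steps above can be sketched with cosine similarity over the batch-fetched candidates. Checking the force-assign override before the plain threshold is my reading of the rules; the embedding model and candidate shape are assumptions:

```python
import numpy as np

SIM_THRESHOLD = 0.85
FORCE_ASSIGN = 0.97

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_group(embedding: np.ndarray, candidates, project_id: str, top_k: int = 10):
    """candidates: (group_id, project_id, embedding) tuples fetched in ONE
    batch query - not one lookup per candidate. Returns (group_id, is_new)."""
    scored = sorted(((cosine(embedding, emb), gid, pid)
                     for gid, pid, emb in candidates), reverse=True)[:top_k]
    for sim, gid, pid in scored:
        if sim >= FORCE_ASSIGN and pid == project_id:
            return gid, False  # high-confidence override
    for sim, gid, pid in scored:
        if sim >= SIM_THRESHOLD:
            return gid, False  # ordinary semantic match
    return None, True          # no match: create a new error group
```

The two thresholds encode different trust levels: 0.85 says "probably the same problem", 0.97 plus a matching project says "definitely, skip any further checks".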
The vector search uses batch fetching for the top 10 candidates. Without this, each log would trigger 10 individual database lookups - a classic N+1 problem that would destroy performance at scale.
Anomaly Detection
The anomaly detector uses a four-factor additive scoring system:
- New error group detected: +30 points
- Spike above 3x hourly average: +40 points
- Critical severity: +20 points
- Production environment: +10 points
If the total score reaches 50 or above, it’s flagged as an anomaly. This means a new critical error in production (30 + 20 + 10 = 60) is always an anomaly. A spike of non-critical errors in staging (40 + 0 + 0 = 40) is not - it’s probably a deployment.
The system also tracks 24-hour trends. If the error rate grows by more than 20% when comparing the first and second halves of the day, it flags an increasing trend. This catches gradual degradation that spike detection would miss.
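Both checks fit in a few lines. The scoring weights and the 20% trend threshold are straight from the description above; the flag and function names are mine:

```python
ANOMALY_THRESHOLD = 50

def anomaly_score(is_new_group: bool, is_spike: bool,
                  is_critical: bool, is_production: bool) -> int:
    score = 0
    if is_new_group:      # never seen this error group before
        score += 30
    if is_spike:          # count above 3x the hourly average
        score += 40
    if is_critical:
        score += 20
    if is_production:
        score += 10
    return score

def is_anomaly(score: int) -> bool:
    return score >= ANOMALY_THRESHOLD

def increasing_trend(hourly_counts: list) -> bool:
    """Compare the first and second halves of the day; >20% growth flags a trend."""
    half = len(hourly_counts) // 2
    first, second = sum(hourly_counts[:half]), sum(hourly_counts[half:])
    return first > 0 and (second - first) / first > 0.20
```

Note how the weights encode policy: a spike alone (40) never pages anyone, but a spike plus any other factor crosses the 50-point line.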
Agent 4: The Notifier Agent
The final agent decides who gets told, how, and how often. This is where alert fatigue gets solved or created.
Three-tier routing:
- Tier 1 (Immediate email): Critical severity, critical DB/API errors, security events, anomaly score above 70, new error types, error spikes
- Tier 2 (Digest): High-priority warnings, batched into hourly or daily summaries
- Tier 3 (Silence): Low-priority info/debug - stored but never alerted
Rate limiting prevents alert storms, with all counters backed by Redis:
- Same exact alert: max 1 per 5 minutes
- Same error group: max 5 per hour
- Same severity level: max 20 per hour
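The production version uses Redis INCR with EXPIRE so counters are shared across workers and expire on their own; this is a stdlib fixed-window sketch of the same counting logic (scope names are mine):

```python
import time

# Limits from above: scope -> (window in seconds, max alerts per window)
LIMITS = {
    "exact_alert": (300, 1),     # same exact alert: 1 per 5 minutes
    "error_group": (3600, 5),    # same error group: 5 per hour
    "severity":    (3600, 20),   # same severity level: 20 per hour
}

_counters = {}  # in Redis: INCR on a key, EXPIRE set on first increment

def allow(scope, key, now=None):
    """Return True if an alert in this scope/key may be sent right now."""
    window, limit = LIMITS[scope]
    bucket = int((now if now is not None else time.time()) // window)
    counter_key = (scope, key, bucket)
    _counters[counter_key] = _counters.get(counter_key, 0) + 1
    return _counters[counter_key] <= limit
```

An alert only goes out if all three scopes allow it, which is what turns 500 identical errors into one email and a count.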
Without rate limiting, a cascading failure that produces 500 identical errors would send 500 emails. With it, you get one email and a count.
The Safety Layer
Every log passes through PII detection before touching an LLM. The system uses Presidio with spaCy NER to catch emails, phone numbers, credit card numbers, SSNs, names, and IP addresses. These are masked before the log is embedded in any prompt.
There’s also injection detection - logs can contain adversarial content designed to manipulate the LLM’s analysis. The system detects these patterns before processing.
Observability: Watching the Watchers
A log analytics system that you can’t monitor is ironic. The entire pipeline is instrumented with Langfuse - 24 trace points covering every agent execution, every LLM call, every embedding operation.
This serves two purposes: debugging (why did this log get classified wrong?) and cost tracking (which workflows are consuming the most LLM tokens?). When you’re processing 10K+ logs daily, even small per-log cost differences compound.
The Building Blocks
If you want to build something similar:
| Component | Role | Why |
|---|---|---|
| Event bus (Redis pub/sub) | Agent communication | Decouples agents, enables independent scaling |
| Vector database | Semantic error grouping | Groups errors by meaning, not string matching |
| LLM provider | Classification, source extraction | Handles edge cases regex can’t |
| PII detection | Pre-LLM sanitization | Logs contain sensitive data - always |
| Task queue | Background processing | Ingestion can’t block the dashboard |
| Observability platform | Tracing and cost tracking | You need to monitor the monitor |
The pattern is broadly applicable: any system processing high-volume events benefits from workflow-based routing, semantic grouping, and rate-limited notifications. The specific domain is logs, but the architecture works for support tickets, security events, or any streaming data where you need to separate signal from noise.
Questions about the architecture? Reach out via LinkedIn or email.