Anthropic Cut Their False Positive Rate to 7% With Claude Code. The Pattern Is Stealable. | Aravind Arumugam
cybersecurity · 11 min read
Anthropic published a piece this morning about how their cybersecurity team built an internal SOC platform on top of Claude Code. Yes, it is a Claude Code marketing post. No, that does not mean the pattern is worthless. Jackie Bow, the technical lead, gave away enough operational detail that anyone running a security ops function should read it twice and steal the parts that apply.
I read it once for the headlines and again with my ops hat on. Here is what is real, what is hype, and what your team should actually do this week.
The headline number
Before CLUE, roughly one in three alerts in Anthropic's environment turned out to be false positives. That rate has dropped to 7 per cent.
If you run a SOC, that number should make you stop scrolling. False positive fatigue is one of the most commonly cited reasons analysts burn out, and a likely root cause when a real incident gets missed in a noisy queue. Going from roughly 33 per cent to 7 per cent is a reduction of more than a factor of four, not a marginal improvement. It is the difference between a team that can stay focused and a team that stops trusting its own alerts.
A second number worth memorising. Over 30 days, CLUE ran 12,000 queries and 27,000 tool calls. Bow's team estimates that work would have taken 1,870 hours of analyst time. At eight hours a day, that is roughly 234 person-days saved in a single month from one platform.
These are Anthropic's numbers in Anthropic's environment, so calibrate accordingly. But the order of magnitude is the point.
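The arithmetic behind those figures is easy to check. A quick sketch, assuming an eight-hour working day (my assumption, the post does not state one) and reading "roughly one in three" as 33 per cent:

```python
# Sanity-check the headline figures quoted above.
# Assumption (not from the post): one person-day is 8 working hours.
HOURS_PER_PERSON_DAY = 8

analyst_hours = 1_870      # Anthropic's estimate for 30 days of CLUE work
fp_rate_before = 0.33      # "roughly one in three", read as 33 per cent
fp_rate_after = 0.07       # 7 per cent

person_days = analyst_hours / HOURS_PER_PERSON_DAY
reduction_factor = fp_rate_before / fp_rate_after

print(f"person-days: {person_days:.0f}")
print(f"false positive reduction: {reduction_factor:.1f}x")
```

That works out to about 234 person-days and a reduction of roughly 4.7x. Change the working-day assumption and the person-day figure moves with it.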
What CLUE actually is
CLUE stands for Claude Looks Up Evidence. The name is a bit cute. The architecture is not.
It is a natural-language interface over Anthropic's security data, built with Claude Code as the collaborator. It has three parts that all use the same underlying agentic pattern.
CLUE Triage runs first-pass triage on every alert before a human sees it. When an alert fires, Claude pulls context from Slack messages, internal docs, code repositories and data warehouses. It then assigns one of four dispositions (false positive, true positive, malicious, expected behaviour) with a confidence score. Analysts only see the alerts where Claude is uncertain or flags real concern.
The trick here is not the LLM. It is the tools Claude has access to. An alert without context is just a log line. The same alert seen alongside "the team flagged this exact maintenance window in #sre yesterday" is closed in two seconds. Anthropic's CLUE is plumbed into the systems where that internal context lives. That is the moat, not the model.
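The post does not describe how CLUE decides which alerts reach a human, so treat the following as a minimal sketch of one plausible routing rule rather than Anthropic's implementation. The four dispositions come from the post. The `TriageResult` type, the `needs_human` function and the 0.9 threshold are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    FALSE_POSITIVE = "false positive"
    TRUE_POSITIVE = "true positive"
    MALICIOUS = "malicious"
    EXPECTED_BEHAVIOUR = "expected behaviour"


@dataclass(frozen=True)
class TriageResult:
    alert_id: str
    disposition: Disposition
    confidence: float  # 0.0 to 1.0, as reported by the model


# Hypothetical threshold; tune it against your own labelled alerts.
CONFIDENCE_THRESHOLD = 0.9
BENIGN = {Disposition.FALSE_POSITIVE, Disposition.EXPECTED_BEHAVIOUR}


def needs_human(result: TriageResult) -> bool:
    """Escalate when the model is unsure or the disposition is not benign."""
    if result.confidence < CONFIDENCE_THRESHOLD:
        return True
    return result.disposition not in BENIGN


# Made-up results for illustration.
results = [
    TriageResult("a-1", Disposition.FALSE_POSITIVE, 0.97),
    TriageResult("a-2", Disposition.EXPECTED_BEHAVIOUR, 0.62),
    TriageResult("a-3", Disposition.MALICIOUS, 0.95),
]
queue = [r.alert_id for r in results if needs_human(r)]
print(queue)  # ['a-2', 'a-3']
```

The confident false positive is closed without a human. The uncertain one and the malicious one both land in the analyst queue.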
CLUE Investigate is the natural-language query layer over security-critical logs. Analysts ask "what are all the failed logins for this system over the past day?" and Claude turns that into SQL. An orchestrator agent breaks the question into sub-tasks, dispatches them to sub-agents that run in parallel, and synthesises the results.
The numbers Bow shares are concrete. CLUE averages 25 tool calls and 11 queries per investigation session. A human analyst would not reasonably run 11 queries on the same incident. They would write a couple, eyeball the output, and move on. CLUE runs the queries a human would write and then several more the human would not have thought to write. That is where the precision gain shows up.
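The orchestrator shape is simple enough to sketch with nothing but the standard library. This is an illustration of the fan-out and fan-in pattern, not CLUE's code. The `plan` and `run_subtask` functions are hypothetical stand-ins for model calls, and the sub-task names are invented.

```python
import asyncio


async def run_subtask(name: str, question: str) -> dict:
    """Stand-in for a sub-agent. A real one would call an LLM and run queries."""
    await asyncio.sleep(0)  # placeholder for network and query latency
    return {"subtask": name, "finding": f"no results yet for: {question}"}


def plan(question: str) -> list[str]:
    """Stand-in planner. A real orchestrator would ask the model to decompose."""
    return ["auth_logs", "vpn_logs", "endpoint_logs"]


async def investigate(question: str) -> dict:
    subtasks = plan(question)
    # Fan out: every sub-agent runs concurrently rather than one after another.
    findings = await asyncio.gather(*(run_subtask(s, question) for s in subtasks))
    # Fan in: a real system would hand the findings back to the model to synthesise.
    return {"question": question, "findings": list(findings)}


report = asyncio.run(investigate("failed logins for this system over the past day"))
print(len(report["findings"]))  # 3
```

`asyncio.gather` returns results in the order the sub-tasks were submitted, which keeps the synthesis step deterministic even when the sub-agents finish out of order.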
The data governance use case is the one that translates most directly to other organisations. Their example: did three contractors access documents they should not have, over the past two months. That is normally a half-day of querying access logs, cross-referencing permissions, and reviewing classifications. CLUE does it in minutes with a transparent audit trail of every query.
Most teams have this exact workflow. Most teams hate it. Most teams will not touch it because the cost-per-query of human attention is too high. Move that floor and the workflow becomes routine instead of avoided.
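To make the contractor example concrete, here is a minimal sketch of the kind of query an agent might generate for it. The post does not show the real schema, so the `access_log` and `permissions` tables, their columns and every row below are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE access_log (username TEXT, doc_id TEXT, accessed_at TEXT);
    CREATE TABLE permissions (username TEXT, doc_id TEXT);

    -- Made-up sample rows.
    INSERT INTO access_log VALUES
        ('contractor_a', 'doc-1', '2025-01-10'),
        ('contractor_a', 'doc-9', '2025-01-12'),
        ('contractor_b', 'doc-2', '2025-02-01');
    INSERT INTO permissions VALUES
        ('contractor_a', 'doc-1'),
        ('contractor_b', 'doc-2');
""")

# Accesses inside the window that have no matching permission row.
QUERY = """
    SELECT a.username, a.doc_id, a.accessed_at
    FROM access_log AS a
    LEFT JOIN permissions AS p
        ON p.username = a.username AND p.doc_id = a.doc_id
    WHERE a.username IN (?, ?, ?)
      AND a.accessed_at BETWEEN ? AND ?
      AND p.username IS NULL
    ORDER BY a.accessed_at
"""
params = ("contractor_a", "contractor_b", "contractor_c", "2025-01-01", "2025-02-28")
violations = conn.execute(QUERY, params).fetchall()
print(violations)  # [('contractor_a', 'doc-9', '2025-01-12')]
```

The query itself is not hard. The expensive part for a human is knowing which tables hold the answer and writing the join correctly the first time, which is exactly the cost the agent absorbs.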
The build timeline that the engineering crowd will fixate on
Proof of concept in a day. Full system in a week. Bow's team built CLUE as a side project while running their actual day job.
Take that with a pinch of salt. Anthropic's engineers have insider knowledge of Claude Code, direct access to model behaviour data, and presumably a higher tolerance for moving fast on internal tooling than a regulated enterprise would. But the framing is honest enough. The shape of the work has changed. Internal tooling that used to need dedicated sprints of its own can now be built alongside regular sprint work.
If your team is sitting on a "we should automate that" list that has been stale for two years, this is the year to grind through it.
What is universally transferable
Three patterns generalise out of CLUE no matter your team size or stack.
Internal context wins. Every SOC has alerts. Few SOCs have alerts plumbed against Slack threads, design docs, runbooks, and warehouse data simultaneously. The point of CLUE is not that Claude is smart enough to triage alerts. It is that Claude can read everything humans would read if they had unlimited time. Build the integrations before you tune the model.
Parallel sub-agent investigation beats sequential query writing. CLUE's orchestrator pattern (one agent breaks down the question, dispatches sub-agents, synthesises results) is the architecture lesson worth copying. You do not need Claude Code to implement this. You can build the same shape with any decent LLM SDK and a job queue. The win is in the orchestration, not the model choice.
Audit transparency is the unlock for trust. CLUE stores the transcript of every investigation: every query, every tool call, every Claude reasoning step. That is what makes a non-deterministic system safe to use in a regulated environment. If you cannot show your auditors exactly what the model did, you cannot ship this pattern. Build the transcript layer first.
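A transcript layer does not have to be elaborate to be useful. The sketch below is an assumption about what a minimal one could look like, not a description of how CLUE stores its transcripts. The `Transcript` class, the event kinds and the sample events are all hypothetical.

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class Transcript:
    investigation_id: str
    events: list = field(default_factory=list)

    def record(self, kind: str, payload: dict) -> None:
        """Append one event. This class offers no way to edit or remove one."""
        self.events.append({
            "seq": len(self.events),
            "ts": time.time(),
            "kind": kind,  # e.g. "prompt", "tool_call", "query", "decision"
            "payload": payload,
        })

    def of_kind(self, kind: str) -> list:
        """Return every recorded event of one kind, in order."""
        return [e for e in self.events if e["kind"] == kind]

    def to_jsonl(self) -> str:
        """Serialise as one JSON object per line for storage or review."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.events)


t = Transcript("inv-001")
t.record("prompt", {"text": "failed logins for host X in the last day"})
t.record("query", {"sql": "SELECT count(*) FROM auth_log WHERE outcome = 'fail'"})
t.record("decision", {"disposition": "expected behaviour", "confidence": 0.93})
print(len(t.of_kind("query")), len(t.to_jsonl().splitlines()))  # 1 3
```

In production you would write these events to append-only storage rather than keep them in memory, but the contract is the same: one ordered record per step, queryable by kind.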
What is Anthropic-specific
Two things in the post do not translate cleanly.
They use their own model with insider tuning insight. Anthropic's team knows when Sonnet vs Opus is the right choice in a way external customers cannot easily replicate. The footnote on the post notes "results were generated using Claude Sonnet and Opus models", which suggests routing decisions are baked into the system in ways that are not visible to the reader. Your model selection logic will be cruder. That is fine, just do not expect parity.
Their data is unusually clean. Anthropic is a young company (founded in 2021) with high-trust hiring and a culture where Slack threads contain real engineering reasoning. If your organisation has thousands of legacy tickets, undocumented systems and conflicting data dictionaries, your CLUE clone will have a noisier corpus to query. The 7 per cent false positive rate is achievable, but the work to get there is bigger than the Anthropic headline implies.
The Bitter Lesson application is the real intellectual punch
This is the part of the post that most security leaders will skim past and that I think is the most important paragraph of the entire piece.
Bow references Rich Sutton's Bitter Lesson: the observation from AI research that encoding human reasoning into systems consistently loses, over time, to general methods that scale with computation and are left to find their own approaches.
Applied to security ops this is heresy. The entire SOAR generation of tools was built on the opposite principle: define every step, encode every playbook, make the response deterministic. Bow's team explicitly rejected that.
"When we gave Claude latitude to explore — access to tools and a goal, rather than a rigid sequence — it often took investigation paths we wouldn't have prescribed. Sometimes those paths surfaced context we'd have missed."
I have spent the last few years watching teams build elaborate playbooks for incidents they hoped would never repeat. The playbooks aged badly. The investigation patterns the playbook captured were the patterns of the analyst who wrote it. New attack patterns required new playbooks. Maintenance ate the team.
If Bow is right (and I think she is), the next generation of SOC tooling is not playbooks at all. It is bounded autonomy: give the agent the tools, give it the goal, audit what it actually did. The investigation strategy becomes emergent rather than encoded.
That has serious implications for any team currently buying SOAR licences or building deterministic response engineering. The skills shift from writing flows to writing tool boundaries and audit guards. The metric shifts from playbook coverage to disposition accuracy. The org structure shifts from playbook owners to model-output reviewers.
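"Tool boundaries and audit guards" sounds abstract, so here is a minimal sketch of what one could look like in code. It is my illustration of bounded autonomy, not something taken from the Anthropic post. The `BoundedToolbox` class, the call budget and the `search_logs` tool are hypothetical.

```python
from typing import Callable


class ToolBudgetExceeded(RuntimeError):
    pass


class BoundedToolbox:
    """Exposes only registered tools and caps how many calls a session may make."""

    def __init__(self, max_calls: int) -> None:
        self._tools: dict[str, Callable[..., object]] = {}
        self._max_calls = max_calls
        self.audit: list[tuple[str, dict]] = []

    def register(self, name: str, fn: Callable[..., object]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs: object) -> object:
        if name not in self._tools:
            raise PermissionError(f"tool not allowed: {name}")
        if len(self.audit) >= self._max_calls:
            raise ToolBudgetExceeded(f"limit of {self._max_calls} calls reached")
        # Log before running, so a tool that raises still leaves a trace.
        self.audit.append((name, kwargs))
        return self._tools[name](**kwargs)


def search_logs(host: str) -> list[str]:
    """Hypothetical read-only tool."""
    return [f"{host}: login failed"]


box = BoundedToolbox(max_calls=2)
box.register("search_logs", search_logs)
print(box.call("search_logs", host="web-1"))  # ['web-1: login failed']

try:
    box.call("delete_logs", host="web-1")  # never registered, so refused
except PermissionError as exc:
    print(exc)  # tool not allowed: delete_logs
```

The agent decides which tools to call and in what order. The toolbox decides what is callable at all and how many times. That split is the whole idea: the strategy stays open, the blast radius does not.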
Most security teams will resist this for two or three years. The teams that do not will have a real operational advantage.
What you should do this week if you run a SOC
Five practical actions, ordered by leverage.
Inventory your internal context, not your alert sources. Write down every system where humans have to look when triaging: Slack channels, ticketing, runbooks, code search, the warehouse, dashboards. That list is the work backlog for a CLUE-style platform. If your security tooling cannot read those systems, no amount of LLM cleverness will close the gap.
Pick one investigation type that eats analyst time and prototype the agentic pattern. A common one is "did user X access resource Y in window Z", which is exactly the data governance example from the Anthropic post. Build it as an orchestrator with two or three sub-agents in your preferred LLM SDK. Measure how many queries the agent runs versus how many your analysts run for the same answer.
Build the audit transcript layer before you scale. Every tool call, every prompt, every response, every decision. Make it queryable. If your compliance team cannot read the transcript of an investigation, the project dies on first audit.
Run a 30-day false positive baseline. Anthropic went from roughly 33 per cent to 7 per cent. You cannot claim a similar improvement without knowing what your starting rate is. Most SOCs do not measure this rigorously. Start now.
Resist the playbook instinct. When you first deploy an AI-assisted triage system, the temptation is to constrain it with the playbooks you already have. Constrain its tools and its data. Leave its strategy alone for the first month. Then look at what investigation paths it actually took. You will learn more from watching it work than from telling it how to.
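The false positive baseline in the list above is the easiest of the five to start today, because it only needs your closed alerts and their final dispositions. A minimal sketch, using made-up sample data and a hypothetical `false_positive_rate` helper. Whether "expected behaviour" counts as a false positive is a definitional choice you need to make and write down; this version counts it separately.

```python
from collections import Counter

# Made-up sample of closed alerts: (alert_id, final analyst disposition).
closed_alerts = [
    ("a-1", "false positive"),
    ("a-2", "true positive"),
    ("a-3", "false positive"),
    ("a-4", "expected behaviour"),
    ("a-5", "true positive"),
    ("a-6", "malicious"),
]


def false_positive_rate(alerts: list[tuple[str, str]]) -> float:
    """Share of closed alerts whose final disposition was 'false positive'."""
    if not alerts:
        raise ValueError("no closed alerts in the window")
    counts = Counter(disposition for _, disposition in alerts)
    return counts["false positive"] / len(alerts)


rate = false_positive_rate(closed_alerts)
print(f"{rate:.0%}")  # 33%
```

Run it over the same 30-day window before and after any change to your triage process, with the same definition both times, or the comparison means nothing.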
The bigger pattern
Anthropic is publishing a series of "how we use Claude" posts and this one is the most concrete and operationally useful of the lot. It is also marketing. Both things are true.
The thing I keep coming back to is the Bitter Lesson framing. Security operations has spent two decades getting more deterministic. SOAR, playbook libraries, prescriptive runbooks. All of it built on the assumption that the right answer is to encode expert reasoning into repeatable flows.
If the next generation of models is good enough to investigate competently with just tools and a goal, then a lot of the deterministic infrastructure becomes overhead rather than uplift. The teams that recognise this early get to redirect their engineering capacity. The teams that double down on playbooks will spend the next three years maintaining systems that future hires will find baffling.
I have spent 18+ years in tech, the last few in security, and I have watched plenty of "the new way" announcements turn into nothing. This one feels different because the numbers are concrete and the architectural pattern is simple enough to copy. Most SOCs will not copy it. The ones that do will look unrecognisable in 24 months.
What is your team doing about AI-augmented detection and response?
Tamil first, English second, 18+ years in tech with the last few in security. I have built and dismantled enough SOC tooling to recognise when a pattern is genuinely new and when it is the same idea in different packaging. CLUE is genuinely new. Most teams will pretend it is not, because acknowledging it means rethinking three years of deterministic tooling investment.