Testing a Security Tool Like It Can Hurt People
A security tool cannot be tested like a normal CLI. We built a standing assurance platform around deterministic fixtures, real-kernel runners, preserved artifacts, and red runs that prove the system can fail loudly.
Security tools fail quietly
When a normal CLI is wrong, it usually crashes, exits non-zero, or produces output so malformed that something downstream rejects it.
A security tool can fail much more quietly. It can return a clean report with the wrong reachability label, the wrong process attribution, or the wrong confidence. The UI renders, the pipeline passes, and someone makes a decision on top of a result that should never have been trusted.
That is the failure mode we care about most: confident wrongness.
One of our runners went red because the tool guessed. The fixture had multiple Python worker processes, and in that situation the correct behavior is to abstain. If more than one process could have caused an import, the tool should say it cannot attribute the event, not pick a process and make the report look more precise than it is. In the run, that mistake showed up as 157 inferred attributions where the invariant allowed 0.
That is the kind of bug a security product cannot afford to ship, and it is exactly the kind of bug our assurance system is supposed to catch.
We are building a container security tool that answers questions across static analysis, runtime evidence, and vulnerability data: what is installed, what is linked, what application code can reach, and what was actually observed when the container ran. If that evidence is wrong, the product is wrong quietly.
The only useful defense is an oracle: a deterministic check that knows what the answer should be and fails loudly when the tool disagrees. The oracle is not a model and it is not a dashboard. It is a repeatable artifact with an expected value, an actual value, and a failure mode.
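A minimal sketch of that shape, for illustration only. The class and function names here are hypothetical and are not the tool's actual code; the point is that an oracle result carries an expected value, an actual value, and a failure that cannot be ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OracleResult:
    """Outcome of one deterministic check (hypothetical shape)."""
    fixture: str
    invariant: str
    expected: int
    actual: int

    @property
    def passed(self) -> bool:
        # The oracle passes only when the tool agrees with the known answer.
        return self.expected == self.actual

    def describe(self) -> str:
        status = "PASS" if self.passed else "FAIL"
        return (f"{status} {self.fixture}/{self.invariant}: "
                f"expected={self.expected} actual={self.actual}")


class OracleFailure(AssertionError):
    """Raised so that a disagreement fails the run loudly."""


def enforce(result: OracleResult) -> None:
    # Fail loudly: a mismatch raises instead of being logged and skipped.
    if not result.passed:
        raise OracleFailure(result.describe())
```

Raising an exception rather than returning a status keeps a mismatch from being quietly dropped by a caller that forgets to check it.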
Confident, wrong output in security is worse than no output at all.
Reading guide
Before the rest of the post gets technical, here is the vocabulary in plain English.
An oracle is a check with a known right answer. If the tool disagrees, the run fails.
A fixture is a small test container built to prove one behavior: a static Go binary, a dynamic cgo binary, or a Python app with multiple worker processes.
An invariant is the rule that fixture must always satisfy. A static binary should not show a runtime shared-library load. A multiprocess app should not get guessed attribution.
A collector is the mechanism that observes runtime behavior. In this post, that means a local proc-maps collector and a Linux eBPF collector.
Runtime evidence is what the tool actually observed when the container ran.
Process attribution is tying that observation back to the process that caused it. If we cannot do that safely, the tool should say it does not know.
What we test
Correctness for this kind of product is not one property, so the assurance work cannot be one test. The product has to be right about the graph it builds, the runtime behavior it observes, the vulnerability data it joins in, and the boundary between what it knows and what it should refuse to infer. This is the map we are working through over time:
1. Static graph construction. Packages, native extensions, OS layers, ownership.
2. Runtime evidence. What loads during the observed window.
3. Process attribution. Which process caused an observation, and when to refuse to guess.
4. Collector agreement. Different observers should agree on the deterministic verdict even when raw signal volume differs.
5. Vulnerability database behavior. Feed drift must not masquerade as a tool regression.
6. Fixture invariants. Behavioral expectations, not brittle counts.
7. Exploit confirmation. Exercise a vulnerable path and verify that the patched twin goes silent.
8. AI triage. Agents read and explain evidence; they never produce truth.
This post is not a claim that every region of that map is finished. It is a proof of method on the first hard slice: runtime evidence, process attribution, fixture invariants, and collector agreement. The remaining regions only become real when they get the same treatment: green runs, red runs, and artifacts behind the claims.
The first proven slice
We started with runtime evidence because that is where correctness gets subtle.
If a library was loaded by the app server, that is useful evidence. If it was loaded by a shell helper or an injected reader, that is a different claim. If a Python import is attributed to the wrong worker in a multiprocess app, the report looks more precise while becoming less true.
So the first oracle checks process attribution. In a single-process runtime, inferred attribution is acceptable when there is exactly one candidate. In a multiprocess runtime, the system must abstain rather than guess.
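The abstention rule is small enough to sketch directly. This is an illustrative version under assumed names (Attribution, attribute, and the "inferred" and "abstained" labels are hypothetical), not the product's implementation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Attribution:
    pid: Optional[int]   # None when the tool abstains
    source: str          # "inferred" or "abstained" (hypothetical labels)


def attribute(candidate_pids: Sequence[int]) -> Attribution:
    """Attribute an event only when exactly one process could have caused it."""
    unique = sorted(set(candidate_pids))
    if len(unique) == 1:
        return Attribution(pid=unique[0], source="inferred")
    # Zero or several candidates: say "unknown" instead of picking one.
    return Attribution(pid=None, source="abstained")
```

The sketch also abstains when there are no candidates at all, which is an assumption of this example rather than something the post specifies.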
The current corpus has five fixtures and twenty-two invariant checks: static Go, dynamic cgo, single-process Python, pre-fork Gunicorn, and explicit Python multiprocessing.
These are behavioral checks, not count snapshots. Counts can move when the graph gets better, but the blocking checks are about the behavior the product is allowed to claim. The probe must run, dynamic linkage must produce a runtime load, static linkage must not, process fidelity must be recorded, and multiprocess attribution must not invent certainty.
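One way to picture behavioral checks is as predicates over a run report rather than comparisons against a stored count. The sketch below assumes a hypothetical RunReport shape; most fixture and invariant names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RunReport:
    # Hypothetical summary of one fixture run.
    probe_ran: bool
    runtime_loads: int
    inferred_attributions: int
    process_fidelity: str  # empty string means "not recorded"


Invariant = Callable[[RunReport], bool]

# Each check states a behavior ("at least one load", "no load"),
# never an exact count that would break when the graph improves.
INVARIANTS: Dict[str, Dict[str, Invariant]] = {
    "static-go": {
        "probe-ran": lambda r: r.probe_ran,
        "no-runtime-load": lambda r: r.runtime_loads == 0,
    },
    "dynamic-cgo": {
        "probe-ran": lambda r: r.probe_ran,
        "runtime-load-observed": lambda r: r.runtime_loads > 0,
    },
    "python-multiprocess": {
        "process-fidelity-recorded": lambda r: r.process_fidelity != "",
        "multi-candidate-abstention": lambda r: r.inferred_attributions == 0,
    },
}


def failed_invariants(fixture: str, report: RunReport) -> List[str]:
    """Return the names of the invariants this report violates."""
    checks = INVARIANTS[fixture]
    return [name for name, check in checks.items() if not check(report)]
```

With checks written this way, a dynamic fixture that reports 12 loads and one that reports 40 both pass, while a single guessed attribution in a multiprocess fixture blocks.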
On the canonical green run, all five fixtures passed: 22 of 22 invariants, zero blocking failures, on a real Linux amd64 kernel (6.17).
The red run is the point
A suite that only goes green is not proving much. What matters is whether it can fail in the specific way the product would otherwise fail customers.
One runner went red in exactly the way the oracle was built to catch. The infrastructure was healthy: kernel setup, binding generation, and artifact upload all succeeded. What failed was the code under test.
In the python-multiprocess fixture, the multi-candidate-abstention invariant expected inferred attribution to equal 0. The actual value was 157: the tool attributed imports in a multiprocess app instead of abstaining.
Evidence status: blocking invariant failure. Multiprocess attribution guessed when it should abstain.
The same defect tripped a unit check independently: an ambiguous two-process case should not be attributed, but the function returned an inferred process source. That gave us two checks pointing at the same cause, with only one of the five fixtures going red. A regression in inference became a red artifact with a fixture name, an expected value, an actual value, and preserved logs, instead of a confident runtime claim shipped to a user.
We also keep a seeded-regression proof for this class of bug: loosen the abstention guard, run the suite, watch python-multiprocess fail; revert it, and the suite returns to green. That proof is saved as an internal artifact, not described from memory.
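The idea behind a seeded-regression proof can be sketched in a few lines. Everything here is a toy stand-in with hypothetical names: a guarded attributor, a deliberately loosened copy, and a tiny suite that must go red for one and green for the other.

```python
from typing import Callable, List, Optional, Sequence

Attributor = Callable[[Sequence[int]], Optional[int]]


def guarded(candidates: Sequence[int]) -> Optional[int]:
    # Correct behavior: attribute only when exactly one process is possible.
    return candidates[0] if len(set(candidates)) == 1 else None


def loosened(candidates: Sequence[int]) -> Optional[int]:
    # Seeded regression: the abstention guard is removed on purpose.
    return candidates[0] if candidates else None


def count_inferred(attributor: Attributor, events: List[List[int]]) -> int:
    return sum(1 for cands in events if attributor(cands) is not None)


def suite_is_green(attributor: Attributor) -> bool:
    # Toy multiprocess fixture: every event has two candidate processes,
    # so the invariant allows zero inferred attributions.
    events = [[101, 102], [102, 103], [101, 103]]
    return count_inferred(attributor, events) == 0


def seeded_regression_proof() -> bool:
    """True only if the suite fails with the bug and passes without it."""
    return (not suite_is_green(loosened)) and suite_is_green(guarded)
```

A proof like this checks the suite itself: if the loosened guard did not turn the run red, the invariant would not be guarding anything.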
One collector is not the story
The kernel collector matters, but it is not the story by itself.
Runtime evidence can come from different observers. A local proc-maps lane is useful and cheap. A Linux eBPF lane is more authoritative for process-aware events. The important question is not whether their raw counts match, because they should not. The important question is whether the deterministic verdicts match. If they did not, the tool's answer would depend on where you ran it: a developer on Docker Desktop would get a different security verdict than CI on Linux.
On the same fixture corpus, the two collectors agreed on the behavior that matters: the deterministic verdicts.
That distinction is the whole testing philosophy. We do not turn every number into a gate. Some numbers are facts to explain. The gate belongs on the claim the product makes.
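As a rough illustration of gating on the claim rather than the count, the comparison below looks only at per-fixture verdicts and ignores raw event volume. The CollectorResult shape and the verdict strings are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CollectorResult:
    raw_events: int   # a fact to explain, never gated on
    verdict: str      # the deterministic claim, e.g. "runtime-load" or "abstain"


def disagreements(a: Dict[str, CollectorResult],
                  b: Dict[str, CollectorResult]) -> List[str]:
    """Return fixtures where two collectors reach different verdicts.

    A fixture missing from either side also counts as a disagreement.
    """
    return sorted(
        fixture
        for fixture in a.keys() | b.keys()
        if fixture not in a
        or fixture not in b
        or a[fixture].verdict != b[fixture].verdict
    )
```

Two collectors that report very different event counts for the same fixture still agree under this check as long as their verdicts match.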
Confirming a real vulnerability
Detecting a CVE is half the job. Confirming that it is reachable and exercised, and going silent once it is patched, is the half that earns trust.
We built a first matched pair around a Python dependency vulnerability: the same app, the same startup path, and the same runtime exercise, with only the dependency version changed. One side used the vulnerable version; the other used the fixed version.
This is not yet a full exploit-confirm harness. It is one clean data point. The scanner confirmed the vulnerable build when the package was reachable and observed, then cleared the patched build when the fixed version removed the CVE.
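If the pair were turned into a check, it might look something like the sketch below. The ScanResult fields and the pair_confirms function are hypothetical. The requirement that the patched build stays reachable and observed is an assumption drawn from the "same runtime exercise" setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanResult:
    # Hypothetical summary of one scan of one build.
    cve_reported: bool
    package_reachable: bool
    package_observed: bool


def pair_confirms(vulnerable: ScanResult, patched: ScanResult) -> bool:
    """Check a matched pair: confirmed when vulnerable, silent when patched."""
    confirmed = (vulnerable.cve_reported
                 and vulnerable.package_reachable
                 and vulnerable.package_observed)
    # The patched twin runs the same code path, so the package should still
    # be reachable and observed; only the CVE finding should disappear.
    same_exercise = patched.package_reachable and patched.package_observed
    cleared = not patched.cve_reported
    return confirmed and same_exercise and cleared
```

Requiring the same exercise on both sides guards against a false "clear" caused by the patched build simply not running the code.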
Why AI is not the oracle
We build AI agents. We are not going to let one decide whether our security tool is correct.
An AI evaluator can read evidence, summarize a run, and flag what looks suspicious. What it must never do is mint truth by asserting that a run passed when no deterministic check did.
The correct output shape is narrow: verdict from the deterministic gate, failed invariant, expected value, actual value, corroborating unit failure, and a hypothesis about the likely code path. The agent does not assert pass or fail, does not modify an expected value, and does not touch the gate.
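That narrow shape can be expressed as a data structure in which the agent controls only the hypothesis. The names below (GateReport, TriageNote, build_triage_note) are illustrative assumptions, not an existing API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GateReport:
    # Produced by the deterministic gate, never by an agent.
    verdict: str            # "pass" or "fail"
    failed_invariant: str
    expected: int
    actual: int


@dataclass(frozen=True)
class TriageNote:
    # Every field except `hypothesis` is copied from deterministic output.
    verdict: str
    failed_invariant: str
    expected: int
    actual: int
    corroborating_unit_failure: Optional[str]
    hypothesis: str


def build_triage_note(gate: GateReport,
                      hypothesis: str,
                      corroborating_unit_failure: Optional[str] = None) -> TriageNote:
    """Wrap an agent's hypothesis around a verdict it cannot change."""
    return TriageNote(
        verdict=gate.verdict,
        failed_invariant=gate.failed_invariant,
        expected=gate.expected,
        actual=gate.actual,
        corroborating_unit_failure=corroborating_unit_failure,
        hypothesis=hypothesis,
    )
```

Because the builder takes no verdict argument and both records are frozen, the agent has no path for asserting pass or fail or for editing an expected value.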
Oracles decide; agents triage. Agents read evidence; they never produce it.
What remains unproven
This post is only useful if it is honest about its edges.
What is next
The next work is not to make the post sound bigger. It is to make the evidence bigger.
The pinned database lane should prove that repeated runs against the same vulnerability snapshot produce stable results. The exploit-confirm pair should become a real fixture lane with invariants. The AI evaluator should read deterministic reports and cite them without ever minting truth. The kernel matrix should expand beyond one amd64 kernel.
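For the pinned database lane, one plausible form of the stability check is to reduce each run to an order-independent digest and require all digests to match. This is a sketch of the idea only. The finding fields and function names are assumptions, and the CVE identifiers in the usage example are placeholders.

```python
import hashlib
import json
from typing import Dict, List

Finding = Dict[str, str]  # hypothetical shape: {"package": ..., "cve": ...}


def report_digest(findings: List[Finding]) -> str:
    """Hash a run's findings in a canonical, order-independent form."""
    ordered = sorted(findings, key=lambda f: (f["package"], f["cve"]))
    canonical = json.dumps(ordered, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def runs_are_stable(runs: List[List[Finding]]) -> bool:
    """True when every run against the same snapshot yields the same digest.

    An empty list of runs is treated as unproven, so it returns False.
    """
    return len({report_digest(run) for run in runs}) == 1
```

For example, two runs that list the same findings in a different order are stable, while a run with an extra finding is not:

```python
run_a = [{"package": "libfoo", "cve": "CVE-0000-0001"},
         {"package": "libbar", "cve": "CVE-0000-0002"}]
run_b = list(reversed(run_a))
print(runs_are_stable([run_a, run_b]))       # True
print(runs_are_stable([run_a, run_a[:1]]))   # False
```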
Until each lands with an artifact behind it, we will not claim it. That is the whole point.
That is also the standard we want readers to remember: a security product should be able to show how it knows when it is wrong. This is how we are building at Emphere. The next post goes inside the real-kernel lab that made these runs possible.