Testing a Security Tool Like It Can Hurt People

A security tool cannot be tested like a normal CLI. We built a standing assurance platform around deterministic fixtures, real-kernel runners, preserved artifacts, and red runs that prove the system can fail loudly.

Emphere Engineering · Assurance · June 8, 2026 · 9 min read

Security tools fail quietly

A security tool cannot be tested like a normal CLI. When a normal CLI is wrong, it usually crashes, exits non-zero, or produces output so malformed that something downstream rejects it.

A security tool can fail much more quietly. It can return a clean report with the wrong reachability label, the wrong process attribution, or the wrong confidence. The UI renders, the pipeline passes, and someone makes a decision on top of a result that should never have been trusted.

The failure mode we care about most is output that looks certain when it is wrong.

One of our runners went red because the tool guessed. The fixture had multiple Python worker processes, and in that situation the correct behavior is to abstain. If more than one process could have caused an import, the tool should say it cannot attribute the event, not pick a process and make the report look more precise than it is. In the run, that mistake showed up as 157 inferred attributions where the invariant allowed 0.

A security product cannot afford to ship that kind of bug. The assurance system has a concrete job: catch the moment the tool starts sounding more certain than the evidence allows.

We are building a container security tool that answers questions across static analysis, runtime evidence, and vulnerability data: what is installed, what is linked, what application code can reach, and what was actually observed when the container ran. If that evidence is wrong, the product is wrong quietly.

The only useful defense is an oracle: a deterministic check that knows what the answer should be and fails loudly when the tool disagrees. The oracle is not a model and it is not a dashboard. It is a repeatable artifact with an expected value, an actual value, and a failure mode.
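As a minimal sketch of that shape, assuming integer-valued checks: the class names, the `enforce` helper, and the `no-runtime-lib-load` invariant name below are illustrative, not the tool's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OracleResult:
    """One deterministic check: what was expected and what was observed."""
    fixture: str
    invariant: str
    expected: int
    actual: int

    @property
    def passed(self) -> bool:
        return self.expected == self.actual


class OracleFailure(AssertionError):
    """Raised so a disagreement stops the run instead of being logged and ignored."""


def enforce(result: OracleResult) -> OracleResult:
    # Fail loudly: the message names the fixture, the invariant, and both values.
    if not result.passed:
        raise OracleFailure(
            f"{result.fixture}/{result.invariant}: "
            f"expected {result.expected}, actual {result.actual}"
        )
    return result


# A passing check returns its result; a failing one raises.
enforce(OracleResult("go-static", "no-runtime-lib-load", expected=0, actual=0))
```

The design choice worth noting is that the failure carries its own evidence. A red run that only says "failed" still needs a human to reconstruct what was expected.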

Reading guide

Before the rest of the post gets technical, here is the vocabulary in plain English.

An oracle is a check with a known right answer. If the tool disagrees, the run fails.

A fixture is a small test container built to prove one behavior: a static Go binary, a dynamic cgo binary, or a Python app with multiple worker processes. An invariant is the rule that fixture must always satisfy. A static binary should not show a runtime shared-library load. A multiprocess app should not get guessed attribution.

A collector is the mechanism that observes runtime behavior. In this post, that means a local proc-maps collector and a Linux eBPF collector. Runtime evidence is what the tool actually observed when the container ran. Process attribution is tying that observation back to the process that caused it. If we cannot do that safely, the tool should say it does not know.

What we test

Correctness for this kind of product is not one property, so the assurance work cannot be one test. The first slice we have pushed hard is runtime evidence, process attribution, fixture invariants, and collector agreement. The larger map is what we have to keep proving over time:

1. Static graph construction. Packages, native extensions, OS layers, ownership.

2. Runtime evidence. What loads during the observed window.

3. Process attribution. Which process caused an observation, and when to refuse to guess.

4. Collector agreement. Different observers should agree on the deterministic verdict even when raw signal volume differs.

5. Vulnerability database behavior. Feed drift must not masquerade as a tool regression.

6. Fixture invariants. Behavioral expectations, not brittle counts.

7. Exploit confirmation. Exercise a vulnerable path and verify that the patched twin goes silent.

8. AI triage. Agents read and explain evidence; they never produce truth.

The product still has to be right about the rest. The graph it builds, the vulnerability data it joins in, exploit confirmation, and AI triage all stay open until they have the same shape of proof: a green run, a red run, and an artifact behind the claim.

The first proven slice

We started with runtime evidence because that is where correctness gets subtle.

If a library was loaded by the app server, that is useful evidence. If it was loaded by a shell helper or an injected reader, that is a different claim. If a Python import is attributed to the wrong worker in a multiprocess app, the report looks more precise while becoming less true.

So the first oracle checks process attribution. In a single-process runtime, inferred attribution is acceptable when there is exactly one candidate. In a multiprocess runtime, the system must abstain rather than guess.
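The rule can be sketched in a few lines. This is an illustration under assumptions, not the product's implementation: `Attribution`, `attribute_import`, and the `"inferred"` / `"abstained"` labels are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Attribution:
    pid: Optional[int]  # None when the tool abstains
    source: str         # "inferred" or "abstained" (hypothetical labels)


def attribute_import(candidate_pids: Sequence[int]) -> Attribution:
    """Infer a process only when exactly one candidate could have caused the event."""
    unique = set(candidate_pids)
    if len(unique) == 1:
        return Attribution(pid=next(iter(unique)), source="inferred")
    # Zero or several candidates: refuse to guess.
    return Attribution(pid=None, source="abstained")
```

With one interpreter, `attribute_import([41])` yields an inferred attribution to pid 41. With two workers, `attribute_import([41, 42])` abstains, which is the behavior the multiprocess fixtures exist to enforce.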

The current corpus is intentionally small enough that each fixture has a job. Static Go and dynamic cgo separate "compiled in" from "loaded at runtime." Single-process Python proves the safe inference case. Gunicorn and explicit Python multiprocessing prove that the system refuses to guess when more than one process could have caused the event.

go-static: Static Go binary; zero runtime shared-object load

go-cgo-dynamic: Dynamic cgo binary; shared-object load must be observed

probe-python-single: One interpreter; package attribution may be inferred

python-gunicorn: Pre-fork server; native layer captured, unsafe inference avoided

python-multiprocess: Multiple workers; ambiguous attribution must abstain

Coverage: 5 fixtures · 22 invariants · all green on clean checkout

These are behavioral checks, not count snapshots. Counts can move when the graph gets better, but the blocking checks are about what the oracle lets the product claim. The probe must run, dynamic linkage must produce a runtime load, static linkage must not, process fidelity must be recorded, and multiprocess attribution must not invent certainty.
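One way to picture the difference between a behavioral check and a count snapshot is the sketch below. The report keys (`probe_ran`, `runtime_lib_loads`, `inferred_count`) and every invariant name except `multi-candidate-abstention` are assumptions made for illustration, and only three of the five fixtures are shown.

```python
from typing import Callable, Dict, List

Report = Dict[str, int]
Invariant = Callable[[Report], bool]

# Each invariant is a predicate on behavior, not an exact expected count.
INVARIANTS: Dict[str, Dict[str, Invariant]] = {
    "go-static": {
        "probe-ran": lambda r: r["probe_ran"] == 1,
        "no-runtime-lib-load": lambda r: r["runtime_lib_loads"] == 0,
    },
    "go-cgo-dynamic": {
        "probe-ran": lambda r: r["probe_ran"] == 1,
        # ">= 1", not "== 3": the count may move as the graph improves.
        "runtime-lib-load-observed": lambda r: r["runtime_lib_loads"] >= 1,
    },
    "python-multiprocess": {
        "multi-candidate-abstention": lambda r: r["inferred_count"] == 0,
    },
}


def blocking_failures(fixture: str, report: Report) -> List[str]:
    """Return the names of the invariants this report violates."""
    return [
        name
        for name, check in INVARIANTS[fixture].items()
        if not check(report)
    ]
```

Under this sketch, a `go-cgo-dynamic` report stays green whether it records 3 loads or 12, while a `python-multiprocess` report with any inferred attribution at all is a blocking failure.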

On the canonical green run, all five fixtures passed: 22 of 22 invariants, zero blocking failures, on a real Linux amd64 kernel (6.17).

Evidence status

fixtures: 5

invariants: 22 passing, 0 blocking failures

runtime: real Linux amd64 kernel

collector: eBPF lane exercised; local proc-maps lane also compared

The red run is the point

A suite that only goes green is not proving much. What matters is whether it can fail in the specific way the product would otherwise fail customers.

One runner went red in exactly the way the oracle was built to catch. The infrastructure was healthy: kernel setup, binding generation, and artifact upload all succeeded. What failed was the code under test.

In the python-multiprocess fixture, the multi-candidate-abstention invariant expected inferred attribution to stay at 0. It came back with 157. The tool had attributed imports in a multiprocess app instead of abstaining.

Evidence status

run label: 2026-06-08 red run

environment: Linux amd64, kernel 6.17

corpus: 5 fixtures / 22 invariants

failure: expected 0, actual 157

evidence note: raw artifact retained in the internal evidence pack

python-multiprocess

derived.inferred_count: 0 → 157 (expected → actual)

Blocking invariant failure: multiprocess attribution guessed when it should abstain

The same defect tripped a unit check independently: an ambiguous two-process case should not be attributed, but the function returned an inferred process source. That gave us two checks pointing at the same cause, with only one of the five fixtures going red. A regression in inference became a red artifact with a fixture name, an expected value, an actual value, and preserved logs, rather than a confident runtime claim shipped to a user.

We also keep a seeded-regression proof for this class of bug. We intentionally weaken the abstention guard, run the suite, and make sure python-multiprocess fails. Then we put the guard back and make sure the suite returns to green. That proof is saved as an internal artifact, not described from memory.
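The weaken, run, restore, run loop can be sketched as follows. Everything here is a stand-in: `guarded`, `weakened`, and the three synthetic two-candidate events are hypothetical, and the real proof runs against the actual fixture rather than an in-memory list.

```python
from typing import Callable, List, Optional, Sequence

Attributor = Callable[[Sequence[int]], Optional[int]]


def guarded(candidates: Sequence[int]) -> Optional[int]:
    """Correct behavior: attribute only when exactly one process is a candidate."""
    return candidates[0] if len(candidates) == 1 else None


def weakened(candidates: Sequence[int]) -> Optional[int]:
    """Seeded regression: the abstention guard is removed, so the tool guesses."""
    return candidates[0] if candidates else None


def inferred_count(attribute: Attributor, events: List[Sequence[int]]) -> int:
    return sum(1 for candidates in events if attribute(candidates) is not None)


def suite_is_green(attribute: Attributor) -> bool:
    # Stand-in for python-multiprocess: every event has two candidate processes.
    events: List[Sequence[int]] = [[101, 102], [101, 102], [101, 102]]
    return inferred_count(attribute, events) == 0


def seeded_regression_proof() -> bool:
    """The suite must go red with the guard weakened and green with it restored."""
    red_when_broken = not suite_is_green(weakened)
    green_when_restored = suite_is_green(guarded)
    return red_when_broken and green_when_restored
```

The proof only counts if both halves hold. A suite that stays green with the guard weakened is not testing the guard.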

One collector is not the story

The kernel collector matters, but it is not the story by itself.

Runtime evidence can come from different observers. A local proc-maps lane is useful and cheap. A Linux eBPF lane is more authoritative for process-aware events. The important question is not whether their raw counts match, because they should not. The important question is whether the deterministic verdicts match. If they did not, the tool's answer would depend on where you ran it: a developer on Docker Desktop would get a different security verdict than CI on Linux.

On the same fixture corpus, the two collectors agreed on the behavior that matters:

Evidence status

static Go runtime lib-load: eBPF 0; proc-maps 0; agree

multiprocess inferred attribution: eBPF 0; proc-maps 0; both abstain

single-process inferred attribution: eBPF 157; proc-maps 157; agree

raw signal volume: differs substantially by collector

fidelity: eBPF authoritative; proc-maps live snapshot

reading: we assert behavior, not raw signal volume

We do not turn every number into a gate. Some numbers are facts to explain. The gate belongs on the claim the product makes.
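A sketch of gating on the claim rather than the number: each collector's report is reduced to the verdicts the product would state, and only those are compared. The report keys and the `raw_events` values are invented for illustration.

```python
from typing import Dict


def verdicts(report: Dict[str, int]) -> Dict[str, bool]:
    """Reduce a collector report to the claims the product would make."""
    return {
        "static_binary_loaded_lib": report["static_lib_loads"] > 0,
        "multiprocess_guessed": report["multiprocess_inferred"] > 0,
    }


def collectors_agree(a: Dict[str, int], b: Dict[str, int]) -> bool:
    # raw_events is deliberately not compared: volume is a fact to explain, not a gate.
    return verdicts(a) == verdicts(b)


# Illustrative reports; the raw_events figures are made up.
ebpf = {"static_lib_loads": 0, "multiprocess_inferred": 0, "raw_events": 4200}
proc_maps = {"static_lib_loads": 0, "multiprocess_inferred": 0, "raw_events": 310}
agreement = collectors_agree(ebpf, proc_maps)
```

Here `agreement` is true even though the raw volumes differ by an order of magnitude, because both collectors support the same claims.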

Confirming a real vulnerability

Detecting a CVE is not enough. The scanner has to show whether the vulnerable package is reachable, whether the path was actually exercised, and whether the same test goes quiet after the fix.

We have a first matched pair, not a full exploit lane yet. It uses the same app, the same startup path, and the same runtime exercise, with only the dependency version changed. One side uses the vulnerable version of requests; the other uses the fixed version.

Evidence status

CVE: CVE-2023-32681 in requests

vulnerable variant: CVE present; statically reachable; runtime observed; confirmed

patched variant: CVE absent after version bump

controlled variable: dependency version

status: first exploit-confirm data point; not yet a gating lane

It is a small result, but it has the shape we want: vulnerable build confirms, patched twin clears, and the changed variable is narrow enough that the result means something. We have not promoted this into the fixture gate yet. Until we do, it stays labeled as a first data point.
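If the pair is promoted into the gate, its check might look something like this sketch. `VariantResult` and `matched_pair_holds` are hypothetical names, and treating "confirmed" as the conjunction of the three observations is an assumption drawn from the status table, not a statement of how the tool computes it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantResult:
    cve_present: bool
    statically_reachable: bool
    runtime_observed: bool

    @property
    def confirmed(self) -> bool:
        # Assumption: confirmation requires all three signals together.
        return self.cve_present and self.statically_reachable and self.runtime_observed


def matched_pair_holds(vulnerable: VariantResult, patched: VariantResult) -> bool:
    """The vulnerable build must confirm and the patched twin must go silent."""
    return vulnerable.confirmed and not patched.cve_present


vulnerable = VariantResult(cve_present=True, statically_reachable=True, runtime_observed=True)
patched = VariantResult(cve_present=False, statically_reachable=True, runtime_observed=True)
pair_ok = matched_pair_holds(vulnerable, patched)
```

The patched twin still reaches and exercises the same code path in this sketch. Only the CVE finding disappears, which is what makes the dependency version the single controlled variable.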

Why AI is not the oracle

We build AI agents. We are not going to let one decide whether our security tool is correct.

An AI evaluator can read evidence, summarize a run, and flag what looks suspicious. What it must never do is mint truth by asserting that a run passed when no deterministic check did.

The useful output shape is narrow: verdict from the deterministic gate, failed invariant, expected value, actual value, corroborating unit failure, and a hypothesis about the likely code path. The agent does not assert pass or fail, does not modify an expected value, and does not touch the gate.

Oracles decide; agents triage.
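One way to make that boundary structural is to give the agent a read-only copy of the verdict. This is a sketch of the output shape only: `GateVerdict`, `TriageNote`, and `write_triage` are hypothetical names, and the hypothesis string is an invented example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GateVerdict:
    """Produced only by the deterministic gate. Frozen, so it cannot be edited later."""
    fixture: str
    invariant: str
    expected: int
    actual: int
    passed: bool


@dataclass(frozen=True)
class TriageNote:
    verdict: GateVerdict  # copied from the gate, never computed by the agent
    hypothesis: str       # the agent's only original contribution
    corroborating_unit_failure: Optional[str] = None


def write_triage(
    verdict: GateVerdict,
    hypothesis: str,
    unit_failure: Optional[str] = None,
) -> TriageNote:
    # The agent passes the verdict through unchanged and adds an explanation.
    return TriageNote(
        verdict=verdict,
        hypothesis=hypothesis,
        corroborating_unit_failure=unit_failure,
    )


gate = GateVerdict("python-multiprocess", "multi-candidate-abstention", 0, 157, passed=False)
note = write_triage(gate, "abstention guard may be bypassed for multi-candidate events")
```

Because both dataclasses are frozen, an attempt to set `note.verdict.passed = True` raises an error instead of quietly changing the result.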

What remains unproven

The edges matter. Here is what we are not claiming yet.

Evidence status

static graph construction: exercised in every run but not yet asserted by dedicated invariants

pinned vulnerability database lane: not yet proven for repeated inventory/count determinism

exploit confirmation: one matched pair; not yet promoted into the fixture gate

AI evaluator: advisory shape defined; no gate write path

fixture corpus: small; five deterministic fixtures plus one exploit pair

arch/kernel matrix: amd64 and one recent kernel so far

infrastructure cleanup: self-delete is best-effort until a sweeper exists

What is next

The pinned database lane should prove that repeated runs against the same vulnerability snapshot produce stable results. The exploit-confirm pair should become a real fixture lane with invariants. The AI evaluator should read deterministic reports and cite them without ever minting truth. The kernel matrix should expand beyond one amd64 kernel.

Until each lands with an artifact behind it, we will not claim it.

An assurance system is never complete. It grows with the product. What matters is whether the standard is real enough to fail us when the tool overclaims. The first slice has already done that. The next post goes inside the real-kernel lab that made these runs possible.