Groundzero: Building an Isolated Kernel and Exploit Lab

Groundzero is Emphere's research lab for automated security validation: real kernels, sandboxed exploit ranges, preserved artifacts, and agentic loops that make testing, exploitation, and remediation more exact over time. This is the story of building it, including the parts that broke.

Emphere Engineering · Assurance · June 17, 2026 · 11 min read

We paused the product to build a lab

We stopped shipping features for a while to build a place to test ourselves.

The reason was uncomfortable. Our eBPF runtime path had quietly rotted into three real bugs, and we did not notice for a while, because nothing was running it on a real kernel on every change. The unit tests were green. The product looked fine. The thing that was supposed to watch processes from the kernel was wrong, and it was wrong silently.

That is the failure mode that should scare anyone building a security tool. A normal program crashes when it breaks. A security tool keeps returning clean-looking answers that happen to be false. You cannot test your way out of that with ordinary CI, because ordinary CI runs on whatever kernel the vendor handed it and never touches a real exploit.

So we built Groundzero. It is the research lab where a security claim gets exercised, attacked, contradicted, repaired, and re-run with the evidence kept. Real kernels for the runtime collectors. Sandboxed ranges for vulnerable services. An artifact behind every result. Agents to help triage and improve the system, with no permission to invent truth.

This post is the story of building it. The forks we took, the parts that broke, and what a clean run looks like now.

Reading guide

A few words in plain English first.

A runner is a short-lived machine that executes one validation run. It boots, builds the product from a known commit, runs the suite, uploads artifacts, and goes away.

An artifact is what survives the run: logs, raw scan output, fixture reports, kernel metadata, commit metadata, exploit output, and per-stage status. If a claim matters, it points back to an artifact.

A kernel collector observes runtime behavior from the operating system. The one that matters here is eBPF, which watches process behavior from inside the kernel and has to be tested on a real one.

An exploit range is the fenced part of the lab where vulnerable services can be started and attacked, with the evidence kept and the blast radius contained.

An agentic loop is the automation we are building around the artifacts. Agents read reports, cluster failures, propose fixtures, suggest exploit adaptations, and draft remediations. The artifacts decide what happened.

Decision one: where do you run a test that needs a real kernel

The first fork was the boring one that decides everything else.

eBPF needs a recent kernel, BTF, and the privilege to attach probes. That rules out most of the obvious places. A laptop on macOS runs Docker inside a Linux VM with a kernel we did not choose and cannot pin. Cloud Build and a managed Kubernetes cluster do not hand you a privileged, disposable, real-kernel box you can attach probes to. We needed root on a real kernel that we could throw away after every run.

So we went to the least magical option: an ephemeral, privileged virtual machine. It boots from a startup script, builds the product, runs the suite, uploads the results, and deletes itself. No standing fleet. No pet servers. Each run is a fresh machine that exists for as long as the work takes and not a second longer.

We gave up managed CI convenience for the one part we could not fake: the test runs on a real kernel, and we can prove which one.

Decision two: separate the work you trust from the work you do not

The second fork was about blast radius.

Two kinds of work happen in this lab, and they do not deserve the same trust. One is running our own code against our own fixtures. The other is starting a genuinely vulnerable service and attacking it. The first is privileged but friendly. The second is privileged and, by design, closer to hostile.

We split them. A trusted runner does the deterministic assurance work: build the product at a known commit, generate the eBPF bindings for the live kernel, run the fixture suite, diff against the previous run, upload the artifacts. A separate exploit range is where vulnerable services and exploit confirmation belong, behind a stricter boundary, because the service may be hostile and the exploit may be adapted.

Here is the shape we settled on.

Two jobs, two blast radii. Everything leaves as an artifact.

One honest caveat. Today the exploit lane still runs from the trusted runner against an egress-denied container network, as an interim posture. The fully separate range is designed and partly wired. We will not call it complete until the whole path runs there. The boundaries in that diagram are real, but the last one is still being built.

The runner is allowed to do very little

The trusted runner is built to do one job and leave.

It boots with no external IP address. It reaches the internet through a NAT only for startup work: packages, toolchains, and source. It reads a read-only repository token from Secret Manager, uses it once, and shreds it. It uploads to a single artifacts bucket. It carries a watchdog so a stuck run turns into a deleted machine instead of a meter that runs all night.

The exploit side is tighter. The container network denies egress. The metadata endpoint is blocked so a compromised target cannot read the instance credentials. The output model is file-first: if the lab proves something, it leaves as an artifact.

Evidence status

runner network: no external IP; NAT egress for startup only

credentials: read-only repository token, used once, then shredded

runner lifetime: ephemeral VM with a watchdog that deletes on timeout

exploit network: egress denied; metadata endpoint blocked

output model: artifact upload; claims point back to files

Short-lived compute, narrow credentials, explicit network boundaries, evidence that survives the run. That is the posture we want around the hard parts. Getting there was not clean.

The lab became useful by failing

The first useful thing Groundzero did was show us how much of our own plan was wrong.

It started with the toolchain. The runner booted, and Go could not find its module cache in the minimal startup environment, so the build that was supposed to be deterministic was anything but. Then the artifact upload turned out to need more than write access to behave. Then we learned that the self-delete command we trusted did not do what we thought, which on an ephemeral-VM design is the difference between a clean lab and a surprise bill.

Then there was apt. A fresh machine's first package fetch over the NAT would occasionally return a transient failure, and because that fetch installs the C compiler, one network blip would starve the whole build and fail the run. It cost us a couple of forty-minute runs before we wrapped the package install in a short retry. None of these were product bugs. All of them mattered, because a lab that cannot reliably build, upload, or clean up after itself is not an assurance system yet. It is a script that sometimes works.

Evidence status

toolchain: fresh runner now pins Go module cache + records toolchain setup

upload: final-status.txt records upload success or failure, every run

self-delete: watchdog deletes on completion and on timeout

apt: package install wrapped in a retry after a transient blip wasted two runs

Then the product failed where it mattered

The infrastructure failures were warm-up. The ones that justified the whole project were in the kernel path.

The eBPF collector deadlocked under high signal volume. A container with a busy startup produced enough runtime events to wedge the collection path, which is the exact moment you need observability to keep up rather than freeze. We found it because a real container on a real kernel pushed enough load to trigger it. A mocked test never would have.

Then, after that fix, the collector missed startup-only library loads. It was attaching just slightly too late, after the process had already done its most interesting loading near the very beginning of its life. So the collector was technically running and quietly incomplete, which is the worst combination for a tool whose job is to be trusted.

Both became fixes, regression tests, and preserved artifacts. That is why the lab exists: not to make the product look clean, but to make weak claims fail loudly enough that we have to strengthen them.

The containment bug that broke every test

The best flop was self-inflicted, and it is our favorite story from this build, because the lab's own safety control is what broke the lab.

We wanted the vulnerable containers to have no way out. The instinct was simple: put them on a Docker network marked internal, which blocks the container from reaching anything. We did, and every single bring-up failed. The exploit lane could not get a single target to respond.

The symptom was a container that started fine and then answered HTTP 000 on the port we had published for it. Not a 500, not a timeout with a reason. Nothing: 000 is not a real status code, it is what a client such as curl reports when no HTTP response arrives at all. It took an embarrassing amount of staring to see it: the one flag that blocked the container's egress was also blocking the path we used to reach the container. We had locked the door from both sides.

The fix was a fork in how we thought about containment. The blunt internal flag denies traffic in both directions. What we actually wanted was asymmetric: reachable from the test harness, unreachable to the internet. So we moved to an ordinary bridge network and denied egress at the firewall instead, which lets the harness reach the published port while the target still has nowhere to call out.

A normal network answered 200. The locked one answered 000. That one digit was the bug, and chasing it taught us that a containment control you do not test is just a different way to be wrong.

Evidence status

symptom: every brought-up target answered HTTP 000 on its published port

cause: the internal-network flag blocked ingress as well as egress

fix: ordinary bridge + egress denied at the firewall (asymmetric containment)

check: normal network → 200; locked network → 000

lesson: an untested containment control is just a quieter way to be wrong

What a clean run leaves behind

Once the flops were paid off, the runner started producing the thing the whole project was for.

A full run now goes from a fresh machine to a self-deleted one in about forty minutes, building the product, running the fixture suite on a real kernel, exercising the exploit lane, and uploading everything. The output is deliberately plain. It is not a dashboard and it is not a screenshot. It is a directory of files.

Evidence status

kernel: 6.17.x-gcp x86_64, recorded per run

fixtures: 5 fixtures, 22 invariants, on the real kernel

status: per-stage status.txt (clone, toolchain, bpf2go, eBPF unit, assurance, diff, upload)

lifecycle: fresh VM → run → self-delete, about 40 minutes, confirmed clean

artifacts: the exact commit, kernel string, build and unit logs, per-fixture report + scan, and a run-over-run diff

The files matter more than the summary. When a run goes red, we do not want a message that says something failed. We want the fixture name, the expected value, the actual value, the raw output, the logs, the kernel, the commit, and the diff sitting in one place, ready to read. That is how a failure turns into a fixture, then a gate, then a product claim we can defend.

One run, start to finish. Everything between the ends is preserved.

What the lab has actually run

It is worth saying the size out loud, because "a fixture suite" badly undersells it.

The numbers below are not a roadmap and not a projection. They are what is sitting in the artifacts bucket right now, across every run we have kept.

Evidence status

preserved runs: 42 ephemeral runner sessions, each with its kernel, commit, logs, and outcomes

exploit corpus: 152 vulnerable application families, 484 CVE scenarios

exploit attempts: 3,430 image-by-CVE fire records, every one with a deterministic outcome

confirmed exploits: 178 across the runs, each tied to the path that actually fired

reachability oracle: 30 real Go images measured against an independent call-graph tool, with 83.4% precision and 83.8% recall

ground-truth fixtures: 36 curated constructed cases across 5 ecosystems, scoring 100% recall and 89.5% precision. The lab's own safe-method fixtures surfaced package-level false positives on npm and Python that a flat score would have hidden

symbol intelligence: 783 Go CVE-to-function mappings ingested for symbol-level reachability

Two caveats on those numbers.

The exploit attempts are image-by-CVE fire records across many runs, and a lot of those runs re-fired the same corpus while we hardened the loop. So 3,430 is throughput, not distinct targets. The distinct corpus is the 484 scenarios, and the confirmed exploits are deduplicated by the only thing that counts: a vulnerable path that actually fired and left evidence.

The accuracy lanes and the exploit lane answer different questions on purpose. The exploit lane proves positive reachability: a real vulnerable path fires under a real workload, and when deph had called that path not-reachable, the miss is proven. The constructed fixtures and the differential oracle measure precision and recall against a known answer. One catches what we miss. The other catches what we over-claim. Neither is allowed to grade itself.

That separation matters because runtime evidence is not symmetric. If deph says a vulnerable path is not reachable and the lab watches that function execute, the miss is proven. If deph says a path is reachable and the lab does not see it execute, that proves much less: only that this workload did not hit it. The path may need a different input, a different config, or a branch the driver never reached. So Groundzero treats observation as a way to prove missed risk and confirm true positives, never as a way to clear a finding by silence. Precision needs different evidence: inventory contradictions, absent symbols, absent files, constructed answer-key fixtures, and careful adjudication.
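The asymmetry above fits in a few lines of code. This is a minimal sketch of the rule as stated, not deph's or Groundzero's implementation: `adjudicate` and the `Outcome` values are hypothetical names.

```go
package main

import "fmt"

// Outcome is what one observation window is allowed to conclude.
type Outcome string

const (
	ProvenMiss        Outcome = "proven miss: static said not reachable, but the function ran"
	ConfirmedPositive Outcome = "confirmed: static said reachable and the function ran"
	NotObserved       Outcome = "not observed: the finding is left exactly as it was"
)

// adjudicate combines a static verdict with a runtime observation.
// A fire can confirm or contradict a verdict. Silence never clears a
// finding, so there is deliberately no "cleared" outcome.
func adjudicate(staticReachable, fired bool) Outcome {
	switch {
	case fired && !staticReachable:
		return ProvenMiss
	case fired && staticReachable:
		return ConfirmedPositive
	default:
		return NotObserved
	}
}

func main() {
	cases := []struct{ static, fired bool }{
		{false, true}, {true, true}, {true, false}, {false, false},
	}
	for _, c := range cases {
		fmt.Printf("static=%v fired=%v -> %s\n", c.static, c.fired, adjudicate(c.static, c.fired))
	}
}
```

The design choice is in what the type cannot express. Because no outcome means "cleared by silence," a quiet workload cannot be turned into a precision claim by accident.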

The constructed corpus number is deliberately scoped. It says how those answer-key cases scored. It does not say the product is 100% accurate in the wild. The lab has already found frontiers that we did not paper over: Java dead-code paths, Python lazy imports, and framework dispatch that a static call graph cannot safely demote yet.
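For readers who want the definitions behind those percentages, precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below computes both from raw counts. The `Counts` type and the numbers in `main` are purely illustrative and are not the lab's counts, which this post does not break down.

```go
package main

import "fmt"

// Counts holds the outcome tallies from scoring against an answer key.
type Counts struct {
	TP int // flagged, and truly reachable
	FP int // flagged, but not reachable (over-claim)
	FN int // not flagged, but reachable (miss)
}

// Precision is the share of flagged findings that were correct.
func (c Counts) Precision() float64 {
	if c.TP+c.FP == 0 {
		return 0
	}
	return float64(c.TP) / float64(c.TP+c.FP)
}

// Recall is the share of real findings that were flagged.
func (c Counts) Recall() float64 {
	if c.TP+c.FN == 0 {
		return 0
	}
	return float64(c.TP) / float64(c.TP+c.FN)
}

func main() {
	// Illustrative counts only.
	c := Counts{TP: 3, FP: 1, FN: 0}
	fmt.Printf("precision=%.1f%% recall=%.1f%%\n", 100*c.Precision(), 100*c.Recall())
}
```

A perfect recall figure alongside a lower precision figure says the answer-key cases were all caught, at the cost of some over-claims, which is the recall-first trade the post describes.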

Every one of these points back to a file. None of it is a dashboard we typed a number into.

The most useful thing the lab found was our own miss

The exploit lane surfaces our real misses by firing real exploits. The differential lane does it a quieter way: it runs deph beside the Go team's own call-graph tool over real images and flags every place the two disagree. One of those disagreements was ours, and it was the dangerous direction.

deph had a rule that looked obviously right. If a vulnerable function is not in a compiled binary's symbol table, the vulnerable code is not there, so downgrade the finding. The trouble is the compiler. Small functions get inlined into their callers — the vulnerable code is compiled in, doing exactly what the advisory warns about, but it no longer has a symbol of its own. deph saw "symbol absent," concluded "not present," and quietly cleared a finding that was real. A false negative: the worst kind of error for a scanner, because nothing looks wrong.

The differential caught it on a JWT library — an audience-check bypass that was live in the binary, inlined into the validation routine, and which deph had downgraded to "installed." We fixed that finding. Then the harder question, the one the lab exists to ask: how many more like it? An audit of the vulnerability-symbol database found the same shape in roughly a quarter of the Go entries — one small, inline-prone function with nothing to fall back on.

The tempting move was a blanket rule: when a downgrade rests entirely on a single inline-prone function, never downgrade. So we built it — and then did the thing that actually matters. We measured it. The runner ran the differential before and after, over thirty real images. The blanket rule recovered exactly one real miss and introduced fifty-one new over-claims — a five-point precision drop to fix a single finding. On the binaries we tested, the inlining we feared almost never actually happened; most of those downgrades were correct. So we threw the rule away and kept the surgical fix: when the oracle proves a specific finding was inlined out, we add the surviving caller to that one entry and leave the rest alone.
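The surgical fix can be sketched as a symbol check with a per-entry fallback. This is an illustration under assumptions, not deph's rule engine: `SymbolEntry`, `mayDowngrade`, and the `example/jwt` symbol names are all hypothetical.

```go
package main

import "fmt"

// SymbolEntry is a hypothetical shape for one vulnerability-symbol record.
type SymbolEntry struct {
	Vulnerable string   // the function the advisory names
	Callers    []string // surviving callers, added only when an oracle proved inlining
}

// mayDowngrade reports whether "symbol absent" is enough to downgrade a
// finding. If the vulnerable function or any recorded surviving caller is
// in the symbol table, the code may be present and the finding stays.
func mayDowngrade(e SymbolEntry, symtab map[string]bool) bool {
	if symtab[e.Vulnerable] {
		return false
	}
	for _, c := range e.Callers {
		if symtab[c] {
			return false
		}
	}
	return true
}

func main() {
	// The small check was inlined away; only its caller kept a symbol.
	symtab := map[string]bool{"example/jwt.Validate": true}

	before := SymbolEntry{Vulnerable: "example/jwt.checkAudience"}
	after := SymbolEntry{
		Vulnerable: "example/jwt.checkAudience",
		Callers:    []string{"example/jwt.Validate"},
	}
	fmt.Println("downgrade before fix:", mayDowngrade(before, symtab))
	fmt.Println("downgrade after fix: ", mayDowngrade(after, symtab))
}
```

Entries without a recorded caller behave exactly as before, which is what keeps the fix from repeating the blanket rule's over-claims.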

That story is the lab in miniature. The important part is the ending. The product looked correct; a deterministic oracle caught a real miss. We had a clever fix for the whole class; the same oracle measured it and told us it was a bad trade. We believed the measurement over our own idea — the miss got fixed, the clever rule got rejected, and both decisions point back to numbers in a file.

The same trap, one layer down

The Go miss was inlining defeating a static rule. The next one was inlining defeating a runtime oracle — and it taught us a sharper version of the same lesson.

We wanted one clean idea for runtime reachability: watch from the kernel. For Go it works beautifully — attach a probe to the exact vulnerable function and a fire means that code ran, with no agent inside the container that the container could lie to. We tried to carry the same idea to Python by attaching the probe to the interpreter's evaluation loop, the one C function every Python frame is supposed to pass through. Watch the door, see every function.

So we tested it the only way that counts. Not a unit test — a real application: a small web service with a genuinely vulnerable PyYAML version, brought up in the range and hit with a real request that runs the vulnerable yaml.load. The function ran. The kernel probe saw nothing.

The cause is the same word as before. CPython 3.11 and later inlines a pure-Python call into its caller instead of re-entering the evaluation loop. The optimization that made the interpreter faster also means the "door" is only used when control crosses from C into Python, not for an ordinary Python-calling-Python call. We confirmed it with the kind of controlled experiment the lab is for: the exact same function, attached at the exact same address. Called directly, the kernel probe recorded it zero times across a run of requests. Called through a C builtin so the call crossed the boundary, it fired on every one: five for five. The probe was attached correctly. It was simply blind to the common case by construction.

That killed the tidy story, and the honest version is better. A kernel probe cannot see everything inside an interpreter that hides its own calls, so the comprehensive signal has to come from inside the interpreter — Python's own function-entry instrumentation, switched on for the test window and read back out. That path uses sys.monitoring; it does not need kernel offsets. The offset work belongs to the companion eBPF probe, where we mapped the interpreter's frame layout empirically across CPython 3.11 through 3.14 so the kernel-side boundary observer could be checked instead of guessed. The kernel probe keeps the job it is actually good at: the functions that cross a C boundary, and compiled languages like Go, where it is also the un-forgeable signal for code we do not trust. Two observers, each used where it tells the truth.

And the rule that makes this safe is the one the lab keeps returning to. A fire is evidence the function ran. Silence is not evidence it cannot. A function nobody exercised in a thirty-second window is "not observed," never "not reachable" — the dynamic signal is only allowed to strengthen a finding, never to quietly clear one. That is the same boundary that stopped the Go inlining miss from becoming policy, applied before it could become one here.

PHP gave us the same lesson with a different trap. We expected it to be cleaner than Python: a named execute_ex symbol, a single engine entry point, no obvious interpreter inlining story. The runner disagreed. Stock PHP 8 uses the Zend HYBRID VM, and a direct PHP-to-PHP method call does not re-enter execute_ex. The eBPF probe attached correctly, but it was boundary-only: the direct call fired zero times, while the same shape called through a C boundary fired 119 times.

So PHP got the same layered oracle as Python, but with PHP's native hook. The kernel-side execute_ex probe stays the un-forgeable boundary confirmer. The comprehensive truth layer is a Zend Observer extension, built inside the trusted fixture image and labeled as in-process evidence. That observer caught the direct call the kernel could not see, and then confirmed a real Composer CVE end to end: Twig CVE-2022-39261, observed through FilesystemLoader::findTemplate, recorded as in-process tracer evidence. It is not product instrumentation. It is Groundzero's answer key.

The oracle became useful immediately. In a two-case Twig adjudication it confirmed the real FilesystemLoader::findTemplate path, then caught a method-level false positive where an app used Twig through ArrayLoader while deph still marked CVE-2022-39261 as reachable. That is not a PHP precision rate yet. It is a measured over-claim class, and it pointed at an obvious-looking fix: guarded PHP method-level refinement.

So we built that fix too, and the lab rejected it. The refinement preserved the true positive when FilesystemLoader was referenced plainly, and it fixed the ArrayLoader false positive. Then it broke on ordinary PHP: new Environment(new FilesystemLoader(...)). Our static extractor saw the outer Environment construction, missed the nested FilesystemLoader, and would have demoted a real vulnerable path to "installed." That is a false negative, so the rule does not ship. PHP stays package-level and recall-first in the product; the Observer is the method-level truth layer in the lab.

The agentic part has a hard boundary

We use agents here because the surface area is too large to work by hand. Agents can find patterns across failed runs, summarize a diff, cluster similar failures, propose a new fixture, suggest an exploit adaptation, or draft a remediation test. Over time that becomes a loop: the lab makes evidence, agents help turn evidence into better tests, and the next run says whether the improvement was real.

The rule is the same as in the first post. Agents do not decide truth. The recorded result does. If an exploit fires, the evidence is the exploit output. If a remediation works, the evidence is the vulnerable path going quiet. If a kernel collector regresses, the evidence is the fixture failure and the raw logs. The agent can read, explain, and propose. It cannot mint a pass. That is how we use AI here without turning the lab into another source of confident wrongness.

What Groundzero is allowed to prove

Groundzero is not a synonym for safe. It is a research environment with boundaries and receipts, and it is honest about its own edges.

It can prove that the kernel collector builds and runs on a real Linux kernel. It can prove that fixture invariants pass or fail on that kernel. It can prove that a run produced artifacts instead of vanishing into a terminal scrollback. It can prove that behavior broadened or narrowed against the previous run.

It cannot prove behavior on every production kernel from the one kernel we have in evidence so far. It cannot prove the exploit range is complete while the exploit lane still runs partly through the trusted runner. It cannot prove cost discipline until billing data is tied to run identifiers. The work includes saying exactly where the boundary sits.

Evidence status

proven: real-kernel runner output, with the exact kernel and commit recorded

proven: eBPF build and unit path plus the fixture suite, kept as artifacts

proven: clean self-delete and per-stage status across runs

proven: a Python CVE confirmed reachable by sys.monitoring on a real request, with eBPF kept as the boundary observer

proven: a PHP real-CVE oracle: eBPF execute_ex for boundary calls, Zend Observer for comprehensive in-process fixture truth

proven: a Node assurance observer: V8 precise coverage confirms watched npm functions, and a flush-complete marker is required before a zero-hit run is trusted

not yet: arm64 and a kernel matrix

not yet: a JVM assurance observer

not yet: the fully separate exploit range carrying the whole exploit lane

not yet: billing-backed cost per run

Which cell a claim lands in — and what is allowed to prove it — is the asymmetry the whole lab turns on. You can prove a missed risk by watching the code run; you can never prove an over-claim by watching it not run.

The reachability oracle: each cell, and what proves it. Execution proves the top row; the bottom row needs construction and differential analysis.

The standard

Groundzero is how we turn security research into a repeatable system.

If the claim is about a kernel collector, the test should run on a real kernel and keep the evidence. If the claim is about exploit confirmation, the vulnerable service should run where a compromise has nowhere useful to go. If the claim is about remediation, the patched build should prove the path went quiet. If the claim is about product correctness, the expected value and the actual value should both survive the run.

This is the kind of testing an enterprise security product should grow into. Automated enough to run continuously, sandboxed enough to touch dangerous paths, assisted enough to learn quickly, and grounded enough that the artifacts still decide.

Labs like this are never finished. They grow with the product and with the failures they uncover. What is already real is the part that matters: the lab can make the product fail loudly, preserve the reason, and keep risky work inside a boundary built for it.

The next post goes inside that boundary. We point the exploit lane at real vulnerable services and let it do the one thing this whole lab was built to allow: attack the product's own conclusions and see which ones do not survive contact.