Martin's Blog

Experiments modernising old C code

I wanted to see what modern compilers make of old C code: how much louder they get, what they flag, and how much work it is to clean up under a stricter, security-conscious flag set. The experiment also doubles as a test of AI coding agents on legacy modernisation. I tried two approaches: fixing the existing code in place, and rewriting it from scratch.

Rewriting straight into Rust or Go would be its own experiment. Here the focus is C-to-C: how well do current models produce clean, modern C while preserving the behaviour of a 20-year-old codebase.

Fixing the existing code

The base is Blackbag 0.9.1 from Matasano Security, dated 20051201: a collection of network testing and security analysis tools for protocol research, penetration testing, and traffic analysis.

The original build used:

    CFLAGS = -Wall -g

The updated profile adds stricter warnings plus hardening:

    CFLAGS = -Wall -Wextra -Wformat=2 -Wshadow -Wstrict-prototypes -Wmissing-prototypes \
             -fno-strict-aliasing -fstack-protector-strong \
             -O2 -g -D_FORTIFY_SOURCE=2

I used Claude Code in its default configuration with hooks added to record time and token usage.

The cleanup took 1h 51m of agent time plus ~30m writing documentation. 702 assistant turns, ~70.2M tokens (most of it cache reads, ~67.6M). Uncached input was 7,635 tokens; output was 678,756. Total cost: ~$37.75.

Starting point: ~292 compiler warnings. End state: zero warnings plus an integration test suite with 93 passing, 0 failing, 1 skipped. The agent worked one warning class at a time: -Wsign-compare, -Wunused-parameter, -Wmissing-prototypes, -Wstrict-prototypes, -Wincompatible-pointer-types, -Wunused-result, and so on.

By raw count: 41 fallthrough warnings (intentional), 17 unused parameters, 12 missing prototypes, 11 signed/unsigned comparisons, 8 strict-prototype issues, at least 7 pointer/integer truncation bugs, 6 ignored I/O results.
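Most of these classes have a small, mechanical fix. The sketch below shows three of them. It is illustrative only and not taken from Blackbag: the function names and the FALLTHROUGH macro are made up for the example.

```c
#include <stddef.h>

/* Mark intentional fallthrough. GCC 7+ accepts the statement attribute;
 * other compilers get a no-op (assumption: they do not warn here). */
#if defined(__GNUC__) && __GNUC__ >= 7
#define FALLTHROUGH __attribute__((fallthrough))
#else
#define FALLTHROUGH ((void)0)
#endif

/* -Wimplicit-fallthrough: each level also enables the levels below it,
 * so falling through is the intended behaviour. */
static int enabled_channels(int level)
{
    int n = 0;
    switch (level) {
    case 3:
        n++;
        FALLTHROUGH;
    case 2:
        n++;
        FALLTHROUGH;
    case 1:
        n++;
        break;
    default:
        break;
    }
    return n;
}

/* -Wunused-parameter: the callback signature is fixed, ctx is not needed. */
static int on_readable(int fd, void *ctx)
{
    (void)ctx;
    return fd >= 0;
}

/* -Wsign-compare: rule out negatives before comparing against a size_t. */
static int index_in_range(int i, size_t len)
{
    return i >= 0 && (size_t)i < len;
}
```

None of these changes behaviour; they only make the intent visible to the compiler.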

Several findings were not just style. The cleanup surfaced 64-bit pointer truncation, a use-after-free, undefined behaviour from shifting a negative value, implicit function declarations causing pointer corruption, missing string termination, unsafe variadic argument handling, unchecked I/O return values, and incorrect process-handle cleanup.
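To make two of these bug classes concrete, here is a minimal sketch of the usual fixes for pointer truncation and missing string termination. It is not Blackbag code; ptr_to_key, key_to_ptr and copy_name are hypothetical names chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Pointer truncation: on LP64 targets a pointer is 64 bits wide and an
 * int is 32, so `(unsigned int)p` drops the upper half. uintptr_t is the
 * integer type intended for this round trip. */
static uintptr_t ptr_to_key(const void *p)
{
    return (uintptr_t)p;
}

static void *key_to_ptr(uintptr_t k)
{
    return (void *)k;
}

/* Missing termination: strncpy() leaves dst unterminated when src fills
 * the buffer. This copy always terminates and reports truncation.
 * Returns 0 if src fit completely, -1 if it was cut or dstsz is 0. */
static int copy_name(char *dst, size_t dstsz, const char *src)
{
    if (dstsz == 0)
        return -1;
    size_t n = strlen(src);
    size_t c = n < dstsz - 1 ? n : dstsz - 1;
    memcpy(dst, src, c);
    dst[c] = '\0';
    return n < dstsz ? 0 : -1;
}
```

Both bugs are silent on a 32-bit build with short inputs, which is roughly how they survive for twenty years.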

Rewriting the existing code

Skip the warnings and ask an agent to write a clean reimplementation from scratch. Blackbag stays as the reference; the agent reimplements the tools in C17 and uses tests for behavioural equivalence.

Three agents, same prompt, all in “YOLO mode” (free to read, build, test, fix, iterate without per-step approval).

| Agent | Version | Model | Mode |
|---|---|---|---|
| Pi | v0.73.0 | Not specified | YOLO only (by design) |
| Codex | v0.128.0 | gpt-5.5, xhigh, fast | YOLO on |
| Claude Code | 2.1.131 | Sonnet 4.6, Advisor mode with Opus 4.7 | Full Access, permission-skip enabled |

Latest versions, default models, no skills or subagents, clean state. Identical prompt:

  • create a plan in PLAN.md for the following, dont implement anything until asked
  • under docs/blackbag-0.9.1/ is the original source code for a number of tools used for miscellany of network testing and security analysis tools for protocol research, penetration testing, and traffic analysis
  • reimplement the tools in modern C (C17), and not in C23 because it is not fully supported by clang and gcc
  • must support both big endian and little endian architectures
  • must support both 32 and 64 bit architecture
  • create a Makefile for compiling, building and testing the tools
  • enable all best-practice compiler warnings and flags
  • use gcc for development and testing
  • structure the new code like the existing code with one tool per .c file
  • create test cases for all code, use golang format like asn1_test.c, run via make test
  • all functions must be commented, comment critical code and special cases
  • always prefer clean and safe code over weird or exotic optimisations
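Two of the prompt's requirements are easy to state and easy to get wrong: code that behaves the same on big- and little-endian, 32- and 64-bit hosts, and tests in the Go style. A minimal sketch of both follows. It assumes "golang format" means table-driven tests (I have not reproduced asn1_test.c here), and load_be32, store_be32 and test_be32_table are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit big-endian value byte by byte. No pointer casts and only
 * fixed-width types, so the result does not depend on host byte order or
 * word size. The casts to uint32_t happen before the shift, which avoids
 * shifting into the sign bit of a promoted int. */
static uint32_t load_be32(const uint8_t b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Write a 32-bit value in big-endian order. */
static void store_be32(uint8_t b[4], uint32_t v)
{
    b[0] = (uint8_t)(v >> 24);
    b[1] = (uint8_t)(v >> 16);
    b[2] = (uint8_t)(v >> 8);
    b[3] = (uint8_t)v;
}

/* Table-driven test: one row per case, one loop over the table.
 * Returns the number of failing rows. */
static int test_be32_table(void)
{
    static const struct {
        uint8_t  in[4];
        uint32_t want;
    } cases[] = {
        { {0x00, 0x00, 0x00, 0x00}, 0x00000000u },
        { {0x00, 0x00, 0x00, 0x01}, 0x00000001u },
        { {0x12, 0x34, 0x56, 0x78}, 0x12345678u },
        { {0xff, 0xff, 0xff, 0xff}, 0xffffffffu },
    };
    int failed = 0;

    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        uint8_t out[4];
        uint32_t got = load_be32(cases[i].in);

        store_be32(out, got); /* round trip back to bytes */
        if (got != cases[i].want || load_be32(out) != got)
            failed++;
    }
    return failed;
}
```

Adding a case is one more row in the table, which is the main attraction of the format.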

After the plan was written, the agents switched to execution mode. Only Claude Code asked clarifying questions; Pi and Codex picked an answer and went.

Results

Tool parity

The original has 38 C programs: 32 in the root Makefile, an extra sextract not in the Makefile, and 5 ASN.1 utilities under asn/. Plus two shell wrappers (bkb, asn) and one data file (sub.macros).

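| Component | Original | Claude Code | Codex | Pi |
|---|---|---|---|---|
| Root Makefile targets | 32 | 32 | 32 | 32 |
| Extra root tool (sextract) | 1 | 1 | 0 | 1 |
| ASN.1 tools (asn/) | 5 | 5 | 0 | 0 |
| Total C programs | 38 | 38 | 32 | 33 |
| Wrapper scripts | 2 | 0 | 0 | 0 |
| Data files | 1 | 0 | 0 | 0 |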

Claude Code got to 100% by reading the source rather than enumerating Makefile targets, picking up sextract and all five ASN.1 utilities.

Pi stopped at 33/38. Its own PLAN.md had “Phase 6: ASN.1” and “Phase 7: TCP reassembly” written down, but it never returned to them.

Codex was the strictest reader of the prompt: exactly the 32 Makefile targets, nothing more. Discipline or under-scoping depends on how you read the prompt. None of the three reproduced the wrappers or sub.macros.

Compiler flags

The original used -Wall -g. All three reimplementations did much better, in different ways.

| Agent | Compile-time hardening | Static analysis warnings | Runtime sanitisers |
|---|---|---|---|
| Claude Code | -fstack-protector-strong, -D_FORTIFY_SOURCE=2, -Wstack-protector, -Wnull-dereference | -Wall -Wextra -Wpedantic -Wshadow -Wformat=2 -Wformat-security -Wconversion | None |
| Codex | -fno-common | -Wall -Wextra -Wpedantic -Wconversion -Wsign-conversion -Wshadow -Wpointer-arith -Wformat=2 -Wundef -Wcast-align -Wwrite-strings -Wvla | Optional via ASAN=1 UBSAN=1 |
| Pi | None | -Wall -Wextra -Wpedantic -Wshadow -Wformat=2 -Wstrict-prototypes -Wmissing-prototypes | On by default in debug builds: -fsanitize=address,undefined |

Claude Code has the strongest compile-time mitigations and is the only one that links libevent, libssl, libcrypto, and libpcap, the dependencies the network tools need. Codex has the most aggressive static analysis: -Wconversion plus -Wsign-conversion for implicit signedness changes, -Wcast-align for alignment errors, -Wvla to ban variable-length arrays outright. Pi runs with AddressSanitizer + UBSan on by default in debug builds, so every test run is monitored at runtime with no opt-in.

Claude Code prevents the bug from being exploitable. Codex prevents it from compiling. Pi catches it the moment it happens.
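As an illustration of the compile-time end of that spectrum: with -Wconversion, assigning a long to a uint16_t draws a warning, which forces the range check to be written down. The sketch below is a hypothetical port parser, not code from any of the three reimplementations.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a decimal TCP/UDP port. Writing `uint16_t port = v;` directly
 * would silently wrap 70000 to 4464; -Wconversion flags that assignment,
 * so the range check and the cast are made explicit here. */
static bool parse_port(const char *s, uint16_t *out)
{
    char *end = NULL;

    errno = 0;
    long v = strtol(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0')
        return false; /* overflow, empty input, or trailing junk */
    if (v < 0 || v > UINT16_MAX)
        return false; /* does not fit in 16 bits */
    *out = (uint16_t)v;
    return true;
}
```

A sanitiser would not flag the unchecked version at all, because unsigned narrowing is well defined; this is a class of bug that only the warning catches.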

Code style

The three approaches show most clearly in dynamic buffer growth, the operation behind a large share of the memory-safety CVEs in C code that processes untrusted input.

Claude Code writes idiomatic C17 with end-pointer arithmetic, faithful to the original:

    if ((size_t)(r->ep - r->tp) < len) {
        size_t tot = (size_t)(r->ep - r->bp);
        if (tot < len) tot = len;
        uint8_t *nbp = realloc(r->bp, tot * 2);   // tot * 2 is not overflow-checked
        if (nbp) {
            r->bp = nbp;
            r->ep = r->bp + tot * 2;
        }
    }

Correct as long as tot * 2 does not overflow size_t, which is left to developer discipline.
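One way to close that gap while keeping the end-pointer layout is sketched below. This is my own variant, not Claude Code's output. It assumes bp is the base of the allocation, tp the current write position and ep one past the end; struct rbuf and rbuf_reserve are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct rbuf {
    uint8_t *bp; /* base of the allocation (assumed meaning) */
    uint8_t *tp; /* current write position (assumed meaning) */
    uint8_t *ep; /* one past the end of the allocation */
};

/* Make room for len more bytes after tp.
 * Returns 0 on success, -1 on size overflow or allocation failure;
 * on failure the buffer is left unchanged. */
static int rbuf_reserve(struct rbuf *r, size_t len)
{
    /* Avoid pointer arithmetic on NULL for an empty buffer. */
    size_t used = r->bp ? (size_t)(r->tp - r->bp) : 0;
    size_t cap  = r->bp ? (size_t)(r->ep - r->bp) : 0;

    if (cap - used >= len)
        return 0;                 /* already enough room */
    if (len > SIZE_MAX - used)
        return -1;                /* used + len would overflow */

    size_t need = used + len;
    size_t ncap = cap ? cap : 64;
    while (ncap < need) {
        if (ncap > SIZE_MAX / 2) {
            ncap = need;          /* cannot double; take the exact size */
            break;
        }
        ncap *= 2;
    }

    uint8_t *nbp = realloc(r->bp, ncap);
    if (nbp == NULL)
        return -1;
    r->bp = nbp;
    r->tp = nbp + used;           /* re-derive tp, the block may have moved */
    r->ep = nbp + ncap;
    return 0;
}
```

Unlike the excerpt, this version reports failure to the caller and recomputes tp after realloc.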

Codex saw that buffer-manipulation logic was repeated across many of the tools and extracted it into a structured buffer library, the same pattern BoringSSL uses for CBB and s2n-tls uses for its stuffer. The growth helper at the heart of that library checks for overflow explicitly:

    while (cap < need) {
        if (cap > (SIZE_MAX / 2U)) {       // explicit overflow guard
            bb_die("buffer size overflow");
        }
        cap *= 2U;
    }
    buf->data = bb_xrealloc(buf->data, cap);

Allocations go through bb_xmalloc / bb_xrealloc wrappers that abort on NULL; errors go through bb_die. One buffer abstraction, audited once, used everywhere.
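To show how such a library reads from the caller's side, here is a minimal sketch of an append operation built on the same growth loop. The names bb_die and bb_xrealloc come from the excerpts above, but their bodies here are my assumptions, and struct bb_buf and bb_buf_append are hypothetical; Codex's actual API may look different.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct bb_buf {
    uint8_t *data;
    size_t   len; /* bytes in use */
    size_t   cap; /* bytes allocated */
};

/* Assumed behaviour: print the message and exit. */
static void bb_die(const char *msg)
{
    fprintf(stderr, "fatal: %s\n", msg);
    exit(EXIT_FAILURE);
}

/* realloc() that never returns NULL. */
static void *bb_xrealloc(void *p, size_t size)
{
    void *q = realloc(p, size ? size : 1U);
    if (q == NULL)
        bb_die("out of memory");
    return q;
}

/* Append n bytes, growing by doubling. Every size computation is
 * checked before it is performed. */
static void bb_buf_append(struct bb_buf *buf, const void *src, size_t n)
{
    if (n > SIZE_MAX - buf->len)
        bb_die("buffer size overflow");

    size_t need = buf->len + n;
    if (need > buf->cap) {
        size_t cap = buf->cap ? buf->cap : 16U; /* start non-zero so doubling terminates */
        while (cap < need) {
            if (cap > SIZE_MAX / 2U)
                bb_die("buffer size overflow");
            cap *= 2U;
        }
        buf->data = bb_xrealloc(buf->data, cap);
        buf->cap = cap;
    }
    if (n > 0)
        memcpy(buf->data + buf->len, src, n);
    buf->len = need;
}
```

Callers never see a size calculation or a NULL check; a zero-initialised struct bb_buf is a valid empty buffer.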

Pi uses a similar tbuf_ensure helper but abort()s on OOM. ASan-on-by-default catches the rest at runtime.

Allocation handling shows the same split. Claude Code null-checks every malloc site by hand:

    out = malloc(need + 1);
    if (out == NULL) {
        fprintf(stderr, "b64: malloc failed\n");
        return 1;
    }

Codex makes a forgotten check impossible: direct malloc does not appear outside the wrapper.

    void *bb_xmalloc(size_t size) {
        void *p = malloc(size ? size : 1U);
        if (p == NULL) {
            bb_die("out of memory");
        }
        return p;
    }
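The difference is easiest to see at the call site. The sketch below puts the two styles side by side on a string duplicate. dup_checked and dup_or_die are hypothetical helpers, and the body of bb_die is an assumption, since only its name appears above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed behaviour: print the message and exit. */
static void bb_die(const char *msg)
{
    fprintf(stderr, "fatal: %s\n", msg);
    exit(EXIT_FAILURE);
}

/* Same shape as the wrapper shown above. */
static void *bb_xmalloc(size_t size)
{
    void *p = malloc(size ? size : 1U);
    if (p == NULL)
        bb_die("out of memory");
    return p;
}

/* Style 1: check at every site and pass the failure up.
 * The caller must remember to test for NULL. */
static char *dup_checked(const char *s)
{
    size_t n = strlen(s) + 1;
    char *out = malloc(n);
    if (out == NULL)
        return NULL;
    memcpy(out, s, n);
    return out;
}

/* Style 2: fail fast. The result is never NULL, so there is
 * nothing for the caller to forget. */
static char *dup_or_die(const char *s)
{
    size_t n = strlen(s) + 1;
    char *out = bb_xmalloc(n);
    memcpy(out, s, n);
    return out;
}
```

The first style suits a library that must let its caller recover; the second suits short-lived command-line tools, where exiting on out-of-memory is usually the only sensible response anyway.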

Other notes. Claude Code is the only one that kept the original’s custom “Fake64” alphabet; Pi swapped it for standard RFC 4648 Base64 (cleaner, no longer bit-compatible with the original). Codex is the most consistent user of <stdbool.h> and <stdint.h> throughout. Claude Code returns error codes from individual functions and leaves callers to handle them; Codex and Pi fail fast.
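The compatibility cost of swapping the alphabet can be shown with a small encoder that takes the alphabet as a parameter. This is a sketch, not code from either reimplementation. I am not reproducing the Fake64 table here; the URL-safe alphabet from RFC 4648 stands in for "any non-standard alphabet", and b64_encode is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard and URL-safe alphabets from RFC 4648; they differ only in
 * the characters for values 62 and 63. */
static const char B64_STD[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
static const char B64_URL[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

/* Encode n bytes using a 64-character alphabet, with '=' padding.
 * out must have room for 4 * ((n + 2) / 3) + 1 bytes.
 * Returns the number of characters written, excluding the NUL. */
static size_t b64_encode(const char *alpha, const uint8_t *in, size_t n,
                         char *out)
{
    size_t o = 0;

    for (size_t i = 0; i < n; i += 3) {
        size_t left = n - i;
        uint32_t v = (uint32_t)in[i] << 16;

        if (left > 1)
            v |= (uint32_t)in[i + 1] << 8;
        if (left > 2)
            v |= (uint32_t)in[i + 2];

        out[o++] = alpha[(v >> 18) & 0x3f];
        out[o++] = alpha[(v >> 12) & 0x3f];
        out[o++] = left > 1 ? alpha[(v >> 6) & 0x3f] : '=';
        out[o++] = left > 2 ? alpha[v & 0x3f] : '=';
    }
    out[o] = '\0';
    return o;
}
```

For the bytes 0xfb 0xff 0xff the standard alphabet gives "+///" and the URL-safe one gives "-___". The encoder logic is identical, but anything that decodes the old output will reject or misread the new one.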

Ranking on defensive style: Codex, Pi, Claude Code. Ranking on feature completeness: Claude Code, Pi, Codex. The inversion is not an accident.

Time and token cost

| Metric | Claude Code | Codex | Pi |
|---|---|---|---|
| Output tokens | ~875,000+ | ~119,718 | ~144,000 |
| Active work time | ~2h 30m+ | ~2h | ~1h 45m |
| Tool parity | 100% (38/38) | 84% (32/38) | 87% (33/38) |

Claude Code used roughly 6 to 7× the output tokens of the others. The delta lines up with the difference in scope: those tokens went into the ASN.1 suite and the network utilities Codex and Pi skipped. Wall-clock time was closer than the token gap suggests. Claude Code ran for about 30 to 45 minutes longer than Codex and Pi, not 2× or 3× longer. Codex and Pi stopped short, and the last 15% of any codebase is the hardest 15%.

Observations and conclusions

Three agents, one codebase, one prompt. The interesting result is not that one of them won. It is that they split cleanly along three axes:

| If you want… | Use |
|---|---|
| Full feature parity with original Blackbag | Claude Code |
| Production-grade defensive C style | Codex |
| Aggressive runtime guardrails for development | Pi |

The agents differ less in how well they write C than in how they interpret scope and risk. Codex and Pi both stopped well short of complete, but neither did so randomly. Codex stopped at the line drawn in the Makefile. Pi stopped at the boundary where its own plan got harder. Claude Code went past both lines.

Given the same prompt, the three made very different judgement calls about where “done” is. That is harder to evaluate than warning counts, and probably the part that matters more.

#C #Modernisation #Legacy-Code #Ai-Assisted #Claude #Codex #Pi #Blackbag #Gcc #Sanitizers