Experiments modernising old C code
I wanted to see what modern compilers make of old C code: how much louder they get, what they flag, and how much work it is to clean up under a stricter, security-conscious flag set. The experiment also doubles as a test of AI coding agents on legacy modernisation. Two approaches:
- Fix in place. Turn on strict warnings, have the agent resolve them. The cleanup tends to surface latent security bugs, undefined behaviour, and portability problems as a side effect.
- Rewrite from scratch. Keep the original as reference, have the agent reimplement in modern C17, validate with tests.
Rewriting straight into Rust or Go would be its own experiment. Here the focus is C-to-C: how well current models produce clean, modern C while preserving the behaviour of a 20-year-old codebase.
Fixing the existing code
The base is Blackbag 0.9.1 from Matasano Security, dated 20051201. It is a collection of network testing and security analysis tools for protocol research, penetration testing, and traffic analysis.
The original build used:

    CFLAGS = -Wall -g

The updated profile adds stricter warnings plus hardening:

    CFLAGS = -Wall -Wextra -Wformat=2 -Wshadow -Wstrict-prototypes -Wmissing-prototypes \
             -fno-strict-aliasing -fstack-protector-strong \
             -O2 -g -D_FORTIFY_SOURCE=2

I used Claude Code in its default configuration with hooks added to record time and token usage.
The cleanup took 1h 51m of agent time plus ~30m writing documentation. 702 assistant turns, ~70.2M tokens (most of it cache reads, ~67.6M). Uncached input was 7,635 tokens; output was 678,756. Total cost: ~$37.75.
Starting point: ~292 compiler warnings. End state: zero warnings plus an
integration test suite with 93 passing, 0 failing, 1 skipped. The agent
worked one warning class at a time: -Wsign-compare, -Wunused-parameter,
-Wmissing-prototypes, -Wstrict-prototypes, -Wincompatible-pointer-types,
-Wunused-result, and so on.
By raw count: 41 fallthrough warnings (intentional), 17 unused parameters, 12
missing prototypes, 11 signed/unsigned comparisons, 8 strict-prototype issues,
at least 7 pointer/integer truncation bugs, 6 ignored I/O results.
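The fallthrough warnings are a case where the compiler flags intent rather than a bug: GCC's -Wextra enables -Wimplicit-fallthrough, so every case that runs into the next one is reported until it is annotated. How the agent resolved them is not shown here. One common approach is an explicit annotation, sketched below under the assumption of a hypothetical FALLTHROUGH macro and a hypothetical pack_tail function (neither comes from Blackbag):

```c
#include <stddef.h>
#include <stdint.h>

/* FALLTHROUGH is a hypothetical helper macro, not taken from Blackbag.
 * Compilers that support the statement attribute get it; everywhere
 * else the macro expands to an empty statement. */
#if defined(__has_attribute)
#  if __has_attribute(fallthrough)
#    define FALLTHROUGH __attribute__((fallthrough))
#  endif
#endif
#ifndef FALLTHROUGH
#  define FALLTHROUGH ((void)0)
#endif

/* Pack the trailing 1..3 bytes of a buffer into one word, least
 * significant byte first. Each case deliberately continues into the
 * next one. Any other length yields 0. */
static uint32_t pack_tail(const uint8_t *p, size_t len)
{
    uint32_t v = 0;

    switch (len) {
    case 3:
        v |= (uint32_t)p[2] << 16;
        FALLTHROUGH;
    case 2:
        v |= (uint32_t)p[1] << 8;
        FALLTHROUGH;
    case 1:
        v |= (uint32_t)p[0];
        break;
    default:
        break;
    }
    return v;
}
```

The annotation keeps the behaviour identical while making the intent visible to both the compiler and the next reader.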
Several findings were not just style. The cleanup surfaced 64-bit pointer truncation, a use-after-free, undefined behaviour from shifting a negative value, implicit function declarations causing pointer corruption, missing string termination, unsafe variadic argument handling, unchecked I/O return values, and incorrect process-handle cleanup.
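Two of these bug classes are easy to illustrate in isolation. The sketch below is not taken from the Blackbag sources; ptr_to_cookie, cookie_to_ptr, and shl_bits are hypothetical names chosen for the example. It shows the usual fixes: carry pointers in uintptr_t rather than int, and do bit shifts in the unsigned domain.

```c
#include <stdint.h>

/* Hypothetical helpers illustrating two of the bug classes above. */

/* Round-trip a pointer through an integer. uintptr_t is defined to hold
 * any valid void pointer; storing the value in an int instead would
 * drop the upper 32 bits on a typical 64-bit (LP64) target. */
static uintptr_t ptr_to_cookie(void *p)
{
    return (uintptr_t)p;
}

static void *cookie_to_ptr(uintptr_t cookie)
{
    return (void *)cookie;
}

/* Left-shift the bit pattern of a signed value. Shifting a negative
 * signed integer is undefined behaviour, so the value is converted to
 * unsigned first. The caller must keep n below 32. */
static uint32_t shl_bits(int32_t v, unsigned n)
{
    return (uint32_t)v << n;
}
```

Code written for 32-bit hosts often stored pointers in int because both were the same width at the time; the truncation only shows up once the binary is built for a 64-bit target.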
Rewriting the existing code
Skip the warnings and ask an agent to write a clean reimplementation from scratch. Blackbag stays as the reference; the agent reimplements the tools in C17 and uses tests for behavioural equivalence.
Three agents, same prompt, all in “YOLO mode” (free to read, build, test, fix, iterate without per-step approval).
| Agent | Version | Model | Mode |
|---|---|---|---|
| Pi | v0.73.0 | Not specified | YOLO only (by design) |
| Codex | v0.128.0 | gpt-5.5, xhigh, fast | YOLO on |
| Claude Code | 2.1.131 | Sonnet 4.6, Advisor mode with Opus 4.7 | Full Access, permission-skip enabled |
Latest versions, default models, no skills or subagents, clean state. Identical prompt:
- create a plan in PLAN.md for the following, dont implement anything until asked
- under docs/blackbag-0.9.1/ is the original source code for a number of tools used for miscellany of network testing and security analysis tools for protocol research, penetration testing, and traffic analysis
- reimplement the tools in modern C (C17), and not in C23 because it is not fully supported by clang and gcc
- must support both big endian and little endian architectures
- must support both 32 and 64 bit architecture
- create a Makefile for compiling, building and testing the tools
- enable all best-practice compiler warnings and flags
- use gcc for development and testing
- structure the new code like the existing code with one tool per .c file
- create test cases for all code, use golang format like asn1_test.c, run via make test
- all functions must be commented, comment critical code and special cases
- always prefer clean and safe code over weird or exotic optimisations
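The endianness and word-size requirements in the prompt mostly come down to one habit: never reinterpret raw bytes as a wider integer. A minimal sketch of that habit, using hypothetical helper names (load_be32, load_le32, store_be32) that do not appear in Blackbag or in any of the three reimplementations:

```c
#include <stdint.h>

/* Hypothetical helpers. Values are assembled byte by byte instead of
 * casting the buffer to uint32_t *, so the result depends neither on
 * host byte order nor on the alignment of p. */

/* Read a 32-bit big-endian (network order) value. */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Read a 32-bit little-endian value. */
static uint32_t load_le32(const uint8_t *p)
{
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}

/* Write a 32-bit value in big-endian order. */
static void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}
```

Fixed-width types from stdint.h cover the 32/64-bit half of the requirement: the helpers behave the same whatever the width of int or long on the host.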
After the plan was written, agents switched to execution mode. Only Claude Code asked clarifying questions:
- What test format does “golang format like asn1_test.c” mean? (No file named asn1_test.c exists in the original.)
- Which tools are in scope?
- How should the format/ and jenkins-hash/ sub-libraries be handled?
Pi and Codex picked an answer and went.
Results
Tool parity
The original has 38 C programs: 32 in the root Makefile, an extra
sextract not in the Makefile, and 5 ASN.1 utilities under asn/. Plus two
shell wrappers (bkb, asn) and one data file (sub.macros).
| Component | Original | Claude Code | Codex | Pi |
|---|---|---|---|---|
| Root Makefile targets | 32 | 32 | 32 | 32 |
| Extra root tool (sextract) | 1 | 1 | 0 | 1 |
| ASN.1 tools (asn/) | 5 | 5 | 0 | 0 |
| Total C programs | 38 | 38 | 32 | 33 |
| Wrapper scripts | 2 | 0 | 0 | 0 |
| Data files | 1 | 0 | 0 | 0 |
Claude Code got to 100% by reading the source rather than enumerating
Makefile targets, picking up sextract and all five ASN.1 utilities.
Pi stopped at 33/38. Its own PLAN.md had “Phase 6: ASN.1” and “Phase 7:
TCP reassembly” written down, but it never returned to them.
Codex was the strictest reader of the prompt: exactly the 32 Makefile
targets, nothing more. Whether that is discipline or under-scoping depends on
how you read the prompt. None of the three reproduced the wrappers or sub.macros.
Compiler flags
The original used -Wall -g. All three reimplementations did much better, in
different ways.
| Agent | Compile-time hardening | Static analysis warnings | Runtime sanitisers |
|---|---|---|---|
| Claude Code | -fstack-protector-strong, -D_FORTIFY_SOURCE=2, -Wstack-protector, -Wnull-dereference | -Wall -Wextra -Wpedantic -Wshadow -Wformat=2 -Wformat-security -Wconversion | None |
| Codex | -fno-common | -Wall -Wextra -Wpedantic -Wconversion -Wsign-conversion -Wshadow -Wpointer-arith -Wformat=2 -Wundef -Wcast-align -Wwrite-strings -Wvla | Optional via ASAN=1 UBSAN=1 |
| Pi | None | -Wall -Wextra -Wpedantic -Wshadow -Wformat=2 -Wstrict-prototypes -Wmissing-prototypes | On by default in debug builds: -fsanitize=address,undefined |
Claude Code has the strongest compile-time mitigations and is the only one
that links libevent, libssl, libcrypto, and libpcap, the dependencies
the network tools need. Codex has the most aggressive static analysis:
-Wconversion plus -Wsign-conversion for implicit signedness changes,
-Wcast-align for alignment errors, -Wvla to ban variable-length arrays
outright. Pi runs with AddressSanitizer + UBSan on by default in debug
builds, so every test run is monitored at runtime with no opt-in.
In short: Claude Code's flags make a memory bug harder to exploit, Codex's make it more likely to be rejected at compile time, and Pi's catch it the moment it executes under test.
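To make the static-analysis column concrete: an assignment such as `uint16_t n = len;` with a size_t len is legal C and silently truncates, and -Wconversion reports it. A minimal sketch of the kind of fix that warning pushes towards, assuming a hypothetical helper named len_to_u16 that is not part of any of the three codebases:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper. Narrow a size_t to a 16-bit length field with an
 * explicit range check instead of an implicit, truncating conversion.
 * Returns false and leaves *out untouched if the value does not fit. */
static bool len_to_u16(size_t len, uint16_t *out)
{
    if (len > UINT16_MAX) {
        return false;       /* would not fit in a 16-bit length field */
    }
    *out = (uint16_t)len;   /* explicit cast, already range-checked */
    return true;
}
```

The cast alone would also silence the warning, which is why the range check matters: the warning points at the line, but the fix has to decide what happens when the value is too large.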
Code style
The three approaches show most clearly in dynamic buffer growth, the operation behind a large share of memory-safety CVEs in C code that processes untrusted input.
Claude Code writes idiomatic C17 with end-pointer arithmetic, faithful to the original:

    if ((size_t)(r->ep - r->tp) < len) {
        size_t tot = (size_t)(r->ep - r->bp);
        if (tot < len) tot = len;
        uint8_t *nbp = realloc(r->bp, tot * 2); // tot * 2 is not overflow-checked
        if (nbp) {
            r->bp = nbp;
            r->ep = r->bp + tot * 2;
        }
    }

Correct as long as tot * 2 does not overflow size_t, which is left to
developer discipline.
Codex saw that buffer-manipulation logic was repeated across many of the
tools and extracted it into a structured buffer library. The same pattern
BoringSSL uses for CBB and s2n-tls uses for its stuffer. The growth helper at
the heart of that library checks for overflow explicitly:

    while (cap < need) {
        if (cap > (SIZE_MAX / 2U)) { // explicit overflow guard
            bb_die("buffer size overflow");
        }
        cap *= 2U;
    }
    buf->data = bb_xrealloc(buf->data, cap);

Allocations go through bb_xmalloc / bb_xrealloc wrappers that abort on
NULL; errors go through bb_die. One buffer abstraction, audited once, used
everywhere.
Pi uses a similar tbuf_ensure helper but abort()s on OOM.
ASan-on-by-default catches the rest at runtime.
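The growth strategies above can be combined into one self-contained sketch. It is not code from any of the three agents: struct gbuf, gbuf_reserve, and gbuf_append are hypothetical names, and the choice to return false rather than abort is an assumption made so the failure path is easy to observe.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer; names are illustrative only. */
struct gbuf {
    uint8_t *data;
    size_t   len;   /* bytes in use */
    size_t   cap;   /* bytes allocated */
};

/* Make room for `extra` more bytes. Returns false on arithmetic
 * overflow or allocation failure and leaves the buffer unchanged. */
static bool gbuf_reserve(struct gbuf *b, size_t extra)
{
    if (extra > SIZE_MAX - b->len) {
        return false;               /* len + extra would wrap around */
    }
    size_t need = b->len + extra;
    if (need <= b->cap) {
        return true;                /* already large enough */
    }

    size_t cap = b->cap ? b->cap : 16;
    while (cap < need) {
        if (cap > SIZE_MAX / 2) {
            cap = need;             /* cannot double safely; use exact size */
            break;
        }
        cap *= 2;
    }

    uint8_t *p = realloc(b->data, cap);
    if (p == NULL) {
        return false;               /* old block is still valid */
    }
    b->data = p;
    b->cap = cap;
    return true;
}

/* Append n bytes from src, growing the buffer if needed. */
static bool gbuf_append(struct gbuf *b, const void *src, size_t n)
{
    if (!gbuf_reserve(b, n)) {
        return false;
    }
    if (n > 0) {
        memcpy(b->data + b->len, src, n);
    }
    b->len += n;
    return true;
}
```

Starting from a zero-initialised struct gbuf, appending 5 bytes allocates 16, and appending 16 more doubles the capacity to 32. A fail-fast variant would replace the two `return false` paths with a call that aborts.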
Allocation handling shows the same split. Claude Code null-checks every malloc
site by hand:

    out = malloc(need + 1);
    if (out == NULL) {
        fprintf(stderr, "b64: malloc failed\n");
        return 1;
    }

Codex removes the need for per-site checks: direct malloc does not appear
outside the wrapper.

    void *bb_xmalloc(size_t size) {
        void *p = malloc(size ? size : 1U);
        if (p == NULL) {
            bb_die("out of memory");
        }
        return p;
    }

Other notes. Claude Code is the only one that kept the original’s custom
“Fake64” alphabet; Pi swapped it for standard RFC 4648 Base64 (cleaner, no
longer bit-compatible with the original). Codex is the most consistent user of
<stdbool.h> and <stdint.h> throughout. Claude Code returns error codes from
individual functions and leaves callers to handle them; Codex and Pi fail fast.
Ranking on defensive style: Codex, Pi, Claude Code. Ranking on feature completeness: Claude Code, Pi, Codex. The inversion is not an accident.
Time and token cost
| Metric | Claude Code | Codex | Pi |
|---|---|---|---|
| Output tokens | ~875,000+ | ~119,718 | ~144,000 |
| Active work time | ~2h 30m+ | ~2h | ~1h 45m |
| Tool parity | 100% (38/38) | 84% (32/38) | 87% (33/38) |
Claude Code used roughly 6× to 7× the output tokens of the others (~875,000+ against ~119,718 and ~144,000). The delta lines up with the difference in scope: those tokens went into the ASN.1 suite and the network utilities Codex and Pi skipped. Wall-clock time was closer than the token gap suggests: Claude Code ran about 30 to 45 minutes longer than Codex and Pi, not 2× or 3× longer. Codex and Pi stopped short, and the last 15% of any codebase is the hardest 15%.
Observations and conclusions
Three agents, one codebase, one prompt. The interesting result is not that one of them won. It is that they split cleanly along three axes:
| If you want… | Use |
|---|---|
| Full feature parity with original Blackbag | Claude Code |
| Production-grade defensive C style | Codex |
| Aggressive runtime guardrails for development | Pi |
The agents differ less in how well they write C than in how they interpret scope and risk. Codex and Pi both stopped well short of complete, but neither did so randomly. Codex stopped at the line drawn in the Makefile. Pi stopped at the boundary where its own plan got harder. Claude Code went past both lines.
Given the same prompt, the three made very different judgement calls about where “done” is. That is harder to evaluate than warning counts, and probably the part that matters more.