Old code is an antique. Tangled history finally gets interesting.
the light is on. come in.
Source code from before the 1990s was written long before the age of AI. It is a software antique.
Hand-typed, made to work, left behind, and forgotten. We collect it from dead platforms before it disappears, and explore creative ways to use it today.
“Clean OSS is edited literature. Legacy code is field recordings.”
Collect first
Quietly, before it disappears
Find interesting uses
Creatively, without constraints
Answers come later
The time spent reading is part of the value
27B unique source files (Software Heritage)
220B lines of COBOL still in production
30-50% industrial dead code nobody understands
50%+ OSS projects die within first 4 years
OUR EXCAVATION
10,540 repos discovered, 9 languages
2,585 repos cloned, 3.7 GB of fossils
74,433 TODO/FIXME/HACK comments extracted
3.79% Perl frustration rate, highest of all languages
6.6yr code half-life, Linux Kernel
4mo code half-life, Angular (20x shorter)
3x more predictable, code vs English
Excavation records and deliverables
Sentiment analysis of comments extracted from 10,540 repositories across 9 languages. Perl's frustration rate: 3.79%. The most common word: “should”—the gap between ideal and reality.
Collection scripts, analysis pipelines, and excavation tools. TODO archaeology, sentiment analysis, antique auto-classification, and more.
An interactive exhibit of excavated comments. Human vs AI interpretation. Code archaeology quizzes. “What was this developer thinking?”—a place to enjoy questions with no answers.
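A frustration rate like the one quoted above could be computed as the share of comments that contain at least one frustration word. The sketch below assumes that definition; the word list and the function names are hypothetical, since the page does not say how the rate is actually measured.

```python
import re

# Hypothetical lexicon for illustration only; the project's real word list
# and sentiment model are not described on this page.
FRUSTRATION_WORDS = {"ugly", "broken", "stupid", "wtf", "horrible", "kludge"}


def is_frustrated(comment: str) -> bool:
    """Return True if the comment contains any word from the lexicon."""
    words = set(re.findall(r"[a-z']+", comment.lower()))
    return bool(words & FRUSTRATION_WORDS)


def frustration_rate(comments: list[str]) -> float:
    """Percentage of comments flagged as frustrated (0.0 for an empty list)."""
    if not comments:
        return 0.0
    hits = sum(is_frustrated(c) for c in comments)
    return 100.0 * hits / len(comments)
```

Computing the rate per language is then a matter of grouping comments by the language of the file they came from and calling `frustration_rate` on each group.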
Collect, interpret, share. There's always more to uncover.
74,433 comments excavated. Developer struggles fossilized in code, finally unearthed.
AI analyzes structure, humans read context. A code appraisal report series.
An interactive web app exhibiting excavated comments. Human vs AI interpretation. Code archaeology quizzes.
An antique appraiser that estimates the age, origin, and context of unknown fragments.
Track the rise and fall of paradigms across 20 years of code. A natural history of software.
Turn the rhythm and structure of old code into sound. Listen to the legacy.
☙ FOUND IN: payroll_calc.c — last modified 2003-04-12
// ============================================
// FIXME: this workaround has been here since 1998
// Original author: unknown (left company in 2001)
// Last modified: 2003-04-12
// Nobody knows why removing this breaks payroll
// ============================================
if (month == 2 && day > 28) {
    day = 28; // TODO: handle leap years properly
    // HACK: just... don't deploy in February
}
// See you space cowboy...
Do you have code like this sleeping somewhere? We are pretty good at reading it.
DISCOVER: API / Archive index
CLONE: --depth 1
EXTRACT: Metadata + cloc
SCAN: Smells + Secrets
STORE: Parquet + Raw
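The stages above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the archive's actual tooling: the function names (`clone_command`, `extract_metadata`, `scan_markers`) are hypothetical, per-extension line counting is only a rough stand-in for cloc, and secret scanning and the Parquet writer are left out to keep the example free of third-party dependencies.

```python
import re
from collections import Counter
from pathlib import Path

# Matches a TODO/FIXME/HACK marker and captures the text that follows it.
MARKER = re.compile(r"\b(TODO|FIXME|HACK)\b[:\s]*(.*)")


def clone_command(url: str, dest: str) -> list[str]:
    """CLONE: build a shallow clone command (only the latest snapshot)."""
    return ["git", "clone", "--depth", "1", url, dest]


def _source_files(repo: Path):
    """Yield regular files under repo, skipping the .git directory."""
    for path in sorted(repo.rglob("*")):
        if path.is_file() and ".git" not in path.relative_to(repo).parts:
            yield path


def extract_metadata(repo: Path) -> Counter:
    """EXTRACT: count lines per file extension (assumed stand-in for cloc)."""
    counts = Counter()
    for path in _source_files(repo):
        text = path.read_text(errors="replace")
        counts[path.suffix or "(none)"] += len(text.splitlines())
    return counts


def scan_markers(repo: Path) -> list[dict]:
    """SCAN: collect TODO/FIXME/HACK comments with file and line number."""
    rows = []
    for path in _source_files(repo):
        lines = path.read_text(errors="replace").splitlines()
        for lineno, line in enumerate(lines, start=1):
            match = MARKER.search(line)
            if match:
                rows.append({
                    "file": path.name,
                    "line": lineno,
                    "marker": match.group(1),
                    "text": match.group(2).strip(),
                })
    return rows
```

The rows returned by `scan_markers` are flat dictionaries, so they map directly onto a columnar format such as Parquet for the STORE stage. The command list from `clone_command` can be passed to `subprocess.run` once a repository URL has been discovered.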
Frequently Asked Questions
Legacy Code Archive is a software archaeology project that systematically collects and preserves legacy code from the pre-AI era (roughly the 1980s to the 2000s). We rescue code sleeping in long-forgotten platforms before it disappears, and explore creative ways to use it.
We focus on pre-AI era code. Sources include Google Code Archive (1.4M projects), CodePlex Archive (108K repositories), SourceForge (~500K projects), abandoned GitHub repositories, academic code, government OSS, and retro/demoscene works (1980s–2000s).
Old code can be a “software antique” that is often missing from modern AI training data. TODO comments preserve real developer struggles like fossils, and recurring code-smell patterns can inform today’s software quality work. Tracing paradigm shifts across decades of code also has academic value as a natural history of software.
We explore both research and creative directions: TODO Archaeology (comment fossil records), Code Archaeology AI (estimating age/origin/context of fragments), Software Natural History (paradigm shift studies), Before/After pair datasets (refactoring comparisons), Code Sonification (turning structure into sound), and “Legacy Whisperer” (training AI to understand messy code).
Yes. We welcome information about old codebases, research collaboration, and introductions to sources. Please contact us and select “Legacy Code Archive” in the contact form. We focus on code with explicit licenses in public repositories.
Software Heritage is a large-scale archive preserving tens of billions of source files. Legacy Code Archive is also about creative reuse: interpreting and exhibiting old code through methods like TODO archaeology and code sonification, and discovering new value in “antiques,” not only storing them.
GET IN TOUCH
Whether it’s about collection/research, or “please do something about this code,” we’d love to hear from you.
CONTACT US→
LEGACY CODE ARCHIVE
The antiques dealer doesn't explain.
They only say, “This is good, isn't it?”