PHASE: COLLECTION

LEGACY
CODE
ARCHIVE

Old code is an antique. Tangled history finally gets interesting.

Software Antiques Collection — Where tangled history finally gets interesting.

// TODO: fix this before release (2003-04-12) · /* HACK: temporary workaround for IE6 */ · # FIXME: this will break in Y2K · REM KLUDGE: don't ask why this works · ; WORKAROUND: compiler bug in Turbo Pascal 5.0 · // TODO: remove before shipping (1997-08-23) · /* XXX: whoever wrote this, I'm sorry */
01

WHAT IS THIS

Source code from the 1980s through the 2000s was written long before the age of AI. It is a software antique.

Hand-typed, made to work, left behind, and forgotten. We collect it from dead platforms before it disappears, and explore creative ways to use it today.

“Clean OSS is edited literature. Legacy code is field recordings.”

Collect first

Quietly, before it disappears

Find interesting uses

Creatively, without constraints

Answers come later

The time spent reading is part of the value

02

INTERESTING NUMBERS

27B

unique source files

Software Heritage

220B

lines of COBOL

still in production

30-50%

industrial dead code

nobody understands

50%+

OSS projects die

within first 4 years

6.6yr

code half-life

Linux Kernel

4mo

code half-life

Angular (20x shorter)

3x

more predictable

code vs English
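
To make the two half-life figures concrete, here is a minimal sketch that treats code survival as simple exponential decay. That decay model is our assumption for illustration, not something the quoted figures state, and the function name is hypothetical.

```python
def surviving_fraction(elapsed_years: float, half_life_years: float) -> float:
    """Fraction of the original lines still present after elapsed_years.

    Assumes plain exponential decay, which is an illustrative simplification.
    """
    return 0.5 ** (elapsed_years / half_life_years)


LINUX_HALF_LIFE = 6.6        # years, the Linux Kernel figure quoted above
ANGULAR_HALF_LIFE = 4 / 12   # 4 months, expressed in years

# 6.6 years is 79.2 months, so the ratio is about 19.8 (the "20x shorter").
ratio = LINUX_HALF_LIFE / ANGULAR_HALF_LIFE

# Under the decay assumption, compare what is left after two years.
linux_after_2y = surviving_fraction(2, LINUX_HALF_LIFE)      # roughly 0.81
angular_after_2y = surviving_fraction(2, ANGULAR_HALF_LIFE)  # roughly 0.016

print(f"ratio: {ratio:.1f}x")
print(f"Linux after 2 years:   {linux_after_2y:.1%}")
print(f"Angular after 2 years: {angular_after_2y:.1%}")
```

Under this model, about four fifths of a Linux-like codebase would still be recognizable after two years, while an Angular-like one would be almost entirely rewritten.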

03

COLLECTION SOURCES

Tier  Source               Scale              Years           Status
A     Google Code Archive  1.4M projects      2006 - 2015     PRIORITY
A     CodePlex Archive     108K repos         2006 - 2017     ARCHIVED
A     SourceForge          ~500K projects     1999 - present  ACTIVE
B     GitHub Abandoned     Millions           2008 - present  FILTERING
C     Academic Code        ∞ (74% broken)     Various         RICH SPAGHETTI
D     Government OSS       17K+               Various         PUBLIC DOMAIN
E     Retro / Demoscene    Cultural heritage  1980s - 2000s   HISTORICAL

04

WHAT WE MIGHT DO

You don't have to decide yet. Some things only become visible after you collect enough.

01

TODO Archaeology

A fossil record of developer struggles etched in comments. Surprisingly untouched territory.

NOVELTY 10/10 · LOW COST

02

Code Archaeology AI

An antique appraiser that estimates the age, origin, and context of unknown fragments.

NOVELTY 10/10
03

Software Natural History

Track the rise and fall of paradigms across 20 years of code. A natural history of software.

ACADEMIC
04

Before / After Pairs

A parallel corpus of spaghetti-to-clean refactors. Fun even to just browse.

FEASIBILITY 9/10
05

Code Sonification

Turn the rhythm and structure of old code into sound. Listen to the legacy.

CREATIVE
06

Legacy Whisperer

What happens when an AI that has only ever seen clean OSS is trained on messy real-world code?

HIGH IMPACT

☙ FOUND IN: payroll_calc.c — last modified 2003-04-12

// ============================================
// FIXME: this workaround has been here since 1998
// Original author: unknown (left company in 2001)
// Last modified: 2003-04-12
// Nobody knows why removing this breaks payroll
// ============================================
if (month == 2 && day > 28) {
    day = 28; // TODO: handle leap years properly
    // HACK: just... don't deploy in February
}
// See you space cowboy...

Do you have code like this sleeping somewhere? We are pretty good at reading it.
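
As a rough illustration of what TODO Archaeology could look like in practice, the sketch below digs marker comments out of a source fragment and records where they sit. The marker list, the date pattern, and the `dig` function are our own assumptions for this example, not a description of an existing tool.

```python
import re
from collections import Counter

# Markers taken from the comments shown on this page; the list is an
# assumption and certainly not exhaustive.
MARKERS = ("TODO", "FIXME", "HACK", "XXX", "KLUDGE", "WORKAROUND")
MARKER_RE = re.compile(r"\b(" + "|".join(MARKERS) + r")\b:?\s*(.*)")
# ISO-style dates such as 2003-04-12, when a comment happens to carry one.
DATE_RE = re.compile(r"\b(?:19|20)\d{2}-\d{2}-\d{2}\b")


def dig(source: str) -> list:
    """Return one record per marker comment found in the source text."""
    finds = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        match = MARKER_RE.search(line)
        if match is None:
            continue
        date = DATE_RE.search(line)
        finds.append({
            "line": line_no,
            "marker": match.group(1),
            "note": match.group(2).strip(),
            "date": date.group(0) if date else None,
        })
    return finds


SAMPLE = """\
// FIXME: this workaround has been here since 1998
// Last modified: 2003-04-12
if (month == 2 && day > 28) {
    day = 28; // TODO: handle leap years properly
    // HACK: just... don't deploy in February
}
"""

finds = dig(SAMPLE)
counts = Counter(find["marker"] for find in finds)

for find in finds:
    print(find["line"], find["marker"], find["note"])
```

Run over a whole archive instead of one fragment, the same records could be grouped by marker, language, or year to see which struggles recur.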

05

COLLECTION PIPELINE

01

DISCOVER

API / Archive index

02

CLONE

--depth 1

03

EXTRACT

Metadata + cloc

04

SCAN

Smells + Secrets

05

STORE

Parquet + Raw
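
The five stages above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not our production pipeline: every function name is hypothetical, the line counter is a crude stand-in for cloc, the secret pattern is deliberately naive, and the store step writes JSON Lines because Parquet output would need a third-party library.

```python
import json
import re
from collections import Counter
from pathlib import Path

SMELL_RE = re.compile(r"\b(TODO|FIXME|HACK|XXX)\b")
# Naive, hypothetical secret pattern: a key-like name assigned a quoted literal.
SECRET_RE = re.compile(
    r"(?i)\b(password|passwd|api_key|secret)\b\s*=\s*[\"'][^\"']+[\"']"
)


def clone_command(repo_url, dest):
    """CLONE: build the shallow-clone command; run it with subprocess.run."""
    return ["git", "clone", "--depth", "1", repo_url, str(dest)]


def source_files(repo_dir):
    """Yield every regular file in the checkout, skipping .git internals."""
    for path in sorted(repo_dir.rglob("*")):
        if path.is_file() and ".git" not in path.relative_to(repo_dir).parts:
            yield path


def extract(repo_dir):
    """EXTRACT: file and line counts per extension (a stand-in for cloc)."""
    files, lines = Counter(), Counter()
    for path in source_files(repo_dir):
        ext = path.suffix.lower() or "(none)"
        files[ext] += 1
        lines[ext] += len(path.read_text(errors="replace").splitlines())
    return {"files": dict(files), "lines": dict(lines)}


def scan(repo_dir):
    """SCAN: count smell markers and strings that look like secrets."""
    smells = secrets = 0
    for path in source_files(repo_dir):
        text = path.read_text(errors="replace")
        smells += len(SMELL_RE.findall(text))
        secrets += len(SECRET_RE.findall(text))
    return {"smell_markers": smells, "possible_secrets": secrets}


def build_record(repo_url, repo_dir):
    """Combine the EXTRACT and SCAN results for one discovered repository."""
    return {
        "source": repo_url,
        "metadata": extract(repo_dir),
        "scan": scan(repo_dir),
    }


def store(record, out_file):
    """STORE: append one JSON line per repository (Parquet stand-in)."""
    with open(out_file, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
```

DISCOVER is represented here only by the `repo_url` argument; in practice the URLs would come from an API or archive index. A real run would execute `clone_command(...)` with `subprocess.run(..., check=True)` before calling `build_record` on the checkout, and would keep the raw checkout alongside the stored record.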

06

FAQ

Frequently Asked Questions

Q01 What is Legacy Code Archive?

Legacy Code Archive is a software archaeology project that systematically collects and preserves legacy code from the pre-AI era (roughly the 1980s to the 2000s). We rescue code sleeping in long-forgotten platforms before it disappears, and explore creative ways to use it.

Q02 What kind of code do you collect?

We focus on pre-AI era code. Sources include Google Code Archive (1.4M projects), CodePlex Archive (108K repositories), SourceForge (~500K projects), abandoned GitHub repositories, academic code, government OSS, and retro/demoscene works (1980s–2000s).

Q03 Why does old code matter?

Old code can be a “software antique” that is often missing from modern AI training data. TODO comments preserve real developer struggles like fossils, and recurring code-smell patterns can inform today’s software quality work. Tracing paradigm shifts across decades of code also has academic value as a natural history of software.

Q04 How will the collected code be used?

We explore both research and creative directions: TODO Archaeology (comment fossil records), Code Archaeology AI (estimating age/origin/context of fragments), Software Natural History (paradigm shift studies), Before/After pair datasets (refactoring comparisons), Code Sonification (turning structure into sound), and “Legacy Whisperer” (training AI to understand messy code).

Q05 Can I contribute code or information?

Yes. We welcome information about old codebases, research collaboration, and introductions to sources. Please contact us and select “Legacy Code Archive” in the contact form. We focus on code with explicit licenses in public repositories.

Q06 How is it different from Software Heritage?

Software Heritage is a large-scale archive preserving tens of billions of source files. Legacy Code Archive is also about creative reuse: interpreting and exhibiting old code through methods like TODO archaeology and code sonification, and discovering new value in “antiques,” not only storing them.

GET IN TOUCH

If this sparked your curiosity, let's talk.

Whether it’s about collection/research, or “please do something about this code,” we’d love to hear from you.

CONTACT US

LEGACY CODE ARCHIVE

The antiques dealer doesn't explain.

They only say, “This is good, isn't it?”