Vintage Computer Festival

The Valley of Lost Data
Excavating Hard Drives and Floppy Disks

Sellam Ismail and Dr. Christine Finn (sellam@vintage.org)
Vintage Computer Festival

October 27, 2001

This paper was presented at the Archäologie und Computer conference at Forschungsgesellschaft Wiener Stadtarchäologie in Vienna, Austria on November 5-6, 2001.

This paper explores some ideas concerning computers as repositories in the sense of archaeological sites containing data - objects, bodies and so on. The authors, who are both involved in computer history (one an archaeologist and writer on the subject, the other a leading collector and curator of vintage computers in the US), will consider the ways in which material in a computer may be accessed in the manner of evidence extracted from sites in a more orthodox archaeological situation.

The paper will look at the types of data stored, the hardware and software involved, and the ways in which the data can be retrieved. The authors will discuss issues concerning the ethical retrieval of such data and the time frame over which material becomes inaccessible.

Introduction

The purpose of this paper is to discuss a burgeoning field of archaeology that has only some relevance today but will surely become increasingly significant as time goes by.

Archaeology on computers will grow as a field of study, and in fact is already practiced today in a limited context.  Instead of using computers as tools for archaeology, we will be performing archaeology on the computers themselves, specifically on the data that they and their associated storage devices contain.  In simpler terms, the computer becomes the artifact.

A Treasure Trove of Sociological and Anthropological Data

The impact of social changes brought about by the use of computers has been steadily monitored over the past 30 years.  The personal computer and the advent of the Internet have inspired studies of how families, individuals and workers use new technology.  The study of the material culture of Silicon Valley evidences the transformation of a cutting-edge idea into an everyday tool and finally into an obsolete hulk of hardware.  Some machines earn new value as collector's items and find homes in museums.  But what we are looking at in this paper is the lost data hidden in old machines; material which is barely decipherable and only then by those who have the code: a sort of Rosetta stone for the 21st century.

An amazing amount of information can be culled from the hard drive of an old computer.  A year ago, I [Sellam Ismail] brought home a computer from a charity shop near my house.  It was an Apple Macintosh Classic II, which contains an internal hard drive holding the operating system.  Though the computer was about 10 years old, it still worked perfectly (as most old computers do; they are actually hard to kill).  I booted it up and began to explore the hard drive.  I found that the previous owner had failed to take into account the possibility that their computer would one day be used again by a complete stranger, and had neglected to delete any of their personal files from the machine.  I was soon wading into the life of a person who lived not too far from me (on a street around the corner, in fact).

On the hard drive I found personal letters, resumes (CVs), job applications, financial records such as bank balances and other holdings, digital photos, and email.  As a researcher, I could use this information to reasonably reconstruct the life of the person who used this computer before.

Use It, But Don't Abuse It

On a darker note, I could also use this information maliciously, which highlights the need for people to be wary about what they do with their old computers and data before sending them off to the skip or donating them to a charity.  This is a double-edged sword.  On the one hand, people definitely need to control their personal data.  Had I been a psychopath, I very well could have used the data I found on the hard drive of the Macintosh to rain information terror on my neighbors.  On the other hand, mining the hard drive of a discarded computer can produce a rich array of information that yields many clues about a particular culture, and in this case about a particular individual or family.  In this day and age of electronic everything, especially in modern Western cultures, more and more people use computers for everything from writing letters to managing their finances to storing their photos and music.  Most personal correspondence is in the form of ethereal e-mail and even more fleeting "instant messaging".  The only records that future researchers may have of our culture today will be in the form of electronic letters and records.

Of course, it comes down to a matter of ethics.  As researchers and scholars, we must make every effort to preserve the privacy of the subjects we study as much as we preserve the artifacts that hold that private data.  In perhaps 100 years, it won't be as critical, as the relevance of the data under study may well be minimal, if it has any at all.  However, this issue does need to be addressed now.

Computer collectors have already debated this issue and have come up with the Classic Computer Collector's Code of Conduct (see footnote 1):

  • I will do my damndest to find a home for any classic or unwanted computer.
  • I will return or destroy any personal or commercially sensitive data I find on a machine I acquire, and will keep it in the strictest confidence, should I find it necessary to view it.
  • I will aid users in the decommissioning of their machines, should they require assistance.
  • I will respect active software and publication copyrights.
  • I will, whenever possible, repair the computers in my collection and maintain them in working order, and will assist others in doing the same, to the best of my ability. I will actively encourage the repair, maintenance, and use of older computers, in preference to the irreversible alteration of machines and parts for non-computer applications.
  • I will actively promote the exchange of computers, parts, and information among collectors, and will refrain from hoarding multiple examples of any item.
  • I will actively promote ethical collecting.

Media Preservation

Another concern for us is the lifespan of data storage media, from both a practical and a historical perspective.  Data stored on magnetic media such as floppy disk or tape will eventually decay and become unreadable.  Estimates for the longevity of data stored on floppy disk range from 15 to 20 years (see footnote 2), which means a lot of the data stored on floppy disks from the 1980s is theoretically beginning to sour.  Even CDs aren't forever.  Estimates gleaned from accelerated testing show that data stored on a CD-ROM will only last for about 100 years (see footnote 3).  For CD-R (recordable) and CD-RW (rewritable) technologies, the data retention estimates are much lower.

The only good long-term medium is tried-and-true paper, which we know can last for hundreds if not thousands of years when properly stored.  Punch cards and paper tape from the 1950s and 1960s are still quite readable.  The only problem with paper tape is that it tends to get brittle if not stored properly.  Even paper tape or punch cards that would not survive a trip through a reader can be read with non-harmful processes such as optical scanning.  Because the data is stored as holes punched in the paper, those holes can be scanned optically and then interpreted with software, something that cannot be done with magnetic media.
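
As a rough illustration of the software half of that process, the sketch below assumes a scanner has already reduced each row of an 8-channel paper tape to a string of marks, with "o" for a punched hole and "." for unpunched paper, high bit first.  That row format and the function name are assumptions made for this example; real tapes also carry sprocket holes and use several different character codes, which are ignored here.

```python
def decode_tape_rows(rows):
    """Turn scanned rows of an 8-channel paper tape into bytes.

    Each row is a string of 8 characters: 'o' marks a punched hole (bit 1),
    '.' marks unpunched paper (bit 0). The leftmost character is the high bit.
    This layout is a hypothetical scanner output, not a standard format.
    """
    data = bytearray()
    for row in rows:
        if len(row) != 8 or set(row) - {"o", "."}:
            raise ValueError(f"unreadable row: {row!r}")
        value = 0
        for mark in row:
            # Shift in one bit per channel, most significant first.
            value = (value << 1) | (mark == "o")
        data.append(value)
    return bytes(data)


scanned = [".o..o...", ".o..o..o"]  # bit patterns 01001000 and 01001001
print(decode_tape_rows(scanned))    # b'HI'
```

A row that cannot be read raises an error rather than being guessed at, so damaged stretches of tape can be flagged for a second scan.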

Where this concerns archaeologists is that a lot of the data that will be useful for future study is slowly dying.  Electronic records stored on any kind of magnetic media are at risk of being lost over time.  And as mentioned, not even CDs are immune.  This issue is being discussed and debated with increasing interest because of its urgency.

The Association for Information and Image Management International (AIIM), a standards developer accredited by the American National Standards Institute (ANSI), defines "Archival Medium" as follows:

"Recording material that can be expected to retain information forever, so that such information can be retrieved without significant loss when properly stored.  However, there is no such material and it is not a term to be used in American National Standard material or system specifications."

ANSI/AIIM TR2-1998, "Glossary of Document Technologies", Association for Information and Image Management International, Silver Spring, MD

As we can see, even scientific bodies charged with determining ways to make data last indefinitely are stumped by the problem.

Imaging Standards in Development

Collectors of old computers have begun to tackle this problem by proposing an imaging standard for media of all types.  The standard was initially geared towards the preservation of floppy disks used by old computers, but it was soon realized that the standard should be expanded to accommodate any type of media, be it floppy disk, magnetic tape, paper tape, punch card, etc.

The current implementation is a markup language, which allows flexibility in the description of the media as well as the storage of the actual data.  An example for a floppy disk would be as follows:

<MEDIA TYPE=FLOPPY SIZE=5.25 SIDES=1 DENSITY=SINGLE
FORMAT=GCR TRACKS=35 SECTORS=16 SECTORSIZE=256>
<VOLUME>Apple ][ System Disk</VOLUME>
</MEDIA>

<DATA><TRACK 0><SECTOR 0>
HERE WOULD BE THE DATA FOR TRACK 0, SECTOR 0
</SECTOR></TRACK>
...
<TRACK 34><SECTOR 15>
HERE WOULD BE THE DATA FOR TRACK 34, SECTOR 15
</SECTOR></TRACK></DATA>

The above is a simple example demonstrating the imaging of an Apple ][ system diskette.  The media is first defined with the MEDIA tag, including the type of media (floppy diskette), size of media (5.25"), number of sides (in this case only one), density of data stored (single), format of recording (Group Code Recording or GCR), number of data tracks (35), number of sectors per track (16), and number of bytes per sector (256).
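
To make the role of the MEDIA tag concrete, here is a minimal sketch of how software might read its attributes and work out how much data the image should hold.  The function names are hypothetical, and the parsing assumes the unquoted KEY=VALUE style shown in the example above; the standard was still in development, so none of this should be read as its official behavior.

```python
import re


def parse_media_tag(tag):
    """Return the attributes of a MEDIA tag as a dict of strings."""
    match = re.fullmatch(r"\s*<MEDIA\s+(.*?)>\s*", tag, flags=re.DOTALL)
    if match is None:
        raise ValueError("not a MEDIA tag")
    # Attributes are unquoted KEY=VALUE pairs separated by whitespace.
    return dict(pair.split("=", 1) for pair in match.group(1).split())


def capacity_in_bytes(attrs):
    """Total data bytes: sides x tracks x sectors per track x bytes per sector."""
    return (int(attrs["SIDES"]) * int(attrs["TRACKS"])
            * int(attrs["SECTORS"]) * int(attrs["SECTORSIZE"]))


header = """<MEDIA TYPE=FLOPPY SIZE=5.25 SIDES=1 DENSITY=SINGLE
FORMAT=GCR TRACKS=35 SECTORS=16 SECTORSIZE=256>"""
attrs = parse_media_tag(header)
print(attrs["FORMAT"], capacity_in_bytes(attrs))  # GCR 143360
```

For the diskette in the example, 1 side x 35 tracks x 16 sectors x 256 bytes comes to 143,360 bytes, which gives a reader of the archive a simple check that no sectors are missing.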

The data is then laid out between the DATA tags, with the bytes of every sector of every track represented as hexadecimal (see footnote 4) bytes stored in standard ASCII (see footnote 5) format.  This archive would be a symbolic digital representation of the data of the original disk, and could be used to reconstruct that disk at a future time, or could be parsed by software and fed into an emulator program running on a more modern computer platform.
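
The round trip between raw sector bytes and their hexadecimal ASCII form can be sketched as follows.  The layout of 16 space-separated bytes per line and the function names are assumptions chosen for readability; the paper does not specify how the hexadecimal text is arranged inside a SECTOR tag.

```python
def sector_to_text(data, bytes_per_line=16):
    """Render raw sector bytes as lines of two-digit hexadecimal ASCII."""
    lines = []
    for start in range(0, len(data), bytes_per_line):
        chunk = data[start:start + bytes_per_line]
        lines.append(" ".join(f"{byte:02X}" for byte in chunk))
    return "\n".join(lines)


def text_to_sector(text):
    """Rebuild the raw bytes from hexadecimal text, ignoring all whitespace."""
    return bytes.fromhex("".join(text.split()))


sector = bytes(range(256))            # stand-in for one 256-byte sector
text = sector_to_text(sector)
print(text.splitlines()[0])           # 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
assert text_to_sector(text) == sector  # the round trip loses nothing
```

Because the archived form is plain text, it can be printed, e-mailed, or retyped by hand if necessary, at the cost of taking two to three times the space of the raw data.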

As this standard is developed and then codified, a structured effort to preserve old software can be undertaken and a central repository be established.  This repository can be made available over the Internet, and software and data from bygone eras can be guaranteed everlasting life, but only as long as there are humans around to administer the archive.  This is important, because the data has to be kept "alive" somehow.  Data on the Internet is stored on servers, and these servers are simply computers with hard drives, mechanical devices that themselves will decay and die over time.  Even if this archive were to be backed up to tape or CD, those tapes and CDs are still susceptible to the same basic problem: eventually they too will die.

As you can see, long-term computer data preservation is a problem.  As previously discussed, paper media such as paper tape and punch cards are safer than magnetic or optical media.  One very long-term solution that has been suggested is to use Mylar punch cards.  Mylar is impervious to just about anything save for high-temperature fire.  Data punched into Mylar cards would be safe for millennia.  The problem is that the sheer amount of data needing preservation would require square miles of the stuff and would be extraordinarily impractical to store.
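
A back-of-the-envelope calculation shows the scale involved.  The sketch below assumes the Mylar cards keep the dimensions of a standard 80-column punch card (7 3/8 by 3 1/4 inches) and hold one character per column, so 80 bytes per card; both are assumptions for illustration, since the paper does not say what form the Mylar cards would take.

```python
CARD_WIDTH_IN = 7.375    # standard 80-column card: 7 3/8 inches wide
CARD_HEIGHT_IN = 3.25    # and 3 1/4 inches tall
BYTES_PER_CARD = 80      # assumption: one character per column
SQ_INCHES_PER_SQ_MILE = 63360 ** 2  # 63,360 inches in a mile


def card_area_square_miles(total_bytes):
    """Area covered if every card needed were laid flat in a single layer."""
    cards = -(-total_bytes // BYTES_PER_CARD)  # ceiling division
    return cards * CARD_WIDTH_IN * CARD_HEIGHT_IN / SQ_INCHES_PER_SQ_MILE


for label, size in [("1 gigabyte", 10**9), ("1 terabyte", 10**12)]:
    print(f"{label}: {card_area_square_miles(size):.3f} square miles")
```

Under these assumptions a single terabyte comes to roughly 75 square miles of card laid out flat.  Cards would of course be stacked rather than spread out, but the figure gives a feel for why the approach does not scale.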

Another factor of obsolescence that one must consider with data archiving is that the devices used to read media are constantly being upgraded and reinvented.  How many people still know where to get their hands on a punch card or paper tape reader?  When was the last time anyone saw even a 5.25" floppy drive in a computer?  It's only a matter of time before the CD-ROM drives that are so prevalent in computers today are superseded by some new technology that will make CDs obsolete.

Below is a chart of data media and the relative scarcity of devices needed to read the media:

Hard to Find

  • Punch card readers
  • Paper tape readers
  • Magnetic tape readers

Becoming Hard to Find

  • 5.25" floppy drives
  • Drives for Magneto-Optical disks
  • Various tape drive formats (QIC)
  • Interfaces for old MFM hard drives

Will Eventually Become Hard to Find

  • 3.5" floppy drives
  • CD-ROM drives

A Virtual Cultural Record

One of the most important aspects of our modern culture is the World Wide Web, which by nature is ethereal and fleeting.  No hard copy exists, nor could a hard copy of the WWW reasonably be made.  Fortunately, some institutions are taking it upon themselves to archive the World Wide Web digitally by taking periodic "snapshots" of it.  The most prominent is The Internet Archive: Building an "Internet Library" (http://www.archive.org/).

Their current archive is 40 terabytes (forty trillion bytes, or 40,000,000,000,000) of World Wide Web pages archived since 1996.  They also have archives of Usenet, historical documentation of the ARPAnet (the precursor to the Internet), and a movie archive.

The Rosetta CD?

An interesting question was posed by an audience member at one of Christine's book readings: how would an archaeologist make sense of the artifacts found from a civilization whose technology was more advanced than our own?

Depending upon the complexity of the artifact, there is a slight chance that we would be able to determine what it was and what it did.  If the technology were just beyond our own, then perhaps we might be able to figure it out, but if it were far beyond our own level of knowledge, there would be little chance of us taking that quantum leap to understanding that technology.  There might also be some element missing from the artifact, for example an energy form that has long since dissipated.

Imagine a great cataclysm occurs within the next millennium.  Knowledge of the previous civilization is lost and technology reverts.  If a CD were to be discovered 1,000 years from now in this post-apocalyptic world, what would archaeologists make of it?  Would they be able to determine that the microscopic pits on the substrate were some form of information encoding?  Would they be able to decode the format of this data and make sense of it?  If the knowledge of the data format is not documented and kept somewhere safe, it will be like deciphering the hieroglyphics on the pyramids.

Digital Archaeology

Media obsolescence is a real and contemporary problem.  I have been called upon several times to resurrect data from obsolete formats.  One such project was from a contractor to NASA in the United States.  This contractor had program code for the Space Shuttle program that was stored on 8" floppy disks.  8" floppy disks (the original size when floppies were first invented) were standard throughout the 1970s and early 1980s.  I was able to find a computer in my collection that could read the data, and was able to then transfer the original program files from this computer to a modern PC over the serial communications port, and save them to the hard drive of the PC.  I actually e-mailed the files back to the contractor.

Another project I worked on previously was for a geophysicist.  He was attempting to recover data from some 8" disks that held geological data for the country of Guatemala.  This data was originally generated through a United Nations program and would be useful to the geophysicist for detecting mineral deposits within Guatemala's borders.  I was able to identify that there was data on those disks, but the software needed to read that data and make sense of it could not be located.  In this case, it would have taken more time and money than the geophysicist felt was reasonable to attempt to manually decode the data and convert it to a more modern disk format.

A project I am currently evaluating is to convert a music database from an older program to a more modern format.  The client was using a program called Superfile to manage his classical music collection, but he has since lost the original program and is trying to find a replacement.  The original program was written for a CP/M computer (CP/M was the dominant operating system before MS-DOS and Windows took over).  I will be converting the database to a more modern format that will bring him into the modern era, but he will at some point again need to convert this data to a future format in order to keep it alive.

Lastly, I am discussing a project with a professor at the University of California at Davis to convert some data he has on punched cards to a modern format.  This will involve connecting a 25+ year old punch card reader to a modern PC and reading the data on the cards, then storing it in a modern electronic file format, and eventually burning it to a CD.  This data will be good for a while, but there may come a day when I am hired again to convert the data from CD to some as yet uninvented format.
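
Once the reader reports which rows are punched in each column, turning the cards into text is a table lookup.  The sketch below covers only the digits and uppercase letters of the common Hollerith-style card code, in which a letter combines a zone punch (row 12, 11 or 0) with a digit punch.  The input format (one set of row numbers per column) is a hypothetical reader output, and the professor's cards may well use special characters, a different code, or binary data that this does not handle.

```python
def build_hollerith_table():
    """Map sets of punched rows to characters (space, digits and letters only)."""
    table = {frozenset(): " "}  # a column with no punches is a blank
    for digit in range(10):
        table[frozenset([digit])] = str(digit)
    # Letters combine a zone punch (row 12, 11 or 0) with a digit punch.
    for zone, letters, first_digit in [(12, "ABCDEFGHI", 1),
                                       (11, "JKLMNOPQR", 1),
                                       (0, "STUVWXYZ", 2)]:
        for offset, letter in enumerate(letters):
            table[frozenset([zone, first_digit + offset])] = letter
    return table


HOLLERITH = build_hollerith_table()


def decode_card(columns):
    """Decode one card; unknown punch combinations become '?'."""
    return "".join(HOLLERITH.get(frozenset(col), "?") for col in columns)


card = [{12, 8}, {12, 5}, {11, 3}, {11, 3}, {11, 6}]
print(decode_card(card))  # HELLO
```

Marking unrecognized columns with "?" instead of dropping them keeps the column positions intact, which matters for data laid out in fixed-width fields.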

Conclusion

Data preservation should not just be a concern for computer scientists.  Valuable archaeological data that will be not only relevant but important for researchers of the future is contained on media that are already being lost today.  This data must not only be preserved now, but standards and practices must be put into place to ensure that future generations of scholars wanting to learn about the past that is our present are not confronted with the futile challenge of decoding long-since-dissipated magnetic fluctuations on decaying pieces of plastic.

We are also left with a challenge: what to preserve?  How do we decide what is worth preserving, and how would this skew the way we, as a culture, are viewed many years from now?  While deciding on the data we leave behind, we are also posing questions of the archaeological data we have become used to dealing with.  Was there any selection process before now?  The choice of objects left in tombs and graves is evidence to us, but in our own modern context.  Can we project our self-consciousness about what is personal, valuable or vulnerable onto previous, and long-since lost civilizations?  Archaeologists are always looking for the key to the ancient world, but how do we leave the source code for our own complex society?

Hopefully these issues can be resolved, and soon.  And don't be surprised if the solution ends up being as low-tech as chisels and stones.

References

1. Rothenberg, Jeff, 1998; "Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation" http://www.clir.org/pubs/reports/rothenberg/contents.html

2. International Organization for Standardization: Technical Committee 42-Photography, January 3, 2001; "Imaging materials - Life expectancy of information stored on compact discs (CD-ROM) - Method for estimating, based on effects of temperature and relative humidity" (ISO 18921) http://www.pima.net/standards/iso/tc42/wg05/ISO_18921/ISO_18921.pdf

3. International Organization for Standardization: Technical Committee 42-Photography; "Life expectancy of information stored on magneto-optical discs. Method for estimating, based on the effects of temperature and relative humidity" (ISO 18926)

4. International Organization for Standardization: Technical Committee 42-Photography, June 20, 2001; "Imaging materials - Life expectancy of information stored on recordable compact disc systems - Method for estimating, based on effects of temperature and relative humidity" (ISO 18927) http://www.pima.net/standards/iso/tc42/wg05/ISO_18927/N4767_FDIS18927.pdf

Footnotes

1. This particular version of the CCCCC was developed by Roger Sinasohn <roger@sinasohn.com>, a collector of mostly portable computers from San Francisco, California.  See also Uncle Roger's Classic Computers at http://sinasohn.com/clascomp/.

2. Estimated range based on practical industry experience.

3. Estimates of longevity vary according to each manufacturer.  The figure of 100 years is a generally accepted lifespan for a CD-ROM stored under favorable physical conditions.

4. Hexadecimal is a Base 16 numbering system where digits go from 0 - F (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F)

5. American Standard Code for Information Interchange



Copyright © 1997-2014 Vintage Computer Festival
Vintage Computer Festival, VCF and the VCF logo are trademarks of VintageTech