Open-Source Software: A Promising Piece of the Digital Preservation Puzzle

Cal Lee
University of Michigan
School of Information

[This document reflects the text of the article as originally written, though I have corrected a few typos, updated one link and made some minor formatting changes for easier online navigation. A slightly different version appears as "Open-Source Software: A Promising Piece of the Digital Preservation Puzzle," Electronic Currents, Midwest Archives Conference (MAC) Newsletter, Volume 29, Number 2 (113), October 2001, 26-28.]

In this essay, I hope to explain what source code is and why it matters to archivists concerned with digital preservation. Given the importance of source code, open-source software (OSS) development holds great promise for facilitating our efforts to keep digital materials accessible into the future.

A Brief Look at our Friend, The Computer

Computers don't understand what we want them to do. Well, not directly, anyway. This is because, at their most basic level, computers are just collections of devices that deal with streams of electricity. Through numerous magical tricks, electricity can be broken into tiny little chunks. Engineers and technicians who develop, evaluate and maintain the physical devices must often deal with aspects of the electricity itself. Most computer professionals, however, can ignore those physical details and instead deal with the chunks, which they treat as signals or bits (binary digits), conveying one of two possible values (1 or 0, true or false, on or off). By processing, sharing and storing combinations of these signals, the devices can perform many different instructions.
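
To make the idea of bits a little more concrete, here is a minimal sketch in Python (a language chosen purely for illustration). It assumes the common ASCII encoding, in which each character occupies eight bits, and the function names text_to_bits and bits_to_text are made up for this example.

```python
def text_to_bits(text):
    """Return the bits behind each character in `text`, assuming ASCII."""
    # encode() turns the characters into numbered bytes; format(..., "08b")
    # writes each of those numbers out as eight binary digits.
    return [format(byte, "08b") for byte in text.encode("ascii")]


def bits_to_text(bit_groups):
    """Reverse the translation: turn groups of eight bits back into text."""
    return bytes(int(group, 2) for group in bit_groups).decode("ascii")


if __name__ == "__main__":
    print(text_to_bits("Hi"))  # ['01001000', '01101001']
```

Neither group of 1s and 0s means anything by itself; it only becomes the letter "H" or "i" once some layer of the system agrees to read it that way.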

The story can't really end there, though, since not many people are really very good at thinking in terms of those instructions either. What the end user really wants to do is perform some task like write a letter, listen to some music, or (we would hope) find the hours of operation of an archival repository. If doing such things through a computer required an intimate knowledge of how all the bits were being created and processed, it would hardly be worth the trouble. It would be much more helpful for us to be able to convey our needs to a computer in a way that made sense to us, and then let the computer figure out how to translate that into the appropriate bits. Luckily for us (at least, when it all works), modern computers have numerous components in place to do exactly that.

When I use the mouse to move the pointer over a file name and then double-click on it, for example, an image then appears on my monitor that looks to me like a document. I can make some changes to the document and then save it, allowing someone to view that changed version some time in the future. By anticipating the sorts of tasks people will want to perform, computer engineers have built a complex system that translates my mouse click into the appropriate generation and processing of bits. This is done partially through hardware (the devices themselves), but mostly through software (combinations of bits that can perform operations on other bits). Since different pieces of hardware and software deal with bits in very different ways, this process actually has to take the form of a huge number of tiny little translations.

In order to make this massive job much more manageable, developers take advantage of an extremely powerful concept known as abstraction. This allows different parts of the problem to be addressed as layers, which then "talk" to each other. If I'm building a component in layer X that has to work with some other component that you're building for layer Y, I don't need to know all the intimate details of how your component works or even everything about how layer Y works. I just need to know how to talk to it. The point of interaction between two layers is called an interface, and the conventions for communication across the interfaces are (generally) called protocols. For most computer systems, the very highest layer is the one with which humans directly interact. If the interaction is through windows, icons, mouse and pointer (WIMP), then it is called a graphical user interface (GUI). If all interaction is through typing commands with the keyboard, which then get interpreted by a command interpreter (or shell), then this is called a command line interface. Interaction between end users and today's computers often involves both types of interface.
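
The layering idea can be sketched in a few lines of Python. This is only an illustration under assumed names (Storage, MemoryStorage and save_letter are all hypothetical): the component in the upper layer knows the interface of the lower one, not its inner workings.

```python
from abc import ABC, abstractmethod


class Storage(ABC):
    """The interface: the only thing layer X knows about layer Y."""

    @abstractmethod
    def write(self, name, data):
        ...

    @abstractmethod
    def read(self, name):
        ...


class MemoryStorage(Storage):
    """One possible layer Y component; it keeps everything in a dictionary."""

    def __init__(self):
        self._files = {}

    def write(self, name, data):
        self._files[name] = data

    def read(self, name):
        return self._files[name]


def save_letter(storage, text):
    """A layer X component: it talks to storage only through the interface."""
    storage.write("letter.txt", text.encode("ascii"))
    return storage.read("letter.txt").decode("ascii")
```

Any other component that offers the same write and read operations (one backed by a disk, say) could be swapped in for MemoryStorage without changing save_letter at all.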

So What is Source Code?

In order to perform most of the functions described above, programmers create software by writing long series of instructions, called source code, conforming to a specific programming language, such as C or C++. This programming language is the programmers' way of speaking to the computer. By translating the tasks that they think the human users will wish to perform (e.g. open a document) into this more formal language, programmers are acting as intermediaries between users and lower levels of the system.

As described above, however, computer devices don't deal with programming languages; they deal with bits. Source code must go through several steps before it becomes an executable program. Though this process varies from one language to another [1], it often involves passing the source code through a compiler, which creates object code, and then passing the object code through a linker, which combines the various modules to create machine code. There are various types of machine code, each of which can only be understood by computers that have a specific type of central processing unit (CPU), e.g. the Intel x86. If one wishes to run a program on a computer with a different type of CPU, it must be either rewritten or recompiled from the source code.
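
The translation from source code to something a machine can act on can be observed on a small scale. The sketch below uses Python, which follows the byte code route mentioned in footnote 1 rather than the compiler-and-linker route of C or C++, so it should be read as an analogy for the process described above, not as an instance of it. The one-line program and the names in it are made up for illustration.

```python
source = "hours = 9 + 8"  # source code: text a person can read

# compile() is Python's built-in translator from source text to byte code.
code = compile(source, "<example>", "exec")

# The translated form is a sequence of raw bytes, no longer readable text.
print(type(code.co_code))

# Running the translated form produces the result the source described.
namespace = {}
exec(code, namespace)
print(namespace["hours"])
```

Byte code like this is tied to a particular version of the Python interpreter, much as machine code is tied to a particular type of CPU. The source text is the form that can be read, changed and translated again.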

What is Open-Source Software?

Open-source software (OSS) is software for which the source code is freely available for anyone to see and manipulate. There are various licensing models to which the OSS label has been applied, but the basic idea is that the software's "license may not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs" and the working software must either be distributed along with its source code or have a "well-publicized means of downloading the source code, without charge, via the Internet." [2] That is, anyone can access and manipulate the code from which a program was built. Some licenses, such as the GNU General Public License, add the condition that anything that person distributes using that code is also offered to the public as OSS. This allows those who use the software to contribute to its further development, fix bugs and tinker with it as they please. This is contrasted with proprietary software, which is distributed as compiled object code or machine code, leaving the source code solely under the control of the individual software vendor.

Is there any OSS Available that Really Works?

Funny you should ask. OSS has received a great deal of media recognition in the past few years, though many of its products have been around for quite some time. Just a few prominent examples include:

Apache - the most popular Web server software in the world.
Sendmail - the mail relay application installed on the majority of mail servers.
BIND (Berkeley Internet Name Domain) - the program used most widely for turning host names into internet protocol (IP) addresses.
Perl - a scripting language interpreter used extensively for Web-based applications.
GNU (GNU's Not Unix) - a project that, since its inception in the mid-1980s, has created a number of tools that are quite popular, particularly among software developers.
OpenOffice - a project through which Sun Microsystems has released the technology for the StarOffice suite.

By far the most widely publicized example of OSS, however, has been Linux, which is a Unix-like operating system (OS). Its kernel (the core portion of an OS) uses no code from proprietary sources, and the complete system draws heavily from code developed by the GNU project mentioned above. Linux was originally created by Linus Torvalds at the University of Helsinki in Finland, but its development since then has benefited from thousands of developers, distributed all over the world. Much of this work has been done by volunteers, though a number of successful businesses have arisen to distribute, develop and provide support for Linux and other OSS. IBM has dedicated $1 billion to its work on Linux this year.

In addition to encouraging others to use OSS, so that it will be easier to preserve their materials, we as archivists could use OSS ourselves in many ways:

management of our web sites,
collection management and online catalogs, using software such as that documented by oss4lib (see below),
manipulation, marking up and parsing of text, such as with Encoded Archival Description (EAD) and Dublin Core,
building the infrastructure for preserving digital materials (see e.g. the LOCKSS project below), and
any other efforts that require us to develop or contract out the development of software that could benefit our entire profession.

Why Should We Care?

There are two major reasons why OSS can support digital preservation efforts:

Having access to the source code for software allows one to see better how it works and change it, if necessary.
The license agreements associated with OSS make it easier to take preservation measures without fear of violating the intellectual property claims of the original developers.

The various layers of technology I described earlier do not stay the same for very long. Changes made in one layer will often raise compatibility issues with other layers. This means that software that has been compiled to run on a specific hardware platform of today will not run on whatever hardware platform is in use 50 years from now, nor is it likely to be able to interact with any number of other necessary software components. If all someone has in 50 years is the machine code, and the older hardware can no longer be used, then we will be unable to read any of the documents that rely on that software. If the original vendor has gone out of business, is uncooperative or simply demands more financial compensation than archivists can afford, then getting that old software to run on the new computers will be extremely difficult.

Most vendors have very little interest in continuing to support their older software indefinitely, though they do have a tendency to legally challenge anyone else who attempts to do so. With federal legislation such as the Digital Millennium Copyright Act (DMCA) and the possibility of numerous states passing the Uniform Computer Information Transactions Act (UCITA), the law would seem to be weighing heavily on the side of those who own the intellectual property rights to software. Regardless of whether one uses emulation (using software to imitate all or part of the original platform) or migration (transferring completely to a new platform), both access to the original source code and freedom from legal challenges can be indispensable.

Of course, OSS is not a panacea. Just as putting hinges on the hoods of cars hasn't turned everyone into an auto mechanic, providing the source code to a piece of software will not suddenly result in every user becoming a software developer. Writing, debugging and maintaining source code is hard, specialized work, and those with such expertise will not work on a project unless they see some value in it. Successful OSS development also requires strong leadership. It won't simply happen on its own.

There are also numerous organizational, economic and technical issues related to digital preservation that won't go away simply because we use OSS. OSS is just one piece in a much bigger puzzle. As a profession committed to promoting future access to culturally significant documentation in all of its forms, however, we should become much more intimately aware of and engaged in OSS activities. I would rather not be the one who has to explain to archivists in 2050 why we didn't.

For Additional Information:

Digital Millennium Copyright Act (DMCA)
http://www.loc.gov/copyright/legislation/dmca.pdf (summary)
http://www4.law.cornell.edu/uscode/17/1201.html (as enacted)

Electronic Recordkeeping Resources, Section on OSS
http://www-personal.si.umich.edu/~calz/ermlinks/oss.htm

Lots of Copies Keep Stuff Safe (LOCKSS)
http://lockss.stanford.edu/

Open Source Systems for Libraries (oss4lib)
http://www.oss4lib.org/

Footnotes:

(1) Programmers can also use interpreted languages, which are translated into machine-readable bits in a different way, often involving an intermediate form called byte code. Yet another option is assembly language, which is quite similar to the final machine language but can still be read and understood by a human programmer who is familiar with the instruction set of a given processor. Writing in assembly language is slow and laborious, so programmers generally only do so in very limited situations.

(2) Bruce Perens, "The Open Source Definition," Version 1.3. [Editorial Note to the Online Version: The wording of the definition has since changed and can be found at http://www.opensource.org/docs/definition_plain.html]