The unpredictable mobility of Internet resources is an inconvenience at best. For librarians, it is a serious problem which compromises their service to patrons and imposes an unacceptably large burden on catalog maintenance.
Introduction to PURLS
The State of the Web
Sadly, the Web is not quite ready to replace books as an information source. Unlike books, web pages tend to move or disappear over time. Web pages can change their content as well: a URI which once pointed to the ultimate source of gardening tips might one day refer instead to a homepage for llama breeders. Because information on the Web can change, move or vanish without warning, it cannot be relied upon like a reference book.
Writers since the seventeenth century have extolled the invention of the printing press as the triumph of knowledge and democracy in the earth. The power of the press moved nations to revolution, preserved mankind's most lofty aspirations, and lifted Europe from the ignorance and misery of its Dark Ages. Impressive as this is, the invention of the Internet portends an even greater revolution.
The World Wide Web most perfectly embodies the promise of the Internet. With a small collection of protocols and some gentle standards, information has become accessible to everyone who has access to a computer. No one need miss out. Offering such incredible possibilities, it's no surprise that the Web makes people like Al Gore so giddy that they wish they'd invented it.
A Decade of Challenges
The Web is nearly a decade old. The next decade will decide what becomes of it. Will it continue to explode in popularity, so that future generations will ask, "What was television?" Will it consume our lives until white-collar workers never leave their apartments except on weekends? Or will it vanish back into the academic world that nurtured it for its first few years?
Several shortcomings will need to be addressed before the Web supplants books in our lives. Various groups are already addressing these problems, so we will certainly know the answer well within the next decade.
404 - Resource not Found
Consider the following common experience. You find a web site filled with just the information you need to shine at your job (or excel at your hobby). For months, you consult the site like a personal mentor. You visit it several times each week. You print the most useful pages. Half of the site is bookmarked in your browser of choice.
One fine May morning, you surf confidently to that Site of Sites. To your horror, you see that dreaded message in your browser: 404 - Resource not Found. What happened? Nothing much. The author of your indispensable resource just graduated, and the university purged his web page. Will it appear somewhere else? Perhaps at his new job? Who knows?
Books don't suddenly vanish, the way web pages do. In fact, the US government (almost) guarantees it! Every book which carries a US copyright can be found in the Library of Congress. So breathe easy; if some indispensable book goes out of print, it can still be consulted for the cost of a flight to Washington, DC.
If web resources are going to supplant books in our lives, then a mechanism must be found to ensure that they don't simply vanish. Perhaps one day the Library of Congress will expand its mandate to include online resources. Meanwhile, the Web community is managing the problem through redundancy.
If many people agree that a work is important, they will preserve it by "mirroring" it on their own Web sites. Web authors should encourage this practice by granting generous permission to copy their work. They should also make useful content easy to download by collecting it into a single archive file.
301 - Resource Moved Permanently
Parallel to the problem of disappearing documents is the problem of moving documents. As in the example above, a Web site may follow its author from school to work, and thence to another job, and so on. With such highly mobile web resources, users can suffer great frustration just locating a site that they know to exist.
On the other hand, as people mirror one another's documents, the opposite problem can arise. Users may perform a query on a search engine and see three hundred copies of the same document. If that document does not interest the user, a terribly frustrating search may be required to locate some document that is not a copy of an irrelevant one.
The original intent of the URI was to provide a Uniform (or Universal) Resource Identifier. A URI should identify a document, without regard to its current location, and without regard to which of many mirrors is consulted at a particular time. If this ideal were realized, then changing the physical location of a resource would not affect the users of that resource at all.
In reality, however, a URI serves the opposite of its intended purpose. Instead of identifying a resource, a URI identifies a particular location on the web. When a resource moves, it comes to be identified by a new URI which bears no relation to its old one. Before people can place confidence in the web as a source of information, they must be assured that the information will not vanish, or engage them in a perpetual hide-and-seek.
Today partial solutions exist for the first problem, that of moving documents. Companies like pobox.com offer a forwarding service, so that a web site may be permanently accessible through a specific URI. Some organizations, such as the Online Computer Library Center, offer a similar service free of charge. Such services insulate users of the Web from the bother of chasing mobile web sites. Site authors, however, must decide for themselves that such stability is a priority, and must assume the effort and cost of securing it.
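The forwarding idea can be sketched in a few lines of Python. The table, the stable path and both target addresses below are invented for illustration, and nothing here describes how pobox.com or the Online Computer Library Center actually implement their services. The sketch answers with a 302 redirect, on the assumption that readers should keep bookmarking the stable URI rather than the current location.

```python
# Hypothetical forwarding table: stable path -> current location.
# The path and the target URLs are made up for this example.
FORWARDS = {
    "/net/gardening-tips": "http://www.example.edu/~student/garden/",
}


def resolve(stable_path):
    """Return an (HTTP status, headers) pair for a stable path."""
    target = FORWARDS.get(stable_path)
    if target is None:
        # The service has never heard of this name.
        return 404, {}
    # Send the browser on to wherever the resource lives today.
    return 302, {"Location": target}


def move(stable_path, new_target):
    """Record that a resource has moved; the stable path is unchanged."""
    FORWARDS[stable_path] = new_target
```

When the site follows its author to a new job, only the table entry changes. Every bookmark that points at the stable path keeps working.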
The second problem, that of duplicated documents, can be addressed by authors, but only in a clunky way. Just as electronic mail messages can (theoretically) be uniquely identified by their message ID, authors can uniquely identify their documents by creating a document ID. When a document is mirrored on many sites, it could still be identified by its document ID. Derived works would be given a new document ID, and could identify their parent work by its document ID. If such a system came into widespread use, then search engines could use document IDs to group mirrored copies of the same document.
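A search engine's side of this bargain might look something like the following sketch. It assumes that each search hit already arrives as a (URL, document ID) pair; how the engine would extract the ID from a page is left open, and the function name is hypothetical.

```python
from collections import defaultdict


def group_by_document_id(hits):
    """Collapse search hits that share a document ID into one group.

    `hits` is an iterable of (url, document_id) pairs. The result maps
    each document ID to the list of URLs that carry it, in input order.
    """
    groups = defaultdict(list)
    for url, document_id in hits:
        groups[document_id].append(url)
    return dict(groups)
```

Three hundred mirrors of one document would then appear as a single entry with three hundred addresses, rather than as three hundred separate results.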
A document ID can be any string which uniquely identifies a document. For example, a 128-bit truly random number would suffice for all practical purposes. Two more systematic methods also exist for creating unique document IDs. One is to compute a cryptographically strong hash based on the original content of the document. The unique ID then becomes that hash value. The other is to identify a document using its creation date, author's name, and title.
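Each of these approaches fits in a line or two of Python. The sketch below is only an illustration: the function names are hypothetical, SHA-256 is one assumed choice of cryptographically strong hash, and the "|" separator in the third function is an arbitrary convention.

```python
import hashlib
import secrets


def random_id():
    """A 128-bit random ID, written as 32 hexadecimal digits."""
    return secrets.token_hex(16)  # 16 bytes = 128 bits


def content_id(original_text):
    """An ID derived from the original content of the document.

    SHA-256 is an assumption of this sketch; any strong hash would do.
    """
    return hashlib.sha256(original_text.encode("utf-8")).hexdigest()


def metadata_id(created, author, title):
    """An ID built from creation date, author's name, and title."""
    return "|".join((created, author, title))
```

The content hash must be computed once, from the original text, and then carried along unchanged. If it were recomputed after every revision, the document would acquire a new ID each time it was edited.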
Note that plagiarists can easily defeat document identification by copying a document, and changing its ID. We are concerned with identification rather than authentication in this essay.
Summary
Web technology still has some way to go before it is stable enough to replace the public library. Most of the challenges discussed in this essay are already being addressed by various projects around the Internet. Until those projects start to bear fruit, web authors must maintain a long-range perspective when designing and publishing their efforts, if they wish to create works of enduring value.
