"A lot of people believe that if it's on the Web it will stay on the Web. Chances are that it won't" (Jill Lepore)
This page is far from the only one that isn't available!
"The average life of a Web page is about a hundred days. Strelkov's 'We just downed a plane' post lasted barely two hours. It might seem, and it often feels, as though stuff on the Web lasts forever, for better and frequently for worse. . . . No one believes any longer, if anyone ever did, that 'if it's on the Web it must be true,' but a lot of people do believe that if it's on the Web it will stay on the Web. Chances are, though, that it actually won't."
-- from Jill Lepore's "The Cobweb," in the Jan. 26 New Yorker
We'll come back to "Strelkov's 'We just downed a plane' post" in a moment, because it's a nifty story in its own right, and it introduces those of us who haven't yet had a proper introduction to the Wayback (well, more properly WABAC) machine housed at the Internet Archive in a converted church in San Francisco.
First, though, I should explain that Harvard historian Jill Lepore, in her January 26 New Yorker "Annals of Technology" piece, "The Cobweb," has much larger ambitions than the limited one I've set for us in this post. Jill sets out to answer the question posed in the piece's subtitle: "Can the Internet be archived?"
I think a lot of readers will be interested in the story Jill has to tell about people who are attempting to figure out just how to do that -- i.e., to archive the Internet -- starting with the creator of the storage system that he dubbed the Wayback Machine, Brewster Kahle, the founder of the Internet Archive. (I guess it's properly spelled WABAC, but Kahle is upfront about the device being named for the wondrous machine with which the cartoon Mr. Peabody used to attempt to educate his boy Sherman.)
For that part of the story, however, you'll need to consult the piece itself. What concerns us here is that, as Jill indicates in the passage I've plunked atop this post, a lot of people don't realize just how transitory the Internet is.
In fact, I suspect that a lot of people, not having given it much thought, think that the Internet is itself an archive -- and depend on it, again without giving it much thought, as such, even though most of us are thoroughly used to encountering error messages like the Facebook one I've also placed at the top of this post. As Jill says, "It might seem, and it often feels, as though stuff on the Web lasts forever, for better and frequently for worse," and she provides examples. How often do the media regale us with stories of things that people wish desperately to make disappear from the Web? And usually the upshot is that you can't ever get rid of it, no matter how much you might wish you could.
The reality, though, is, Jill says, "The Web dwells in a never-ending present. It is -- elementally -- ethereal, ephemeral, unstable, and unreliable." With regard to the belief "that if it's on the Web it will stay on the Web," she counters, as we've already read, "Chances are, though, that it actually won't." And she provides some charming for-instances.
In 2006, David Cameron gave a speech in which he said that Google was democratizing the world, because “making more information available to more people” was providing “the power for anyone to hold to account those who in the past might have had a monopoly of power.” Seven years later, Britain’s Conservative Party scrubbed from its Web site ten years’ worth of Tory speeches, including that one.
Last year, BuzzFeed deleted more than four thousand of its staff writers’ early posts, apparently because, as time passed, they looked stupider and stupider. Social media, public records, junk: in the end, everything goes.
Which is in fact disastrous in many walks of life we may not normally stop to consider.
WHICH IS "A DISASTER" IN MANY AREAS
To begin with, it's not just via deliberate deletion that Web pages become unfindable. Jill is quick to point out: "Web pages don’t have to be deliberately deleted to disappear."
Sites hosted by corporations tend to die with their hosts. When MySpace, GeoCities, and Friendster were reconfigured or sold, millions of accounts vanished. (Some of those companies may have notified users, but Jason Scott, who started an outfit called Archive Team—its motto is “We are going to rescue your shit”—says that such notification is usually purely notional: “They were sending e-mail to dead e-mail addresses, saying, ‘Hello, Arthur Dent, your house is going to be crushed.’ ”) Facebook has been around for only a decade; it won’t be around forever. Twitter is a rare case: it has arranged to archive all of its tweets at the Library of Congress. In 2010, after the announcement, Andy Borowitz tweeted, “Library of Congress to acquire entire Twitter archive—will rename itself Museum of Crap.” Not long after that, Borowitz abandoned that Twitter account. You might, one day, be able to find his old tweets at the Library of Congress, but not anytime soon: the Twitter Archive is not yet open for research. Meanwhile, on the Web, if you click on a link to Borowitz’s tweet about the Museum of Crap, you get this message: “Sorry, that page doesn’t exist!”
For consequences that most of us probably haven't thought about, consider pretty much the whole of our judicial system.
The Web dwells in a never-ending present. It is -- elementally -- ethereal, ephemeral, unstable, and unreliable. Sometimes when you try to visit a Web page what you see is an error message: “Page Not Found.” This is known as “link rot,” and it’s a drag, but it’s better than the alternative. More often, you see an updated Web page; most likely the original has been overwritten. (To overwrite, in computing, means to destroy old data by storing new data in their place; overwriting is an artifact of an era when computer storage was very expensive.) Or maybe the page has been moved and something else is where it used to be. This is known as “content drift,” and it’s more pernicious than an error message, because it’s impossible to tell that what you’re seeing isn’t what you went to look for: the overwriting, erasure, or moving of the original is invisible.
For the law and for the courts, link rot and content drift, which are collectively known as “reference rot,” have been disastrous. In providing evidence, legal scholars, lawyers, and judges often cite Web pages in their footnotes; they expect that evidence to remain where they found it as their proof, the way that evidence on paper—in court records and books and law journals—remains where they found it, in libraries and courthouses. But a 2013 survey of law- and policy-related publications found that, at the end of six years, nearly fifty per cent of the URLs cited in those publications no longer worked. According to a 2014 study conducted at Harvard Law School, “more than 70% of the URLs within the Harvard Law Review and other journals, and 50% of the URLs within United States Supreme Court opinions, do not link to the originally cited information.” The overwriting, drifting, and rotting of the Web is no less catastrophic for engineers, scientists, and doctors. Last month, a team of digital library researchers based at Los Alamos National Laboratory reported the results of an exacting study of three and a half million scholarly articles published in science, technology, and medical journals between 1997 and 2012: one in five links provided in the notes suffers from reference rot. It’s like trying to stand on quicksand.
WHICH BRINGS US BACK TO WHERE WE STARTED
Specifically, to "Strelkov's 'We just downed a plane' post" -- a ripping good story I said we'd come back to, ripped out of the headlines, as it were:
Malaysia Airlines Flight 17 took off from Amsterdam at 10:31 A.M. G.M.T. on July 17, 2014, for a twelve-hour flight to Kuala Lumpur. Not much more than three hours later, the plane, a Boeing 777, crashed in a field outside Donetsk, Ukraine. All two hundred and ninety-eight people on board were killed. The plane’s last radio contact was at 1:20 P.M. G.M.T. At 2:50 P.M. G.M.T., Igor Girkin, a Ukrainian separatist leader also known as Strelkov, or someone acting on his behalf, posted a message on VKontakte, a Russian social-media site: “We just downed a plane, an AN-26.” (An Antonov 26 is a Soviet-built military cargo plane.) The post includes links to video of the wreckage of a plane; it appears to be a Boeing 777.
Later Jill tells us:
Two weeks before the crash, Anatol Shmelev, the curator of the Russia and Eurasia collection at the Hoover Institution, at Stanford, had submitted to the Internet Archive, a nonprofit library in California, a list of Ukrainian and Russian Web sites and blogs that ought to be recorded as part of the archive’s Ukraine Conflict collection. Shmelev is one of about a thousand librarians and archivists around the world who identify possible acquisitions for the Internet Archive’s subject collections, which are stored in its Wayback Machine, in San Francisco. Strelkov’s VKontakte page was on Shmelev’s list. “Strelkov is the field commander in Slaviansk and one of the most important figures in the conflict,” Shmelev had written in an e-mail to the Internet Archive on July 1st, and his page “deserves to be recorded twice a day.”
On July 17th, at 3:22 P.M. G.M.T., the Wayback Machine saved a screenshot of Strelkov’s VKontakte post about downing a plane. Two hours and twenty-two minutes later, Arthur Bright, the Europe editor of the Christian Science Monitor, tweeted a picture of the screenshot, along with the message “Grab of Donetsk militant Strelkov’s claim of downing what appears to have been MH17.” By then, Strelkov’s VKontakte page had already been edited: the claim about shooting down a plane was deleted. The only real evidence of the original claim lies in the Wayback Machine.
The day after Strelkov’s “We just downed a plane” post was deposited into the Wayback Machine, Samantha Power, the U.S. Ambassador to the United Nations, told the U.N. Security Council, in New York, that Ukrainian separatist leaders had “boasted on social media about shooting down a plane, but later deleted these messages.” In San Francisco, the people who run the Wayback Machine posted on the Internet Archive’s Facebook page, “Here’s why we exist.”
IT'S POSSIBLE THAT PAGE PRESERVATION
COULD HAVE BEEN DESIGNED INTO HTTP
English computer scientist Tim Berners-Lee, the father of the "hypertext transfer protocol" created "to link pages on what he called the World Wide Web," says it was considered. Partly it didn't happen, Jill says, because of "the preference for the most up-to-date information: a bias against obsolescence."
But the chief reason was the premium placed on ease of use. “We were so young then, and the Web was so young,” Berners-Lee told me. “I was trying to get it to go. Preservation was not a priority. But we’re getting older now.”
And Berners-Lee isn't alone in his concern. Vint Cerf, another developer who was in at the beginning, and is now Google's "Chief Internet Evangelist," e-mailed Jill: "I worry that the twenty-first century will become an informational black hole."
As she documents, there are people all over the world, in addition to Internet Archive's Kahle, tackling the problem of archiving the Internet. But there is so much information, of so many different kinds, coming from so many different sources, that the complexity of the problem is humongous. Just consider the difference between copyrighted and non-copyrighted material, which dictates entirely different ways of handling the stuff. (Jill describes copyright as "the elephant in the archive.") Or consider that practically every country on the planet has different laws "relating to legal deposit, copyright, and privacy."
National libraries are one place where a lot of archiving is being done, but "they collect chiefly what's in their own domains." And certainly no one else is attempting anything on the scale of the Internet Archive, whose WABAC machine (the name, we're told, was designed to sound sort of like legendary early computers such as UNIVAC, but really was borrowed from the Wayback Machine in which that most erudite of cartoon dogs, Mr. Peabody, takes "his boy Sherman" on doggedly educational time trips) captures picture after picture after picture of a growing number of websites around the world. "More than 30 billion Web pages" is the count Jill offers for what IA has archived. (Which of course creates the obvious problem: Once you've preserved all that, er, stuff, how do you find anything in it?)
The original Wayback Machine -- with the
inventor, Mr. Peabody, and his boy Sherman
A lot of people are working on the problem, and you can read about more of them in the article. You may be buoyed to know that a solution, in the form of what Jill describes as "an excellent patch," has been devised for the vanishing-footnote problem: a collaboratively supported thing called Perma.cc, which scholars writing papers can use to create links that really will be permanent (maybe we should say "permanent-ish"). "Perma.cc has already been adopted by law reviews and state courts," Jill tells us, and "it’s only a matter of time before it’s universally adopted as the standard in legal, scientific, and scholarly citation."
Well, that's something. But how many of us already can't access files of our own that we created as recently as a couple of years ago? Perhaps they're parked in a storage medium we no longer have access to, or perhaps they're trapped in antique software.
Considering the frantic pace at which new content is being spewed onto the Web, it's not hard to imagine whole new worlds of missing-document pain.