Spark

Something's rotten in the state of the internet, and archivists are worried

How do we preserve 'the record' when web pages get taken down, reassigned, or simply disappear?

So-called link rot and content drift mean important online content is getting lost to time


You may not know it by name, but you've probably experienced "link rot."

It's when you click on a link in, say, a newspaper article and get a '404' error, meaning the page you're trying to reach no longer exists. Or the URL still works but now leads to a different page entirely.

But what's just a minor inconvenience to most of us presents an immense challenge for archivists who are trying to preserve the record of life in the digital age.

Time was, print material was properly filed and made accessible to future historians and researchers. But what happens when those documents exist only online, and they're disappearing?

"It is, unfortunately, quite pervasive," said Clare Stanton, part of a team at Harvard University's Library Innovation Lab. She co-authored a new report called "The Paper of Record Meets an Ephemeral Web," which examined more than two million web links in the New York Times, dating back to 1996, "which is a lot of content," she told Spark host Nora Young.

Clare Stanton recently co-authored a paper about the ephemerality of the web. (Drew Silva)

The researchers split the links into two types: "shallow links," which point to the home page of a site, and "deep links," which point to a specific page nested within a site.
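For readers who want to see the distinction concretely, here is a minimal sketch in Python. The article does not say exactly how the report drew the line, so the rule used here (any path or query string beyond the site root counts as "deep") is an assumption, and `classify_link` and the example.com addresses are made up for illustration.

```python
from urllib.parse import urlparse


def classify_link(url: str) -> str:
    """Label a URL 'shallow' (site home page) or 'deep' (anything nested below it).

    Assumed rule, not taken from the report: a link is deep if it has a
    non-empty path or a query string.
    """
    parsed = urlparse(url)
    # A path of "" or "/" means the link points at the site root.
    has_path = parsed.path.strip("/") != ""
    has_query = parsed.query != ""
    return "deep" if has_path or has_query else "shallow"


# Hypothetical example addresses.
links = [
    "https://www.example.com/",
    "https://www.example.com/reports/2021/link-rot.html",
]
for link in links:
    print(link, "->", classify_link(link))
```

Under this rule, a link to a specific article or report is deep, while a link to a publication's front page is shallow.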

And unfortunately, the news isn't good, if you'll pardon the pun.

"We found that 25% of all links that were used on the New York Times were completely inaccessible," she said. It gets worse over time, she added: articles from 1996 are more likely to contain rotten links than more recent ones.

Counterintuitively, government and university websites were among the least reliable, Stanton said. She believes that is because, although the web addresses themselves may not change, those sites get reorganized and the content of their pages changes.

"Oftentimes, the entire website structure will be redone. There are different sections of Joe Biden's White House website than there were for Donald Trump's White House website," she said.

On private pages, link rot and content drift are also an issue, likely because most web page creators aren't thinking about archives when they decide what to keep on their site, Stanton added.

"They're thinking about, 'what's the best way for someone to navigate my website', and if people are constantly improving how they have their homepage put together so that you can find the latest COVID-19 statistics, for example, and there's going to inevitably be updates. And that's a good thing for the internet. But it's not necessarily a good thing for an article that wants to point to something particular."

It's also true that website managers take pages down when they're no longer useful for their original purpose, or just stop paying for the domain name, she said. "But it also means that the memory institutions that traditionally have held together the historical record, like libraries, and archives, and museums don't have as much access or control or serendipity and finding things being leftover. There's not a trunk of newspapers somewhere or books that can't be lost to time."

So what's the solution? Some newspapers, like the New York Times, do a good job of preserving their content with an internal archive. But many smaller institutions don't have the resources to do digital preservation of their online content, she said.

There are projects like the Internet Archive that try to preserve web pages for future historians and archives. Stanton's own team has a project called Perma.cc that aims to create web archives for the purposes of citation. But it's a challenge, she acknowledged, because of the distributed nature of the internet.

"If everything exists exclusively on the web, and there's no effort or ability to preserve that in a real way, we're going to lose that type of knowledge about ourselves from the past."


Written and produced by Adam Killick.
