Ideas for free
Structured archival, and the web as it once was
Monday 23 December 2024 at 17:00 CET
Sometimes, I have ideas, and while it’s unlikely I’ll ever pursue them, I can’t stop thinking about them. So I write them down. I hope to get them out of my head, and ideally, into the head of someone else who might benefit from them. If you want to make this, or even just take one small idea from the pile, please do. It’s yours.
A couple of months ago, the Internet Archive was hit by a massive, long-running Distributed Denial of Service (DDoS) attack, taking down the Wayback Machine as well as many of their other services.
While this was a catastrophe for the open web, it also made something very clear to me (and others): we rely far too much on a single institution with too many points of failure. Even if you disagree with the Archive’s policies, it’s hard to dispute that the Wayback Machine is a singular resource that is relied upon by many individuals and organisations, with very little funding and a lot of threats. Taking it offline would mostly serve the few, not the many.
So how can we ensure that an archive is preserved without relying on a single institution, registered in California?
Opt into preservation
Here’s my big idea: make it consensual.
Allow website authors to produce their own archive material: a stream of publications that they deem worth preserving. Many of us already do this: it’s called a feed, and it’s typically represented with RSS or Atom. (Here’s mine.)
Feeds are wonderful, because they’re lightweight. They don’t come with all the bloat that a website offers: there’s no JavaScript, no CSS, and a very small subset of HTML. It’s just content.
So let’s archive the feeds!
Of course, there are downsides. They don’t contain all the content; on some websites, it’s just the first paragraph of the article. There are embedded images and videos, which are separate files linked by URLs in the feed. Sometimes a document relies on presentation or interaction, so stripping the CSS and JavaScript makes it worthless. There’s plenty of reasons why this might not work. I propose that we simply… ignore them. Those are good web pages, but this is not for those web pages. At least, not at first.
There’s a bigger problem: a feed is just the latest posts from a website (typically a blog), not all of them. So here’s my proposal:
A website that opts into preservation does so by providing a full feed of updates to allow for complete ingestion.
In other words, you provide a specially-linked RSS or Atom feed which contains the full history. The feed could, potentially, use pagination, ETags, or another mechanism to only provide the relevant information, so that we can avoid sending all data every time.
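As a sketch of how an archiver might walk such a feed: the `prev-archive` link relation comes from RFC 5005 (Feed Paging and Archiving), but the URLs and entry IDs below are invented for illustration.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def parse_page(xml_text):
    """Parse one feed page; return (entry IDs, URL of the previous archive page, if any)."""
    root = ET.fromstring(xml_text)
    ids = [e.findtext(ATOM + "id") for e in root.findall(ATOM + "entry")]
    prev = None
    for link in root.findall(ATOM + "link"):
        if link.get("rel") == "prev-archive":
            prev = link.get("href")
    return ids, prev

# A two-entry sample page that points back at an older archive page.
# An archiver would fetch each prev-archive URL in turn until there are none left.
sample = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="prev-archive" href="https://example.com/feed/archive/2023"/>
  <entry><id>tag:example.com,2024:post-2</id></entry>
  <entry><id>tag:example.com,2024:post-1</id></entry>
</feed>"""

ids, prev = parse_page(sample)
```

ETags would then let a polite client skip pages it has already seen, but the walk itself is this simple.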
If you make an update or delete a document, that hits the feed too, with the correct timestamp (though you shouldn’t need to provide old versions of a document; it’s enough to have one record per document as long as IDs are stable and creation and update timestamps are accurate).
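For deletions, one mechanism already exists: the Atom Tombstones extension (RFC 6721). An edited entry just carries accurate `published` and `updated` timestamps, and a deletion becomes a tombstone. The IDs and dates here are invented:

```xml
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:at="http://purl.org/atompub/tombstones/1.0">
  <entry>
    <id>tag:example.com,2024:post-7</id>
    <published>2024-03-01T09:00:00Z</published>
    <updated>2024-06-15T12:30:00Z</updated> <!-- edited since publication -->
  </entry>
  <at:deleted-entry ref="tag:example.com,2024:post-3"
                    when="2024-05-02T08:00:00Z"/>
</feed>
```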
While we’re at it, let’s invent a new format
And while RSS or Atom would both probably do the job, there’s another solution out there: ActivityPub.
I must confess, I am not an expert, but I have read the specification. As far as I can tell, it provides an interface for an “outbox” of documents, which can be paginated, aware of permissions, incorporates edits/updates and deletions, and lots more. In addition, we might want to subscribe and ask for changes to be delivered, not just request documents on demand, which ActivityPub handles well through its “follow” activity.
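For a flavour of what that looks like on the wire: an outbox page, per the ActivityStreams vocabulary that ActivityPub builds on, is roughly this shape (the URLs are invented, and a real page would carry full objects rather than bare references):

```json
{
  "@context": "https://www.w3.org/ns/activitystreams",
  "id": "https://example.com/outbox?page=1",
  "type": "OrderedCollectionPage",
  "partOf": "https://example.com/outbox",
  "next": "https://example.com/outbox?page=2",
  "orderedItems": [
    { "type": "Update", "object": "https://example.com/posts/42" },
    { "type": "Delete", "object": "https://example.com/posts/17" }
  ]
}
```

Pagination, edits, and deletions all fall out of the vocabulary for free, which is exactly what an archiver wants.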
There are performance concerns around ActivityPub, which seem to be mostly around subscriptions and delivery. Fortunately, an ActivityPub-compliant server does not need to support following/subscriptions; it’s optional, so it’s up to the implementor how complex they’d like to get.
ActivityPub also gives every document a unique ID, which we could use to jump from the HTML version to the lightweight version via a <link> tag or similar.
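Mastodon and friends already do this discovery dance; a page can point at its ActivityPub representation with something like the following (the URL is, of course, made up):

```html
<link rel="alternate"
      type="application/activity+json"
      href="https://example.com/posts/42">
```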
Honestly, it may even be reasonable for clients to support both when feasible. RSS and Atom are simpler and, if we ignore pagination, don’t require a web server at all; ActivityPub does, but handles scaling and other functionality much more gracefully. Different servers might choose different trade-offs.
And let’s be honest, we’re not going to prototype with ActivityPub.
Preservation as a social contract
Imagine, if you will, a small change to your favourite feed reader of choice. (If you don’t have a favourite feed reader, it’s OK, write to me and we can discuss it at length.)
Here’s the change: if the website has opted into preservation, it remembers everything, and stores it forever (ish; hard disks aren’t infinite).
That’s it. If you subscribe, you promise to do a decent job at storing documents. From the perspective of the server owner: in return for a content feed, people preserve your data. Everyone preserves a subset of everyone else’s.
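The reader-side change really is small. A minimal sketch, assuming stable entry IDs and ISO 8601 timestamps (which compare correctly as strings), and keeping only the newest copy of each document:

```python
import sqlite3

def open_archive(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS documents (
        feed_url TEXT, entry_id TEXT, updated TEXT, content TEXT,
        PRIMARY KEY (feed_url, entry_id))""")
    return db

def preserve(db, feed_url, entry_id, updated, content):
    """Store a document forever(ish); a newer update replaces the stored copy."""
    db.execute("""INSERT INTO documents VALUES (?, ?, ?, ?)
        ON CONFLICT(feed_url, entry_id) DO UPDATE SET
            updated = excluded.updated, content = excluded.content
        WHERE excluded.updated > documents.updated""",
        (feed_url, entry_id, updated, content))

db = open_archive()
preserve(db, "https://example.com/feed", "post-1", "2024-01-01T00:00:00Z", "v1")
preserve(db, "https://example.com/feed", "post-1", "2024-06-01T00:00:00Z", "v2")
```

A feed reader already has all of this data passing through it; the only new behaviour is not throwing it away.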
And then, we wait.
Sharing
In any of the aforementioned feed formats, documents are uniquely identifiable when combined with the website URL.
So it’d make sense that I could publish any of the archives I’ve preserved in a way that lets you ask for a specific document and get it, in exactly the same way that on the Fediverse (powered by ActivityPub, y’know), I can ask one server for a post written by a user on another server, and if they have a copy, I’ll get it. It may not be the most up-to-date variant, but it’ll be a good effort.
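The retrieval side is a best-effort scavenger hunt. A toy sketch, with mirrors modelled as plain mappings from document ID to a `(updated, content)` pair (everything here is invented for illustration):

```python
def fetch_document(doc_id, mirrors):
    """Ask each mirror in turn for a preserved copy; return the freshest one found."""
    best = None
    for mirror in mirrors:
        copy = mirror.get(doc_id)  # each mirror maps doc_id -> (updated, content)
        if copy and (best is None or copy[0] > best[0]):
            best = copy
    return best

mirrors = [
    {"https://example.com/posts/1": ("2024-01-01", "old copy")},
    {"https://example.com/posts/1": ("2024-06-01", "fresher copy")},
]
```

No mirror is authoritative; you just take the newest copy anyone will hand you.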
Now, given that these are lightweight documents, why wouldn’t I? These are feeds I read, so I’ve already vetted them to make sure there’s nothing catastrophically terrible in there, anyway. We can each provide this public service, knowing that it’s mutual; everyone else is hosting our blogs too, just in case.
At this point, we have a distributed archive. It’s cheap to run, tolerant against hostile behaviour such as DMCA takedowns, and relatively easy to find.
It’s important to note the limitations. Even if this archive became mainstream, it would never cover everything the Wayback Machine indexes, but that’s alright. It would ease the load, especially on retrieval. It promises to be a more reliable, faster archive than the Wayback Machine, and potentially allows the Internet Archive to focus their resources on the areas where they can do the most good.
Remember the Beaker Browser?
This is a long shot, but I can’t go a week without thinking about what life might have been like if Beaker took off, so let me tell you about it.
The Beaker Browser was a browser (built on Chromium, but whatever) that could browse web pages served up by Hypercore (though at the time, it was called “the DAT protocol”). The idea was that you’d publish your website using a Git-like model, committing new versions, which would then be distributed via a BitTorrent-like protocol. If others are sharing your website (and, as we’ve agreed above, they are), clients would get it faster and at less cost to you. Edge routing, done right.
This is the “I am Spartacus” school of website delivery.
Now, the main difference between Hypercore/DAT and BitTorrent is that in BitTorrent, file trees are identified by their hash, and so if the file changes, its ID changes too. Hypercore identifies file trees by a public key, so their ID is stable across modifications, and because everything is signed by the corresponding public key, only the owner of the private key can change things.
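The distinction is easy to demonstrate. In the sketch below, HMAC with a shared secret stands in for Hypercore’s real ed25519 signatures (the key and data are invented); the point is just that a content-hash identity changes on every edit, while a key-based identity stays put and lets readers verify each new version:

```python
import hashlib, hmac

# BitTorrent-style addressing: the identifier IS the content hash,
# so editing a file gives it a brand-new identity.
def content_id(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Hypercore-style addressing: the identifier is a stable (public) key;
# each new version is signed, and readers verify against that same key.
# (HMAC with a shared secret stands in for real public-key signatures here.)
SITE_KEY = b"stable-site-key"

def sign(data: bytes) -> str:
    return hmac.new(SITE_KEY, data, hashlib.sha256).hexdigest()

def verify(data: bytes, sig: str) -> bool:
    return hmac.compare_digest(sign(data), sig)

v1, v2 = b"hello", b"hello, edited"
```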
I realise this is very much in the realm of the imagination, but: imagine if we agreed that we would disseminate feeds via something like Hypercore. It’d be, for lack of a better phrase, fucking delicious.
What about the plagiarism machines?
Forget about the plagiarism machines.
Seriously. You might be concerned that this makes life even easier for them, but sucking up content is not a problem they have. It’s costly for the Internet Archive to vacuum up the entire Internet because they are operating on three tins of baked beans and a laptop with a potato for a battery. Venture capitalist-funded companies and trillion-dollar behemoths do not have this problem. They can simply read all the HTML and run all the JavaScript, and they already do.
They are the enemy, but they are just parasites, in the end.
Meet the new web, same as the old web
Y’know, I’ve been thinking. Imagine we have a bunch of useful documents. They’re all over the place. They can be accessed from anywhere. They have unique identifiers.
Maybe we’d just browse those, instead of this newfangled thing that tries to download 6.5 MB of JavaScript and ads every time you scroll the page?
I’m not claiming this’d replace the Web. Yet. But it might, maybe, possibly, be a way for those of us who want the web of 2004 back to get a little closer.
Making it look pretty
Oh, go on, let’s talk about CSS.
I love pretty websites. Even more importantly, I adore gorgeous typefaces. (I spent hours choosing the fonts for this website, and I consider it time well spent.) I’d be sad if I never saw your website’s fancy styles.
Maybe the feed readers would let us have a little CSS, as a treat, similar to how email clients do. Images too, while we’re at it.
Of course, you should always be able to turn it off (or leave it off by default). We’re trying to reclaim the web via subterfuge, not reproduce the mess that it’s turned into.
Back in the (very) old days, web pages all looked the same. And, rightly, browser makers competed on who could enable web developers to produce prettier websites. So they became prettier. But once the transfer of power was complete, the sites became uglier, and then uglier still. (When was the last time you used a website without having to hit a little “X” to close a popup, even when using an ad blocker?)
The consumers of the web can no longer control anything, and all power rests with the producers. User styles have disappeared. Browsers make it harder and harder to disable JavaScript. Many sites, when read with a non-mainstream browser, such as a terminal browser or a screen reader, become completely unusable.
Let’s not make that mistake twice. Content is important, and we shall share it and spread it. Presentation is nice to have, and if it becomes useless or harmful, we’ll leave it behind.
This idea is free as in birds
If you like this idea, it’s yours. While I’d be happy to discuss it with you (and please get in touch!), and there’s even a non-zero chance I might get on board, it’s more likely I’ll wish you the best of luck, help a little when I can, and tell everyone I know how wonderful you are.
You can read more ridiculous ideas by browsing the series:
- Starting from scratch
- Structured archival, and the web as it once was
- Search is broken
A big thanks to Irene Knapp (irenes) and danny mcClanahan for reviewing this, providing valuable feedback, and contributing many ideas on top of my own.
If you enjoyed this post, you can subscribe to this blog using Atom.
Maybe you have something to say. You can email me or toot at me. I love feedback. I also love gigantic compliments, so please send those too.
Please feel free to share this on any and all good social networks.
This article is licensed under the Creative Commons Attribution 4.0 International Public License (CC-BY-4.0).