Archiving websites

Where I discover the WARC format to archive websites.

Created: by Pradeep Gowda Updated: Jun 14, 2024 Tagged: web

On HN, I read about the WARC format.

WARC, standardized as ISO 28500:2009, Information and documentation – WARC file format. Developed under the auspices of the International Internet Preservation Consortium. WARC was developed as an extension to ARC in part to provide better capabilities for managing Web archives for the long term, allowing for capture of more metadata about the circumstances of archiving. WARC files are often compressed using gzip, resulting in a .warc.gz extension.
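To get a feel for what such a file contains, here is a minimal sketch using only the Python standard library. It assumes the WARC/1.0 record layout as I understand it: a version line, named header fields, a blank line, the content block, and two trailing CRLFs. The helper names `build_warc_record` and `parse_warc_record` and the `https://example.org/` URL are made up for illustration. Real archiving tools handle many more record types and edge cases.

```python
import gzip
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri, payload, content_type):
    """Serialize one WARC/1.0 'resource' record as bytes (hypothetical helper)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # The content block is followed by two CRLFs that end the record.
    return head + payload + b"\r\n\r\n"


def parse_warc_record(raw):
    """Split one uncompressed record into (header fields, content block)."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    fields = {}
    for line in lines[1:]:  # skip the "WARC/1.0" version line
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    length = int(fields["Content-Length"])
    return fields, rest[:length]


record = build_warc_record("https://example.org/", b"<h1>Hello</h1>", "text/html")

# Compressing the record with gzip gives the bytes one would store in a .warc.gz file.
compressed = gzip.compress(record)

fields, body = parse_warc_record(gzip.decompress(compressed))
print(fields["WARC-Type"], fields["WARC-Target-URI"], body)
```

The sketch writes a single record; an actual archive would concatenate many records (requests, responses, metadata) in one file.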

I have always thought of publishing my website as an archive for posterity, but wondered what would be a good way to achieve that. The questions I have are about where I can store this archive: whom can I send it to, and why would they consider storing it? Maybe I should make a contribution to help them with costs, etc.

Looking into the links on that Wikipedia page seems like a good start:

Both of the above tools (in Python) are good candidates to be rewritten in something like D or Rust as a nice programming exercise!


What’s the best way to build a website as an archive or library?

Creating a Safari webarchive from the command line – alexwlchan

WARC-GPT: An Open-Source Tool for Exploring Web Archives Using AI | Library Innovation Lab