Don't archive everything* right away

May 25, 2023

In blog posts and courses on Open Source Intelligence (OSINT), there is often a section conveying the importance of archiving everything you come across. If something related to your research is later deleted or altered, it's good to have a backup. Probably the most well-known and widely circulated website archiving tool is the Internet Archive's WayBack Machine.

The WayBack Machine is a very common tool for investigative journalists. Here's a blog post from the Global Investigative Journalism Network discussing its use: https://gijn.org/2021/05/05/tips-for-using-the-internet-archives-wayback-machine-in-your-next-investigation/

https://archive.org/web/

Something I haven't seen mentioned before is that using archiving tools such as the WayBack Machine can inadvertently alert a website owner that they are potentially being observed.

This could pose a problem for ongoing investigations.

For example, imagine you are doing research on an extremist group and come across a user's personal website in a Telegram channel. The initial instinct might be to save the current state of the page with the WayBack Machine. However, a technically proficient user may be monitoring the access logs for their website.

In the above screenshot, I am looking at the access logs for a website that I control. With no additional tooling other than the nginx (a popular web server) defaults, I can view the IP address of visitors as well as their User Agent and other request information. Things like the IP address, User Agent, and even the order of requests could be used to 'fingerprint' requests coming from the WayBack Machine. From the IP address present in the screenshot, I can easily check which organization controls it.

Once the user sees this, they may tell the rest of the Telegram channel and move communications to a platform you are not monitoring. If I wanted to go further, I could set alerts for all IP addresses controlled by the Internet Archive, which are simple to find.
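To make that concrete, here is a rough sketch, from the site owner's side, of what such an alert might look like. It's written in Python; the CIDR ranges are placeholders (the real ones can be looked up from public whois/ASN data), and it assumes nginx's default 'combined' log format, which puts the client IP first on each line.

```python
import ipaddress

# Placeholder CIDR blocks believed to belong to an archiving service.
# A real setup would look these up from whois/ASN data and keep them updated.
WATCHED_RANGES = [ipaddress.ip_network(r) for r in ("203.0.113.0/24", "198.51.100.0/24")]

def is_watched(ip: str) -> bool:
    """Return True if the visitor IP falls inside any watched range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in WATCHED_RANGES)

# Scan the access log and flag matching visitors.
with open("/var/log/nginx/access.log") as log:
    for line in log:
        visitor_ip = line.split()[0]  # first field of nginx's default 'combined' format
        if is_watched(visitor_ip):
            print("Possible archiving crawler:", line.strip())
```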

I'm not saying that you shouldn't archive websites you come across; you definitely should. Rather, I am saying that before using public and popular archiving tools, consider a few things:

  1. Is the website owner likely to be technically proficient?
  2. Is the investigation ongoing, and would alerting the website owner that they are being observed create problems?

Even if the creator of the content you are archiving does not have access to website logs, they could intermittently check the public WayBack Machine itself to see if certain pages have been archived (this can also be achieved programmatically). In this sense, archiving with public tools could act as a canary.
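To illustrate the programmatic check, here is a minimal sketch using the WayBack Machine's public availability endpoint (https://archive.org/wayback/available). A site owner could run something like it on a schedule and watch for new snapshots of their own pages.

```python
import json
import urllib.parse
import urllib.request

def latest_snapshot(url: str):
    """Ask the WayBack Machine availability API for the most recent snapshot of a URL."""
    api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(url, safe="")
    with urllib.request.urlopen(api) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return (closest["timestamp"], closest["url"]) if closest else None

# Run on a schedule and compare against the previous result to detect new snapshots.
print(latest_snapshot("example.com/some-page"))
```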

Regardless of the above, you should always download pages locally and take screenshots. This is not nearly as good as having a copy on the WayBack Machine, but it is better than nothing.
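As a minimal sketch of the local copy (a single-file grab rather than a full mirror; screenshots still require a browser-based tool), something like this saves the raw bytes alongside a timestamp and hash, so you can later show what you captured and when:

```python
import hashlib
import time
import urllib.request

def save_page(url: str, path: str) -> str:
    """Download a page, save it locally, and return a SHA-256 digest of the raw bytes."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        raw = resp.read()
    with open(path, "wb") as f:
        f.write(raw)
    digest = hashlib.sha256(raw).hexdigest()
    # Record when and what was captured alongside the file itself.
    with open(path + ".meta.txt", "w") as meta:
        meta.write(f"{url}\n{time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())}\n{digest}\n")
    return digest
```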


For a solution to this problem, certain criteria must be met. The main value of website archives comes from their reputation and trust. While the Internet Archive's WayBack Machine has been successfully used as evidence in court before (although not without struggle), any tool that I would create would start with absolutely zero trust behind it. Ideally, an existing, reputable archiving platform would create a similar solution.

https://xkcd.com/927/

A web archive solution that avoids the problems I have outlined could include the following characteristics:

  • Extremely difficult to fingerprint
    • IP rotation is probably necessary, but continuously acquiring fresh IPs that people haven't noted down would cost a decent amount of money.
    • Randomize the User-Agent, chosen from a list of popular browsers and devices (and apply this kind of randomization to every request variable you can think of; see the sketch after this list).
    • Randomize the order of requests (don't always fetch index.html, then styles.css, then favicon.ico, etc.).
    • Don't load JavaScript unless necessary.
  • Archives cannot be public
    • Otherwise, the website owner could check or scrape the archive site at various intervals to see whether new snapshots have been taken.
    • Access should require some form of password. The password could, for example, be a hash of the HTML content of the page (also sketched below).
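Here is a rough sketch of a couple of those ideas in Python. The User-Agent pool, the delay ranges, and the hash-as-password scheme are illustrative assumptions, not a specification of how an archiving platform should actually build this.

```python
import hashlib
import random
import time
import urllib.request

# Illustrative pool only -- a real archiver would use a much larger, regularly updated list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]

def fetch(url: str) -> bytes:
    """Fetch a single resource with a randomly chosen User-Agent."""
    req = urllib.request.Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def archive(page_url: str, asset_urls: list[str]) -> dict[str, bytes]:
    """Grab a page and its assets in a shuffled order with jittered delays."""
    urls = [page_url] + asset_urls
    random.shuffle(urls)                      # don't always request index.html, then styles.css, then favicon.ico
    snapshot = {}
    for u in urls:
        snapshot[u] = fetch(u)
        time.sleep(random.uniform(0.5, 3.0))  # jitter the timing between requests as well
    return snapshot

def access_key(page_url: str, snapshot: dict[str, bytes]) -> str:
    """Derive the 'password' for a private snapshot from the page's own HTML content."""
    return hashlib.sha256(snapshot[page_url]).hexdigest()
```

Deriving the access key from the page's own HTML means only someone who saw the original content (or the person who archived it) can retrieve the snapshot, which is one way to keep archives non-public without a separate credential exchange.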

Please let me know if you have other ideas! (contact)
