How would you classify all the PDFs on the internet? Well, that is what I tried doing this time.

Let's begin with the mother of all datasets: Common Crawl, or CC, a web archive of the internet that is currently petabytes in size and has been running since 2007. Maybe you know about the Internet Archive, which is almost the same thing, the main difference being that Common Crawl focuses on archiving the internet for scientists and researchers rather than on digital preservation.

What this translates into is that CC doesn't save PDFs in full when it finds them. Specifically, when Common Crawl gets to a PDF, it stores only the first megabyte and truncates the rest.
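You can see this truncation directly in the WARC files that CC publishes. Below is a minimal sketch (not from the original post) that uses the warcio library to scan a local WARC file and flag PDF responses that were cut off; Common Crawl marks these with a WARC-Truncated header, and a cut-off body also tends to be missing its %%EOF trailer. The file path is a placeholder.

```python
# Minimal sketch: find truncated PDF responses in a Common Crawl WARC file.
# Assumes `pip install warcio` and a locally downloaded WARC; the path below
# is a placeholder, not an official Common Crawl file name.
from warcio.archiveiterator import ArchiveIterator

def truncated_pdfs(warc_path):
    """Yield (url, size_in_bytes) for PDF responses that look truncated."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            ctype = record.http_headers.get_header("Content-Type") or ""
            if "application/pdf" not in ctype:
                continue
            body = record.content_stream().read()
            # CC flags cut-off payloads with a WARC-Truncated header; a body
            # without the %%EOF trailer is another telltale sign.
            truncated = record.rec_headers.get_header("WARC-Truncated")
            if truncated or not body.rstrip().endswith(b"%%EOF"):
                yield record.rec_headers.get_header("WARC-Target-URI"), len(body)

for url, size in truncated_pdfs("example.warc.gz"):
    print(f"{url} ({size} bytes, truncated)")
```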

This is where SafeDocs, or CC-MAIN-2021-31-PDF-UNTRUNCATED, enters the picture. This corpus was originally created by the DARPA SafeDocs program, which refetched all the PDFs from a single snapshot of Common Crawl to obtain untruncated versions of them. The dataset is incredibly big: roughly 8.4 million PDFs that total about 8 TB uncompressed, making it the biggest pure-PDF dataset on the internet¹.
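Once you have a local copy of the corpus, a quick tally is enough to sanity-check those numbers. A minimal sketch, assuming the PDFs have been unpacked under a placeholder directory (not the corpus's official layout):

```python
# Count the PDFs in a local copy of the corpus and sum their sizes.
from pathlib import Path

corpus_dir = Path("cc-main-2021-31-pdf-untruncated")  # placeholder path

count = 0
total_bytes = 0
for pdf in corpus_dir.rglob("*.pdf"):
    count += 1
    total_bytes += pdf.stat().st_size

print(f"{count:,} PDFs, {total_bytes / 1e12:.2f} TB uncompressed")
```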