Archivists Create a Searchable Index of 107 Million Science Articles

The General Index is here to serve as your map to human knowledge. Pulled from 107,233,728 journal articles, it is a searchable collection of keywords and short phrases from published papers, a guide to the paywalled domains of scientific knowledge.

In full, The General Index is a massive 38-terabyte archive of searchable terms. Compressed, it comes to 8.5 terabytes. It can be pulled directly from archive.org, though that download can be difficult and slow. People on the /r/DataHoarder subreddit have uploaded the data to a remote server and are distributing it over BitTorrent. You can help by seeding the torrent.

The General Index does not contain the full text of the journal articles it references, only the keywords and n-grams (short runs of consecutive words taken from the text) that make tracking down a specific article easier. “This is an early release of the General Index, a work in progress,” Carl Malamud, the founder of Public.Resource.org and co-creator of the General Index, said in a video about the archive. “In some cases text extraction failed. Sometimes metadata is not available or is perhaps incorrect. While the underlying corpus is large, it is not complete and it is not up to date.”
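To make the idea of an n-gram index concrete, here is a minimal Python sketch. It is not the General Index's actual pipeline or file format, neither of which is described here. The function names, the article IDs, and the sample text are all made up for illustration, and the tokenizer is deliberately crude.

```python
import re
from collections import defaultdict


def ngrams(text, n):
    """Return every run of n consecutive words in text, lowercased."""
    # Crude tokenizer (an assumption): keep letters, digits, and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_index(articles, max_n=3):
    """Map each n-gram of 1..max_n words to the IDs of articles containing it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for n in range(1, max_n + 1):
            for gram in ngrams(text, n):
                index[gram].add(article_id)
    return index


# Hypothetical article IDs and invented text, for illustration only.
articles = {
    "article-1": "Gene expression in yeast cells",
    "article-2": "Yeast cells under heat stress",
}

index = build_index(articles)
print(sorted(index["yeast cells"]))   # both articles contain this phrase
print(sorted(index["heat stress"]))   # only article-2 does
```

The index stores phrases and pointers rather than the articles themselves, which is why a lookup can tell you where a phrase appears without reproducing the paper.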

For Malamud, a searchable database of scientific knowledge is key to human progress. “This is a lookup tool, a dictionary of knowledge, a map to knowledge, a tool that we believe is a central facility to the practice of science in our modern age,” he said. “We view this as a public utility. We assert no ownership over the General Index. It is dedicated to the public domain. A series of unencumbered facts with which you can do what you will. There are no rights reserved.”

Publicly sharing paywalled scientific articles is, technically, against the law. Several governments have been trying to shut down Sci-Hub, the Pirate Bay of science, for years now. Malamud is banking on The General Index being transformative enough to fall under the public domain.

Malamud has gotten into trouble for this kind of thing before. The State of Georgia sued him and accused him of terrorism after he posted its laws online for everyone to read. The case went to the Supreme Court and Malamud won. 

“Science is a language we must all speak if we are to better our world,” Malamud said of The General Index.