“Search engines are completely useless for finding any content on the deep web,” said Ude.
So how can journalists harvest the deep web?
Think abstractly, said Ude. Don’t think about the specific content you want to find, but rather about where such content might exist; then find related databases. Search engines won’t enter names into a database’s search form, so you’ll have to do that yourself.
For example, if you need contact information for a specific architect and know where he or she lives, check if there’s a regional professional association database. That’s how Ude tracked down a source whose email address didn’t seem to exist online.
Here are four tips on how to identify databases that can give you information that Google won’t.
1. Who Runs the Database?
Who is likely to invest time and money to create and maintain a database with the kind of information you’re looking for? “This problem will not be solved by a search engine, but by your head,” Ude said.
2. Hack Search Engines
Find databases by searching for your topic together with “database OR directory OR catalogue OR registry” on a search engine. If you want some privacy, the Dutch company Startpage (www.startpage.com) runs your searches on Google without passing your information to the tech giant.
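As a rough illustration of the query pattern above, the following Python sketch combines a topic with the four OR-joined keywords. The function names and the example topic are hypothetical and only show how such a query string could be assembled before pasting it into a search engine.

```python
from urllib.parse import quote_plus

# Keywords taken from the tip above.
DATABASE_TERMS = ("database", "directory", "catalogue", "registry")


def build_database_query(topic, terms=DATABASE_TERMS):
    """Return the topic followed by the OR-joined database keywords."""
    topic = topic.strip()
    if not topic:
        raise ValueError("topic must not be empty")
    return topic + " " + " OR ".join(terms)


def encode_for_url(query):
    """URL-encode a query so it can be appended to a search URL."""
    return quote_plus(query)


if __name__ == "__main__":
    # "architects Bavaria" is an illustrative topic, not from the talk.
    query = build_database_query("architects Bavaria")
    print(query)
    print(encode_for_url(query))
```

The sketch deliberately stops at the query string, because the URL format of each search engine differs and may change.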
3. Use Wikipedia
Look up the topic on Wikipedia and check the “External links” section at the bottom of the page. Those links are of generally higher quality than those delivered by search engine results, according to Ude.
Follow Wikipedia category pages and keyword links, and search in local languages.
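For readers who want to collect these links in bulk, here is a minimal Python sketch that uses the MediaWiki API's `extlinks` property. It is an approximation: `extlinks` returns every external link on a page, including those in footnotes, not only the “External links” section. The function names and the sample response are assumptions made for illustration.

```python
from urllib.parse import urlencode


def extlinks_url(title, lang="en", limit=50):
    """Build a MediaWiki API URL that lists a page's external links.

    `lang` selects the Wikipedia edition, which helps when searching
    in local languages.
    """
    params = {
        "action": "query",
        "prop": "extlinks",
        "titles": title,
        "ellimit": limit,
        "format": "json",
        "formatversion": 2,
    }
    return "https://" + lang + ".wikipedia.org/w/api.php?" + urlencode(params)


def parse_extlinks(payload):
    """Extract link URLs from a decoded API response (a dict)."""
    pages = payload.get("query", {}).get("pages", [])
    if isinstance(pages, dict):  # older response format keys pages by id
        pages = list(pages.values())
    links = []
    for page in pages:
        for link in page.get("extlinks", []):
            # The key is "url" or "*" depending on the response format.
            url = link.get("url") or link.get("*")
            if url:
                links.append(url)
    return links


if __name__ == "__main__":
    print(extlinks_url("Deep web", lang="de"))
    # Hand-written sample response, for illustration only.
    sample = {"query": {"pages": [{"extlinks": [{"url": "https://example.org/"}]}]}}
    print(parse_extlinks(sample))
```

Fetching the URL and decoding the JSON is left out so the sketch runs without network access.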
4. Search for Database Lists
If searching in English, type the phrase “a * z database” into a search engine. This will return a list of “A to Z” databases.
Use a university library in your city. This will give you access to thousands of scientific databases that usually charge a subscription fee. Some universities charge non-students an annual fee to use their facilities, but this is much cheaper than paying for database subscriptions yourself.
German speakers can use the “database of databases.” The University of Regensburg lists more than 10,000 databases.
Be sure to search in other languages if relevant.
Ude shared databases that you absolutely must know:
Archives are among the best tools for finding records, especially deleted pages. For example, you can find information that a company may have removed or changed following a news event. Search the Wayback Machine for archived pages, or archive a page you want saved on Archive.today.
IANA Root Zone Database has information on who owns all valid, usable top-level domains. New information is not available in the EU due to new privacy laws, but there are ongoing efforts to negotiate access for journalists.
Common Vulnerabilities and Exposures is a great database to investigate internet fraud and has “every known security leak on the net,” according to Ude.
Tenders Electronic Daily lists exactly where the European Union is spending its money. Designed for investors, it’s updated daily.
Directory of Open Access Journals indexes peer-reviewed scientific journals whose articles are available for free.
National libraries can be excellent resources to find databases. Wikipedia has a list of national and state libraries.
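Checking the Wayback Machine for an archived copy, as mentioned in the list above, can also be scripted. The sketch below assumes the Internet Archive's availability endpoint at `https://archive.org/wayback/available`; the function names are hypothetical, and the response shape shown in the comments reflects how that endpoint has typically answered rather than a guarantee.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the availability query; timestamp is YYYYMMDD (optional)."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_API + "?" + urlencode(params)


def closest_snapshot(payload):
    """Return the archived URL from a decoded response, or None.

    Assumed shape: {"archived_snapshots": {"closest": {"available": true,
    "url": "...", "timestamp": "..."}}}
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url, timestamp=None):
    """Query the live service (needs network access)."""
    with urlopen(availability_url(page_url, timestamp), timeout=10) as resp:
        return closest_snapshot(json.load(resp))


if __name__ == "__main__":
    # Prints the request URL only; call lookup() to contact the service.
    print(availability_url("example.com", "20200101"))
```

Passing a date shortly before a news event is one way to look for the version of a page that existed before a company changed it.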