The Centre for Investigative Journalism

Web Scraping for Journalists

This two-day workshop is designed for reporters with no knowledge of scraping or programming. It provides essential skills for getting original stories by compiling data from a range of online sources. By the end of the workshop, you will be able to use specialist scraping tools (without programming) and begin to write your own, more advanced, scrapers. You will also be able to communicate with programmers on relevant projects.

Scraping is the process of automatically collating information from the web. It might mean grabbing entries across hundreds of webpages, or fetching and combining dozens of spreadsheets or thousands of PDFs.
The results have led to exclusive stories for organisations ranging from the Bureau of Investigative Journalism and Trinity Mirror, to DC Thomson, Channel 4 and the BBC.

Technical Requirements

Delegates will use their own laptops. They should have a Google Drive account and should download the free version of OutWit Hub ahead of the course. A GitHub account would also be useful.

The software is all free. However, the free version of OutWit Hub only allows you to scrape 100 rows, so you may want to pay for the full version. You can decide after you’ve learnt how to use it on the course.

9 December 2019 – Scraping basics

Introduction: What scraping is and how news organisations are using it
Pitching story ideas involving scraping
Scraping basics: finding structure in HTML and URLs
Simple scraping jobs: checking a webpage every day; identifying information using XPath
Introduction to scraping tools: OutWit Hub
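
To give a flavour of "finding structure in HTML" and "identifying information using XPath", here is a minimal sketch in Python. It is an illustration only, not course material: the page fragment, the table id "inspections" and the venue names are all invented, and it uses the limited XPath support in Python's standard library, which only handles well-formed markup. Real webpages usually need a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed HTML fragment standing in for a real webpage.
PAGE = """
<html><body>
  <table id="inspections">
    <tr><th>Venue</th><th>Rating</th></tr>
    <tr><td>Cafe A</td><td>5</td></tr>
    <tr><td>Cafe B</td><td>2</td></tr>
  </table>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# XPath-style query: every row (tr) directly inside the table
# whose id attribute is "inspections".
rows = root.findall(".//table[@id='inspections']/tr")

records = []
for row in rows:
    # Collect the text of each data cell (td) in this row.
    cells = [td.text for td in row.findall("td")]
    if cells:  # the header row contains only th cells, so it is skipped
        records.append(cells)

print(records)
```

The XPath expression describes where the data sits in the page's structure, which is the same idea point-and-click tools rely on when you tell them which part of a page to extract.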

10 December 2019 – Looking at what’s available

Advanced OutWit Hub: scraping multiple pages
What’s possible with programming: APIs, regex and loops
Scraping text that fits a pattern: regex
Advanced scraping options: coding, PDFs and spreadsheets
Project surgery: your scraping challenges
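
As a rough illustration of "regex and loops", the sketch below loops over a few pieces of text and pulls out anything that fits a pattern. It is an assumption-laden example rather than part of the workshop: the sample sentences, the company name and the two patterns (pound amounts and dd/mm/yyyy dates) are invented for demonstration.

```python
import re

# Hypothetical text, as might be extracted from several documents.
DOCUMENTS = [
    "Contract awarded on 03/04/2019 for £12,500 to Acme Ltd.",
    "No payment details were recorded.",
    "Payments of £1,200 and £980 were made on 17/11/2018.",
]

# A pound sign followed by a digit, then any further digits or commas.
AMOUNT = re.compile(r"£\d[\d,]*")
# Dates written as dd/mm/yyyy.
DATE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")

results = []
for text in DOCUMENTS:  # the loop applies the same patterns to every document
    results.append({
        "amounts": AMOUNT.findall(text),
        "dates": DATE.findall(text),
    })

for item in results:
    print(item)
```

The same loop would work unchanged on thousands of documents, which is the main reason to move from manual copying to a scripted approach.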

Paul Bradshaw

Professor Paul Bradshaw is an online journalist and blogger who leads the MA in Multiplatform and Mobile Journalism at Birmingham City University. He runs his own blog, the Online Journalism Blog (OJB), and co-founded Help Me Investigate, an investigative journalism website funded by Channel 4 and Screen WM.
  • 9 December 2019 10.00–17.00
  • 10 December 2019 09.00–16.00
Location: Goldsmiths, University of London