Web Scraping for Journalists

This two-day scraping workshop is designed for reporters with no prior knowledge of scraping or programming. It provides the essential skills for getting original stories by compiling data from a range of online sources. By the end of the workshop, you will be able to use specialist scraping tools (without programming) and begin to write your own, more advanced, scrapers. You will also be able to communicate with programmers on relevant projects.

Scraping is the process of automatically collating information from the web. It might mean grabbing entries from hundreds of webpages, or fetching and combining dozens of spreadsheets or thousands of PDFs.
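To make the idea concrete, here is a minimal sketch in Python of what "grabbing entries" from a webpage can look like. It is illustrative only and not part of the course materials: the sample HTML, the council names and the helper names (TableScraper, scrape_table) are all made up for this example, and a real scraper would download the page rather than read it from a string.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would download this HTML instead.
SAMPLE_HTML = """
<table>
  <tr><th>Council</th><th>Spend</th></tr>
  <tr><td>Northtown</td><td>1200</td></tr>
  <tr><td>Southby</td><td>950</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, each a list of cell strings
        self._row = None       # the row currently being read, if any
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def scrape_table(html):
    """Return the table in `html` as a list of rows."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = scrape_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

The same loop could be pointed at hundreds of pages in turn, which is where scraping starts to save serious time compared with copying and pasting by hand.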

Scraping has led to exclusive stories for organisations ranging from the Bureau of Investigative Journalism and Trinity Mirror to DC Thomson, Channel 4 and the BBC.

Technical Requirements

Delegates will use their own laptops. Before the course, each delegate should set up a Google Drive account and download the free version of OutWit Hub. A GitHub account would also be useful.

The software is all free. However, the free version of OutWit Hub only allows you to scrape 100 rows, so you may want to pay for the full version. You can decide after you have learnt how to use it on the course.

9 December 2019 – Scraping basics

10:00–10:30

Registration

10:30–11:30

Introduction: What scraping is and how news organisations are using it

11:30–12:15

Pitching story ideas involving scraping

12:15–13:00

Scraping basics: finding structure in HTML and URLs

13:00–14:00

Lunch

14:00–15:45

Simple scraping jobs: checking a webpage every day; identifying information using XPath

16:00–17:00

Introduction to scraping tools: OutWit Hub

10 December 2019 – Looking at what’s available

09:00–10:00

Advanced OutWit Hub: scraping multiple pages

10:00–10:15

What’s possible with programming: APIs, regex and loops

10:15–12:00

Scraping text that fits a pattern: regex

12:00–13:00

Lunch

13:00–14:00

Advanced scraping options: coding, PDFs and spreadsheets

15:00–16:00

Project surgery: your scraping challenges

Paul Bradshaw

Professor Paul Bradshaw is an online journalist and blogger who leads the MA in Data Journalism at Birmingham City University. He runs his own blog, the Online Journalism Blog (OJB), and co-founded Help Me Investigate, an investigative journalism website funded by Channel 4 and Screen WM.

9 December 2019, 10:00–17:00
10 December 2019, 09:00–16:00

Location: Goldsmiths, University of London