Faster than FOI and more detailed than advanced search techniques, scraping also allows you to grab data that organisations would rather you didn’t have – and put it into a form that allows you to get answers.
Scraping - getting a computer to capture information from online sources - is one of the most powerful techniques for data-savvy journalists who want to get to the story first, or find exclusives that no one else has spotted.
Paul Bradshaw will show you how to scrape content from the web and find stories that you might otherwise have missed.
This two-day workshop in scraping is designed for reporters with no knowledge of scraping or programming and provides essential skills for getting original stories by compiling data across a range of online sources. By the end of the workshop, you will be able to use specialist scraping tools (without programming) and begin to write your own, more advanced, scrapers. You will also be able to communicate with programmers on relevant projects. (See below for more information and technical requirements; you must bring your own laptop.)
DAY 1
- 10-10:30am Registration
- 10:30-11:15am Introduction: What scraping is and how news organisations are using it
- 11:30am-12:15pm Pitching story ideas involving scraping
- 12:15-1pm Scraping basics: finding structure in HTML and URLs
- 1-2pm Lunch
- 2-3:45pm Simple scraping jobs: checking a webpage every day; identifying information using XPath
- 4-5pm Introduction to scraping tools: Workbench and Outwit Hub
DAY 2: Looking at what's available
- 9-10:15am Advanced Workbench and Outwit Hub: scraping multiple pages
- 10:30am-12pm Scraping text that fits a pattern: regex
- 12-1pm Lunch
- 1-2pm What’s possible with programming: APIs, loops, PDFs and spreadsheets
- 2-5pm Project surgery: your scraping challenges
Large organisations (10+ people) – £309
Small organisations (9 people and fewer) – £232
Freelancers – £177
Full-time students – £109
Goldsmiths students – £87
All courses are held at Goldsmiths, University of London, and run from 10am to 5pm.