Web Scraping for Journalists

    Scraping - getting a computer to capture information from online sources - is one of the most powerful techniques for data-savvy journalists who want to get to the story first, or find exclusives that no one else has spotted.


    Goldsmiths, University of London
    19 January 2016 - 21 January 2016

    Paul Bradshaw will show you how to scrape content from the web and find stories that otherwise might have been missed.

    This course is now fully booked. If you would like your name to be added to the waiting list, please email: juliet@tcij.org

    Registration will take place at 10am in the Ben Pimlott Building PC lab 3/4:

    Goldsmiths, University of London, Lewisham Way, New Cross, London SE14 6NW
    See a map of the area for more details
    Sadly we've not been able to book the same room for the entire course, so there'll be some moving around:
    Tuesday: Ben Pimlott Building PC lab 3/4
    Wednesday: Richard Hoggart Building room 2107
    Thursday: Richard Hoggart Building room 143

    Richard Hoggart Building: http://www.gold.ac.uk/static/virtual-tours/richard-hoggart-building.html

    Course price

    We are offering a 25% discount for the second and subsequent delegates from the same organisation.


    Course Outline

    Faster than FOI and more detailed than advanced search techniques, scraping also allows you to grab data that organisations would rather you didn’t have - and put it into a form that allows you to get answers.

    Scraping is the process of automatically collating information from the web. It might mean grabbing entries from across hundreds of webpages, fetching and combining dozens of spreadsheets, or extracting data from thousands of PDFs.
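    To give a flavour of what the course covers, here is a minimal sketch of the idea in Python (the language taught on day two). It parses a small invented HTML snippet — not any real page from the course — using only the standard library, and pulls each repeating table row into structured data, which is the core move behind every scraper:

    ```python
    from html.parser import HTMLParser

    # Invented sample HTML for illustration; a real scraper would fetch
    # the page over the web first. The key point is the repeating
    # structure (one <tr> per record) that the scraper walks through.
    SAMPLE_PAGE = """
    <table>
      <tr><td>Council A</td><td>1200</td></tr>
      <tr><td>Council B</td><td>950</td></tr>
    </table>
    """

    class TableScraper(HTMLParser):
        def __init__(self):
            super().__init__()
            self.rows = []          # collected (name, value) pairs
            self._cells = None      # cells of the row currently being read
            self._in_cell = False

        def handle_starttag(self, tag, attrs):
            if tag == "tr":
                self._cells = []    # a new record begins
            elif tag == "td":
                self._in_cell = True

        def handle_endtag(self, tag):
            if tag == "td":
                self._in_cell = False
            elif tag == "tr" and self._cells:
                self.rows.append(tuple(self._cells))  # record complete

        def handle_data(self, data):
            if self._in_cell:
                self._cells.append(data.strip())

    scraper = TableScraper()
    scraper.feed(SAMPLE_PAGE)
    print(scraper.rows)  # [('Council A', '1200'), ('Council B', '950')]
    ```

    The same pattern — find the repeating structure, extract each entry — scales from two rows to the hundreds of pages mentioned above, and the course's visual tools (OutWit Hub, import.io) do this without any code at all.
    
    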

    The results have led to exclusive stories for organisations ranging from the Bureau of Investigative Journalism and Trinity Mirror to DC Thomson, Channel 4 and the BBC.

    This three-day workshop in scraping is designed for reporters with no knowledge of scraping or programming and provides essential skills for getting original stories by compiling data across a range of online sources.

    By the end of the workshop, you will be able to use specialist scraping tools (without programming) and begin to write your own, more advanced, scrapers. You will also be able to communicate with programmers on relevant projects.

    Delegates will be using their own laptops and should have a Google Drive account, download import.io and the free version of OutWit Hub, and install the Kimono Labs extension for Chrome. A GitHub account would also be useful.

    The software is all free. However, the free version of OutWit Hub only allows you to scrape 100 rows, so you may want to pay for the full version; you can decide after you've learnt how to use it on the course.

    Please get in touch with Juliet (juliet@tcij.org) if you have any questions.

    About Paul Bradshaw
    Paul Bradshaw is an online journalist and writer, and a professor at Birmingham City University. He has worked with news organisations including the Bureau of Investigative Journalism, BBC, The Guardian, Mirror, the Balkan Investigative Reporters Network, Scotland's Sunday Post, and ITV.

    He manages his own blog, the Online Journalism Blog (OJB), and is the co-founder of Help Me Investigate, an investigative journalism website funded by Channel 4 and Screen WM. He has written about journalism for journalism.co.uk, Press Gazette, The Guardian's Data Blog, Nieman Reports and the Poynter Institute in the US.
    Bradshaw is the author of the Online Journalism Handbook, co-written with former Financial Times web editor Liisa Rohumaa, and a number of books on data journalism including Scraping for Journalists, The Data Journalism Heist, and Finding Stories in Spreadsheets (leanpub.com/u/paulbradshaw).
    Paul was a Visiting Professor at City University's School of Journalism in London for five years. He has also contributed to books including Investigative Journalism (2nd Ed), FOI Ten Years On, Ethics for Digital Journalists, and Data Journalism: Mapping the Future.


    Please note, the timetable is subject to change.

    Tuesday, 19 January: Scraping basics
    Ben Pimlott Building PC lab 3/4

    10-10:30am           Registration

    10:30-11:15am      Introduction: What scraping is and how news organisations are using it

    11:30am-12:15pm   Pitching story ideas involving scraping

    12:15-1pm             Scraping basics: finding structure in HTML and URLs

    1-2pm                    Lunch

    2-3:45pm               Simple scraping jobs: checking a webpage every day; identifying information using XPath

    4-5pm                    Introduction to scraping tools: OutWit Hub

    Wednesday, 20 January: Looking at what's available
    Richard Hoggart Building room 2107

    9-10am                   Advanced OutWit Hub: scraping multiple pages

    10-10:15am            What's possible with programming: APIs, regex and loops

    10:30am-12pm       Scraping text that fits a pattern: regex

    12-1pm                    Lunch

    1-3:45pm                 Basic scraping with Python and Morph.io

    4-5pm                      Scraping database search results by following links: loops

    Thursday, 21 January: Advanced techniques
    Richard Hoggart Building room 143

    9-10am                    Advanced scraping: spreadsheets

    10-11am                  Advanced scraping: PDFs

    11am-12pm             Scraping lab: problem solving

    12-1pm                    Lunch

    1-4pm                      Scraping lab: problem solving

    4-5pm                      Wrap up, final results