Web Scraping for Journalists

    Scraping - getting a computer to capture information from online sources - is one of the most powerful techniques for data-savvy journalists who want to get to the story first, or find exclusives that no one else has spotted.

    Room - TBC, Goldsmiths, University of London
    23 January 2018 - 25 January 2018

    Paul Bradshaw will show you how to scrape content from the web and find stories that otherwise might have been missed.


    Big organisations (10+ people) - £405
    Freelancers and small organisations (9 people and fewer) - £305
    Students (correspondence/evening course) - £205 (limited availability)
    Students (full time) - £155 (limited availability)

    Full-time Goldsmiths students get a 20% discount on all CIJ courses. Please contact marina(at)tcij.org for more details. (Limited availability.)

    Please see below for registration and payment options. 

    Course Outline

    This three-day workshop in scraping is designed for reporters with no knowledge of scraping or programming and provides essential skills for getting original stories by compiling data across a range of online sources. By the end of the workshop, you will be able to use specialist scraping tools (without programming) and begin to write your own, more advanced, scrapers. You will also be able to communicate with programmers on relevant projects. (See below for more information and technical requirements - you must bring your own laptop).

    Paul Bradshaw runs the MA in Data Journalism and the MA in Multiplatform and Mobile Journalism at Birmingham City University, and also works as a consulting data journalist with the BBC England Data Unit. A journalist, writer and trainer, he has worked with news organisations including The Guardian, Telegraph, Mirror, Der Tagesspiegel and The Bureau of Investigative Journalism. He publishes the Online Journalism Blog, is the co-founder of the award-winning investigative journalism network HelpMeInvestigate.com, and has been listed on both Journalism.co.uk's list of leading innovators in media, and the US Poynter Institute's list of the 35 most influential people in social media.
    Tuesday, 23 January: Scraping basics
    Room: TBC
    10-10:30am       Registration
    10:30-11:15am    Introduction: What scraping is and how news organisations are using it
    11:30am-12:15pm  Pitching story ideas involving scraping
    12:15-1pm        Scraping basics: finding structure in HTML and URLs
    1-2pm            Lunch
    2-3:45pm         Simple scraping jobs: checking a webpage every day; identifying information using XPath
    4-5pm            Introduction to scraping tools: OutWit Hub
    Wednesday, 24 January: Looking at what's available
    Room: TBC
    9-10am           Advanced OutWit Hub: scraping multiple pages
    10-10:15am       What's possible with programming: APIs, regex and loops
    10:30am-12pm     Scraping text that fits a pattern: regex
    12-1pm           Lunch
    1-3:45pm         Basic scraping with Python and Morph.io
    4-5pm            Scraping database search results by following links: loops
    Thursday, 25 January: Advanced techniques
    Room: TBC
    9-10am           Advanced scraping: spreadsheets
    10-11am          Advanced scraping: PDFs
    11am-12pm        Scraping lab: problem solving
    12-1pm           Lunch
    1-4pm            Scraping lab: problem solving
    4-5pm            Wrap up, final results
    More details: 
    Faster than FOI and more detailed than advanced search techniques, scraping also allows you to grab data that organisations would rather you didn’t have - and put it into a form that allows you to get answers.
    Scraping is the process of automatically collating information from the web. It might mean grabbing entries across hundreds of webpages, or fetching and combining dozens of spreadsheets or thousands of PDFs.
    The results have led to exclusive stories for organisations ranging from the Bureau of Investigative Journalism and Trinity Mirror, to DC Thomson, Channel 4 and the BBC.
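    To give a flavour of what a simple scraper does, here is a minimal Python sketch using only the standard library. It is an illustration rather than course material: the sample HTML, the council names and the helper names (RowParser, scrape_rows, pounds_to_int) are all made up for this example, and a real scraper would fetch the page from a website instead of using a built-in string.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page, standing in for HTML fetched from a real website.
SAMPLE_HTML = """
<table>
  <tr><td>Northtown Council</td><td>£1,200</td></tr>
  <tr><td>Southville Council</td><td>£950</td></tr>
</table>
"""


class RowParser(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, each a list of cell strings
        self._row = None       # the row currently being read, if any
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape_rows(html):
    """Return the table rows found in an HTML string."""
    parser = RowParser()
    parser.feed(html)
    return parser.rows


def pounds_to_int(text):
    """Use a regex to pull a whole-pound amount such as '£1,200' out of text."""
    match = re.search(r"£([\d,]+)", text)
    return int(match.group(1).replace(",", "")) if match else None


rows = scrape_rows(SAMPLE_HTML)
totals = {name: pounds_to_int(amount) for name, amount in rows}
```

    The same two ideas, finding structure in the HTML and matching text that fits a pattern, scale up to hundreds of pages once the parsing step is wrapped in a loop over URLs.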
    Technical requirements: 
    Delegates will use their own laptops and should have a Google Drive account and the free version of OutWit Hub downloaded. A GitHub account would also be useful.
    The software is all free. However, the free version of OutWit Hub only allows you to scrape 100 rows, so you may want to pay for the full version; you can decide after you've learnt how to use it on the course.
    BOOK NOW: two options

    (1) Eventbrite booking HERE


    (2) PayPal/bank transfer

    Please REGISTER HERE first. 

    Then pay by bank transfer (details in the registration form), by credit card (via PayPal - no account necessary) or with your PayPal account.
    Your place is not confirmed until full payment is received.
