The Centre for Investigative Journalism

Data Cleaning with Pandas 1

Data cleaning can feel more like data penance, but Pandas can ease your pain, allowing you to clean and structure your data with minimal hassle. Jupyter Notebook’s interactive environment helps you keep track of your changes and allows you to explore your data.

Participants can expect to learn how to clean large, complicated datasets quickly and how to explore data too large for Excel using the browser-based Jupyter Notebook.
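As a taste of what the workshop covers, here is a minimal sketch of a typical pandas cleaning pass: stripping stray whitespace, coercing a messy column to numbers, and dropping duplicates and missing values. The dataset is invented for illustration.

```python
import io
import pandas as pd

# A small messy dataset standing in for a real CSV (hypothetical data).
raw = io.StringIO(
    "name,amount\n"
    " Alice ,1200\n"
    "Bob,not available\n"
    " Alice ,1200\n"
    "Carol,3400\n"
)

df = pd.read_csv(raw)

# Strip stray whitespace from the name column.
df["name"] = df["name"].str.strip()

# Coerce amounts to numbers; unparseable values become NaN.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Drop exact duplicate rows, then rows with missing amounts.
df = df.drop_duplicates().dropna(subset=["amount"])

print(df)
```

In a notebook, each of these steps would sit in its own cell, so you can inspect the data after every change.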

Why Python?

Python makes it easy to replicate your analysis at a later stage and reduces the risk of human error common in Excel. Your work is also shareable within teams, and you can document and explain it within the notebook, so you can come back later and easily pick up where you left off.
There is no practical upper limit on data size: you can use Python on a CSV with 10 rows or a billion. Eventually the limitation becomes the amount of RAM on your machine, at which point you need to switch to a server.

Technical Requirements

Participants should have at least basic coding experience.

Karrie Kehoe

Karrie Kehoe is a data journalist and researcher on the Data and Research team at the International Consortium of Investigative Journalists. Karrie has worked on award-winning global investigations like the FinCEN Files, Pandora Papers, Uber Files, Implant Files and more recently Deforestation Inc.

Max Harlow

Max Harlow works on the visual and data journalism team at the Financial Times, focusing on investigations. He also runs Journocoders, a group for journalists to develop technical skills for use in their reporting.