Foundations of data journalism – Scraping
Scrape data from web pages with Google Sheets and from PDFs with Tabula and PDFtoExcel. Build graphics with Flourish Studio. We’ll also explore how to scrape data from a photograph using Excel.
Learn how to probe through millions of documents, including audio, video, PDFs and more, to find patterns or that proverbial needle in a haystack. You’ll see case studies of how Gannett journalists have used this software to break big investigative stories.
Download the collection of links discussed during the training below.
Execute deeper, more precise, more sophisticated web searches by harnessing specialized search engines. Diversify your stories, angles, sources and readers.
We cover:
Tips and traps: Tools such as the Signal app, Freedome and VPNs to keep you, your data and your sources safe. We’ll look at a few Google tools (including how to unplug from Maps tracking) as well as resources from the Digital Security section of Journalist’s Toolbox.
We also look at data scraping: how to scrape data from web pages with Google Sheets and browser-based plug-ins, and how to scrape PDFs with Tabula. Students should download the Tabula software at http://tabula.technology
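For readers curious about what a table scrape does under the hood, here is a minimal Python sketch that uses only the standard library. It pulls the rows out of an HTML table held in a string, which is roughly the job the IMPORTHTML function in Google Sheets does for a live URL. The sample HTML, the county figures and the names TableScraper and scrape_table are made up for illustration; this is not how any of the session's tools are implemented.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, row by row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row being built, or None outside a <tr>
        self._cell = None  # text pieces of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample page, standing in for HTML fetched from a real site.
SAMPLE_HTML = """
<table>
  <tr><th>County</th><th>Cases</th></tr>
  <tr><td>Adams</td><td>1,204</td></tr>
  <tr><td>Brown</td><td>987</td></tr>
</table>
"""


def scrape_table(html):
    """Return the table cells in `html` as a list of rows."""
    scraper = TableScraper()
    scraper.feed(html)
    return scraper.rows


rows = scrape_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

The numbers come back as text ("1,204"), just as they often do in a spreadsheet scrape, so they still need cleaning before any math.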
CSV Match is an open-source fuzzy-matching library that uses some of the same algorithms as OpenRefine (formerly Google Refine), only it finds matches between two files rather than within one.
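To make the idea of matching between two files concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not CSV Match's own method: it scores name pairs with SequenceMatcher from Python's standard difflib module, and the names, the 0.8 threshold and the helper functions normalize and best_matches are all invented for the example.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and drop punctuation so trivial differences do not block a match."""
    kept = [ch for ch in name.lower() if ch.isalnum() or ch.isspace()]
    return " ".join("".join(kept).split())


def best_matches(left, right, threshold=0.8):
    """For each name in `left`, find the closest name in `right`.

    Only pairs scoring at or above `threshold` (an arbitrary cutoff) are kept.
    Assumes `right` is not empty.
    """
    matches = []
    for a in left:
        scored = [
            (SequenceMatcher(None, normalize(a), normalize(b)).ratio(), b)
            for b in right
        ]
        score, b = max(scored)
        if score >= threshold:
            matches.append((a, b, round(score, 2)))
    return matches


# Made-up names standing in for one column from each of two files.
donors = ["Jon Smith", "ACME Holdings, LLC", "Maria Garcia"]
contractors = ["John Smith", "Acme Holdings LLC", "Pat Lee"]

for match in best_matches(donors, contractors):
    print(match)
```

A fuzzy match is a lead, not a finding: every pair a script like this flags still needs to be confirmed by reporting.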
Learn tools and techniques for working with data buried in PDF files. Python experience is recommended.
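As a taste of the Python involved, here is a minimal sketch of one common follow-up step: cleaning a table after it has been pulled out of a PDF and saved as CSV, which Tabula can export. It does not read a PDF itself. The sample rows, the column names and the helpers to_number and load_rows are hypothetical, and the code uses only the standard library.

```python
import csv
import io

# Made-up CSV text, standing in for a file exported from a PDF table.
EXPORTED = '''Coach,School,Total Pay
"Smith, Pat",State U,"$1,250,000"
"Lee, Chris",Tech,"$980,500"
'''


def to_number(text):
    """Turn a string like "$1,250,000" into an integer; return None if blank."""
    digits = text.replace("$", "").replace(",", "").strip()
    return int(digits) if digits else None


def load_rows(csv_text):
    """Read CSV text into dictionaries, converting the pay column to numbers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        row["Total Pay"] = to_number(row["Total Pay"])
        rows.append(row)
    return rows


rows = load_rows(EXPORTED)
# With real numbers instead of text, the rows can be sorted by pay.
for row in sorted(rows, key=lambda r: r["Total Pay"], reverse=True):
    print(row["Coach"], row["Total Pay"])
```

Dollar signs and thousands separators are typical of tables extracted from PDFs, and sorting or summing gives wrong answers until they are stripped.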
Do you have a dataset you want to feature in your article as a table or a searchable database? There are a few easy tools to help you show your work: Flourish, Airtable and Tableizer.
Links to tools:
Flourish (set up a free account beforehand): https://flourish.studio/
Airtable: https://airtable.com/
Tableizer: https://tableizer.journalistopia.com/
Data to Build the Tables and Databases
COVID-19 Cases/Deaths by County: https://docs.google.com/spreadsheets/d/114DjZZqJFxoOV_4lxgyzDH9X-kveXDOH/edit?usp=drive_web&ouid=101717595278789621083&rtpof=true
Football Coach Salaries: https://drive.google.com/drive/folders/10UnuNnB0McI_g2Qhxl91GRp_ZQ3i-0AU
Link to PowerPoint: https://docs.google.com/presentation/d/1kP6atqFTMi9kwq-PuWBnNky_pXMAaTN5/edit?usp=drive_web&ouid=101717595278789621083&rtpof=true
As a reporter, if you remember nothing else from college stats, it should be how to tell if a number is both newsworthy and trustworthy. This session also might be called: advanced math for journalists!
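Two of the calculations reporters reach for most often when sizing up a number are percent change and a per-capita rate. The Python sketch below shows both; the function names and every figure in it are made up for illustration and are not taken from the session materials.

```python
def percent_change(old, new):
    """Percent change from `old` to `new`; undefined when `old` is zero."""
    if old == 0:
        raise ValueError("percent change is undefined when the starting value is 0")
    return (new - old) / old * 100


def rate_per_100k(count, population):
    """Events per 100,000 residents, for comparing places of different sizes."""
    return count / population * 100_000


# Made-up figures for illustration only.
print(round(percent_change(40, 50), 1))         # burglaries rose from 40 to 50
print(round(rate_per_100k(50, 25_000), 1))      # small town: 50 burglaries
print(round(rate_per_100k(400, 1_000_000), 1))  # big city: 400 burglaries
```

In this invented example the town has far fewer burglaries than the city but a rate five times higher, which is why raw counts alone rarely settle whether a number is newsworthy.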
Data visualization for everyone using Infogram, an easy but powerful tool available to journalists across Gannett.
Here’s data for a line chart example we used: https://docs.google.com/spreadsheets/d/1HGXDys6Rxd9oqYIfOseTjzp2XcBQo13E0VJhkfVn3GQ/edit?usp=sharing
Here’s the MVP data from Sports Reference as an example we used, with pivot table and final sheet tabs: https://docs.google.com/spreadsheets/d/1NMANIMwjiXFm4Yb4VRD7d8EoWYY-6Pn4LSE1lJBT-VY/edit?usp=sharing
Iframe wrapper for embedding: https://www.gannett-cdn.com/experiments/usatoday/tools/static-embed-generator/index.html
More detail on the table feature in Infogram: https://infogram.com/covid-19-vaccines-at-ford-field-and-detroit-mobile-sites-1hdw2jplgwn8j2l
Example of an image, text and chart combo on one Infogram: https://infogram.com/candidate-josh-kaul-1h8n6mkq58d92xo?live
Government databases are public records, yet most agencies treat records requests for data as something exotic and fraught with challenges. Anticipate the objections and defeat them for the win. Conversation led by Steve Suo and Nick Penzenstadler of the USA TODAY investigations team.
Slice and dice data to organize your reporting, find patterns and reveal better stories. Led by Erin Mansfield and Nick Penzenstadler of the USA TODAY data/investigations team.
How are you sure that great source with the perfect quote isn’t too good to be true? Even great reporters can get tricked by fake names or sketchy backgrounds. We’ll walk through some websites and strategies you can use to create a routine and spot potential red flags before you get burned.
Sign up for Google’s Backlight tool: goo.gle/getbacklight
Find information faster by learning how to power search on special sites for datasets, images and court documents. See what users in your area are searching and discover story ideas on Google Trends. Explore Backlight, an AI tool for parsing massive amounts of documents.
A discussion on developing a documents state of mind — the key to doing solid watchdog work on a beat. We’ll explore key records on a variety of beats and give practical tips on using open records laws. We’ll give you a checklist of what to know before you make a request and advice on wording your requests for documents and data.