Name: Scraping the unscrapable: advanced approaches to deal with complex sites and evade anti-scraping systems
Start: 2024-06-01T11:15:00+0200
End: 2024-06-01T12:30:00+0200

Scraped data can often be the backbone of an investigation, but some websites are harder to scrape than others. The difficulty may come from the sheer volume of data you need, or from the way the site is built, whether it resists scraping by accident or by design. This session will cover best practices for dealing with tricky sites, including coping with CAPTCHAs, using proxies and other scraping services, and scaling up your scraping in the cloud. This is an advanced session aimed at people who already have experience writing code to scrape websites and want to move up to the next level. Participants will leave with an understanding of how to approach hard-to-scrape websites, and of the tradeoffs and costs involved.

Speakers

Max Harlow

Financial Times

Max Harlow works on the visual and data journalism team at the Financial Times, focusing on investigations. He also runs Journocoders, a group for journalists to develop technical skills for use in their reporting.

Saturday June 1, 2024 11:15am - 12:30pm CEST
2.10

Data skills, Hands-on

Dataharvest 2024 - the European Investigative Journalism Conference
