Lars Kölker

Data is Key: Scraping Metadata from Websites

Stop writing brittle, site-specific scrapers. Learn to parse structured metadata and treat the web as one giant, queryable API.

#1 about 2 minutes

How social media sites generate link previews

Social media platforms fetch a shared URL and read metadata that is not visible on the page, such as its title and description, to turn a plain link into a rich preview.

#2 about 1 minute

Defining web scraping and its primary use cases

Web scraping is the practice of gathering data from websites without an API, often used when APIs are missing, rate-limited, or too expensive.

#3 about 2 minutes

Why CSS selector-based scraping is brittle

Relying on specific CSS selectors for scraping creates a fragile solution that is tied to a single site and breaks whenever the source code changes.
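To make the fragility concrete, here is a minimal sketch that looks up a price by class name using only the Python standard library. The markup, the class names (`product-price`, `price--current`), and the helper `text_by_class` are all hypothetical and are not taken from the talk; the point is only that a renamed class silently turns a working lookup into `None`.

```python
from html.parser import HTMLParser


class ClassTextFinder(HTMLParser):
    """Collects the text of the first element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self._capturing = False
        self.text = None

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if self.text is None and self.css_class in classes:
            self._capturing = True

    def handle_data(self, data):
        # Keep only the first text node after the matching start tag.
        if self._capturing:
            self.text = data.strip()
            self._capturing = False


def text_by_class(html, css_class):
    finder = ClassTextFinder(css_class)
    finder.feed(html)
    return finder.text


# Hypothetical markup before and after a site redesign.
OLD_MARKUP = '<div class="product"><span class="product-price">19.99</span></div>'
NEW_MARKUP = '<div class="item"><span class="price--current">19.99</span></div>'

print(text_by_class(OLD_MARKUP, "product-price"))  # 19.99
print(text_by_class(NEW_MARKUP, "product-price"))  # None
```

The price is still on the redesigned page, but the scraper no longer finds it, and nothing raises an error to signal the breakage.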

#4 about 4 minutes

Generic scraping with schema.org and JSON-LD

Schema.org provides a standardized vocabulary for structured data, enabling the creation of generic scrapers using formats like JSON-LD.
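As an illustration, JSON-LD is embedded in a page inside `<script type="application/ld+json">` elements, so a generic extractor only needs to find those elements and parse their contents as JSON. The sketch below uses the standard library; the sample page and the function name `extract_json_ld` are assumptions made for this example, not code from the talk.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects every JSON-LD object found in script tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # Skip malformed blocks instead of failing the page.
            # A block may hold one object or a list of objects.
            self.items.extend(parsed if isinstance(parsed, list) else [parsed])


def extract_json_ld(html):
    parser = JsonLdParser()
    parser.feed(html)
    return parser.items


# Hypothetical product page.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example Mug",
 "offers": {"@type": "Offer", "price": "12.50", "priceCurrency": "EUR"}}
</script>
</head><body></body></html>
"""

items = extract_json_ld(SAMPLE_HTML)
print(items[0]["@type"], items[0]["offers"]["price"])
```

Because the property names come from the schema.org vocabulary rather than from one site's markup, the same function can be pointed at any page that publishes JSON-LD.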

#5 about 5 minutes

Using meta tags for structured data extraction

Protocols like Open Graph (OGP) and Twitter Cards extend standard HTML meta tags to provide rich, structured metadata for social sharing and scraping.
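A rough sketch of meta tag extraction follows. Open Graph tags conventionally use the `property` attribute (for example `og:title`), while Twitter Cards and plain HTML meta tags use `name`, so the parser accepts either. The sample markup and the function name `extract_meta_tags` are assumptions for illustration only.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <meta> tags keyed by their property or name attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        content = attr.get("content")
        if key and content is not None:
            # Keep the first occurrence if a key is repeated.
            self.meta.setdefault(key, content)


def extract_meta_tags(html):
    parser = MetaTagParser()
    parser.feed(html)
    return parser.meta


# Hypothetical page head.
SAMPLE_HEAD = """
<head>
  <meta name="description" content="Plain HTML description">
  <meta property="og:title" content="Example Article">
  <meta property="og:image" content="https://example.com/cover.png">
  <meta name="twitter:card" content="summary_large_image">
</head>
"""

tags = extract_meta_tags(SAMPLE_HEAD)
print(tags["og:title"], tags["twitter:card"])
```

The result is a flat dictionary, which keeps it easy to fall back from one protocol to another when a site only implements some of them.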

#6 about 4 minutes

The oEmbed protocol for embedded content

The oEmbed protocol offers a standardized endpoint for retrieving embeddable representations of a URL, which is essential for sites like Instagram.
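As a sketch of how a consumer might work with oEmbed: the specification lets a page advertise its endpoint through a `<link>` tag of type `application/json+oembed`, and a consumer request carries the target address in a `url` query parameter, optionally with `format` and `maxwidth`. The example below only discovers the link and builds a request URL without making any network call. The `example.com` endpoint and the function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlencode, urlsplit


class OEmbedDiscoveryParser(HTMLParser):
    """Finds the first <link type="application/json+oembed"> discovery tag."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if (self.href is None and tag == "link"
                and attr.get("type") == "application/json+oembed"):
            self.href = attr.get("href")


def discover_oembed_href(html):
    parser = OEmbedDiscoveryParser()
    parser.feed(html)
    return parser.href


def build_oembed_url(endpoint, target_url, maxwidth=None):
    """Build a consumer request for a known endpoint (no request is sent)."""
    params = {"url": target_url, "format": "json"}
    if maxwidth is not None:
        params["maxwidth"] = str(maxwidth)
    return endpoint + "?" + urlencode(params)


# Hypothetical page that advertises its oEmbed endpoint.
SAMPLE_PAGE = (
    '<head><link rel="alternate" type="application/json+oembed" '
    'href="https://example.com/oembed?url=https%3A%2F%2Fexample.com%2Fvideo%2F42">'
    "</head>"
)

discovered = discover_oembed_href(SAMPLE_PAGE)
request_url = build_oembed_url(
    "https://example.com/oembed", "https://example.com/video/42", maxwidth=480
)
query = parse_qs(urlsplit(request_url).query)
print(discovered)
print(query["url"][0], query["maxwidth"][0])
```

Fetching the resulting URL would return a JSON document whose `type` field (`photo`, `video`, `link`, or `rich`) tells the consumer how to interpret the remaining fields, such as the `html` snippet to embed.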

#7 about 1 minute

Showcasing a powerful multi-protocol scraper

A demonstration shows how combining different scraping techniques can extract rich information, including product prices and author images, from various websites.
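One plausible way to combine the techniques is a simple fallback chain, sketched below under the assumption that each source has already been parsed into a plain dictionary. The priority order (JSON-LD first, then meta tags, then oEmbed) and the function names are choices made for this example and are not necessarily how the scraper in the demonstration works.

```python
def first_present(*candidates):
    """Return the first candidate that is neither None nor empty."""
    for value in candidates:
        if value:
            return value
    return None


def merge_metadata(json_ld, meta_tags, oembed):
    """Merge three metadata sources into one record using a fixed priority."""
    offer = json_ld.get("offers") or {}
    if isinstance(offer, list):
        offer = offer[0] if offer else {}
    author = json_ld.get("author")
    author_image = author.get("image") if isinstance(author, dict) else None
    return {
        "title": first_present(
            json_ld.get("name"), meta_tags.get("og:title"),
            meta_tags.get("twitter:title"), oembed.get("title"),
        ),
        "description": first_present(
            json_ld.get("description"), meta_tags.get("og:description"),
            meta_tags.get("description"),
        ),
        "image": first_present(
            json_ld.get("image"), meta_tags.get("og:image"),
            oembed.get("thumbnail_url"),
        ),
        "price": offer.get("price"),
        "currency": offer.get("priceCurrency"),
        "author_image": author_image,
    }


# Hypothetical inputs, as they might come out of earlier parsing steps.
json_ld = {
    "@type": "Product",
    "name": "Example Mug",
    "offers": {"price": "12.50", "priceCurrency": "EUR"},
    "author": {"@type": "Person", "name": "Jane Doe",
               "image": "https://example.com/jane.png"},
}
meta_tags = {
    "og:title": "Example Mug | Example Shop",
    "og:description": "A sturdy mug.",
    "og:image": "https://example.com/mug.png",
}
oembed = {}

record = merge_metadata(json_ld, meta_tags, oembed)
print(record["title"], record["price"], record["currency"])
```

Fields that only one protocol can supply, such as the price, come straight from that source, while shared fields like the title fall through the chain until a value is found.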

#8 about 3 minutes

Q&A on legality, rate limits, and frameworks

The speaker addresses audience questions regarding the legality of scraping, managing rate limits, and recommended frameworks like Beautiful Soup.
