Lars Kölker
Data is Key: Scraping Metadata from Websites
#1 about 2 minutes
How social media sites generate link previews
Social media platforms scrape hidden metadata like titles and descriptions from URLs to transform a simple link into a rich preview.
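As a rough illustration of the simplest form of this lookup, the sketch below reads only the page title and the standard description meta tag using Python's standard library. The class name PreviewParser and the sample page are hypothetical, and real preview generators are likely to consult more sources than these two.

```python
from html.parser import HTMLParser


class PreviewParser(HTMLParser):
    """Collects the <title> text and the <meta name="description"> content."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._title_parts = []
        self.description = None

    @property
    def title(self):
        text = "".join(self._title_parts).strip()
        return text or None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            # Keep only the first description found on the page.
            if name == "description" and self.description is None:
                self.description = attributes.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text can arrive in several chunks, so collect all of them.
        if self._in_title:
            self._title_parts.append(data)


# Hypothetical sample page used only for illustration.
SAMPLE_PAGE = """
<html><head>
  <title>Blue Kettle | Example Shop</title>
  <meta name="description" content="A 1.5 litre stovetop kettle.">
</head><body><p>Visible content</p></body></html>
"""

preview = PreviewParser()
preview.feed(SAMPLE_PAGE)
print(preview.title, "/", preview.description)
```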
#2 about 1 minute
Defining web scraping and its primary use cases
Web scraping is the practice of gathering data from websites without an API, often used when APIs are missing, rate-limited, or too expensive.
#3 about 2 minutes
Why CSS selector-based scraping is brittle
Relying on specific CSS selectors for scraping creates a fragile solution that is tied to a single site and breaks whenever the source code changes.
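The fragility can be sketched in a few lines. The example below is an assumption-laden stand-in for a selector such as ".product-title": it looks up an element by class name with the standard library and shows that a purely cosmetic class rename makes the scraper return nothing. The class names and markup are invented for illustration.

```python
from html.parser import HTMLParser


class ClassTextFinder(HTMLParser):
    """Grabs the text of the first element that carries a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self._capturing = False
        self.text = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.text is None and self.class_name in classes:
            self._capturing = True

    def handle_data(self, data):
        if self._capturing:
            self.text = data.strip()
            self._capturing = False


def title_by_class(html, class_name="product-title"):
    """Site-specific lookup: works only while the class name stays the same."""
    finder = ClassTextFinder(class_name)
    finder.feed(html)
    return finder.text


# Hypothetical markup before and after a redesign of the same page.
OLD_MARKUP = '<h1 class="product-title">Blue Kettle</h1>'
NEW_MARKUP = '<h1 class="pdp-heading">Blue Kettle</h1>'

print(title_by_class(OLD_MARKUP))  # finds the title
print(title_by_class(NEW_MARKUP))  # finds nothing after the rename
```

The visible page is identical in both cases; only the class name changed, which is exactly the kind of change a site owner makes without warning.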
#4 about 4 minutes
Generic scraping with schema.org and JSON-LD
Schema.org provides a standardized vocabulary for structured data, enabling the creation of generic scrapers using formats like JSON-LD.
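A minimal sketch of the idea, assuming the page embeds its structured data in script tags of type application/ld+json: the parser below collects every such block and decodes it as JSON, with no knowledge of the site's layout. The class name JsonLdParser and the sample product are hypothetical, and this is not the speaker's implementation.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects every JSON-LD block found in a page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may be delivered in pieces, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks instead of failing the whole page.


# Hypothetical page fragment using schema.org's Product and Offer types.
SAMPLE_PAGE = """
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Blue Kettle",
  "offers": {"@type": "Offer", "price": "24.99", "priceCurrency": "EUR"}
}
</script>
"""

jsonld = JsonLdParser()
jsonld.feed(SAMPLE_PAGE)
product = jsonld.items[0]
print(product["@type"], product["name"], product["offers"]["price"])
```

Because the keys come from the shared schema.org vocabulary rather than from the site's markup, the same code can be pointed at any page that publishes JSON-LD.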
#5 about 5 minutes
Using meta tags for structured data extraction
Protocols like Open Graph (OGP) and Twitter Cards extend standard HTML meta tags to provide rich, structured metadata for social sharing and scraping.
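As a sketch of how these tags can be read, the parser below gathers meta tags whose key starts with "og:" or "twitter:". Open Graph tags conventionally use the property attribute while Twitter Cards use name, so both are checked. The class name MetaTagParser and the sample values are hypothetical.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects Open Graph and Twitter Card meta tags into a dictionary."""

    PREFIXES = ("og:", "twitter:")

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Open Graph uses property="og:..."; Twitter Cards use name="twitter:...".
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key and content is not None and key.startswith(self.PREFIXES):
            # Keep the first value if a key appears more than once.
            self.meta.setdefault(key, content)


# Hypothetical head section used only for illustration.
SAMPLE_HEAD = """
<head>
  <meta property="og:title" content="Blue Kettle">
  <meta property="og:image" content="https://example.com/kettle.jpg">
  <meta name="twitter:card" content="summary_large_image">
  <meta name="viewport" content="width=device-width">
</head>
"""

tags = MetaTagParser()
tags.feed(SAMPLE_HEAD)
print(tags.meta)
```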
#6 about 4 minutes
The oEmbed protocol for embedded content
The oEmbed protocol offers a standardized endpoint for retrieving embeddable representations of a URL, which is essential for sites like Instagram.
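The two client-side steps of oEmbed can be sketched as follows: discovering the endpoint from a link tag of type application/json+oembed, and building a request URL from the url, format, and maxwidth parameters defined by the protocol. The endpoint https://example.com/oembed and the helper names are assumptions for illustration; real providers publish their own endpoints and may require authentication.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class OEmbedDiscovery(HTMLParser):
    """Finds the first JSON oEmbed discovery link in a page, if any."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attributes = dict(attrs)
        rel_tokens = (attributes.get("rel") or "").split()
        if "alternate" in rel_tokens and attributes.get("type") == "application/json+oembed":
            self.href = attributes.get("href")


def discover_oembed(html):
    """Returns the advertised oEmbed URL, or None when the page has none."""
    finder = OEmbedDiscovery()
    finder.feed(html)
    return finder.href


def build_oembed_url(endpoint, target_url, max_width=None):
    """Builds a request URL for a known endpoint (the endpoint is an assumption)."""
    params = {"url": target_url, "format": "json"}
    if max_width is not None:
        params["maxwidth"] = max_width
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical page advertising its own oEmbed URL.
SAMPLE_HEAD = (
    '<link rel="alternate" type="application/json+oembed" '
    'href="https://example.com/oembed?url=https%3A%2F%2Fexample.com%2Fp%2F1">'
)

print(discover_oembed(SAMPLE_HEAD))
print(build_oembed_url("https://example.com/oembed", "https://example.com/p/1", max_width=480))
```

Fetching the resulting URL with a plain GET request would return a JSON document whose fields, such as type, title, and html, describe the embeddable representation.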
#7 about 1 minute
Showcasing a powerful multi-protocol scraper
A demonstration shows how combining different scraping techniques can extract rich information, including product prices and author images, from various websites.
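One plausible way to combine sources, sketched here under stated assumptions and not taken from the scraper shown in the talk, is a simple fallback chain: take each field from the most structured source that provides it. The function names, the priority order, and the input dictionaries are all hypothetical.

```python
def first_text(*candidates):
    """Returns the first candidate that is a non-empty string, else None."""
    for value in candidates:
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None


def combine_metadata(json_ld, open_graph, twitter, html_title=None):
    """Merges already-extracted metadata, preferring JSON-LD, then OGP, then Twitter."""
    offers = json_ld.get("offers") or {}
    if isinstance(offers, list):
        # A product may list several offers; this sketch keeps only the first.
        offers = offers[0] if offers else {}
    return {
        "title": first_text(
            json_ld.get("name"),
            open_graph.get("og:title"),
            twitter.get("twitter:title"),
            html_title,
        ),
        "description": first_text(
            json_ld.get("description"),
            open_graph.get("og:description"),
            twitter.get("twitter:description"),
        ),
        "image": first_text(
            json_ld.get("image"),
            open_graph.get("og:image"),
            twitter.get("twitter:image"),
        ),
        "price": first_text(offers.get("price")),
    }


# Hypothetical inputs, as they might come out of separate extractors.
combined = combine_metadata(
    json_ld={"name": "Blue Kettle", "offers": {"price": "24.99", "priceCurrency": "EUR"}},
    open_graph={
        "og:title": "Blue Kettle | Example Shop",
        "og:description": "A 1.5 litre stovetop kettle.",
        "og:image": "https://example.com/kettle.jpg",
    },
    twitter={},
    html_title="Example Shop",
)
print(combined)
```

No single protocol covered every field in this example: the price came from JSON-LD while the image and description came from Open Graph.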
#8 about 3 minutes
Q&A on legality, rate limits, and frameworks
The speaker addresses audience questions regarding the legality of scraping, managing rate limits, and recommended frameworks like Beautiful Soup.