Jan Curn

How to scrape modern websites to feed AI agents

What if your AI could find and use new tools on its own? See how dynamic tool discovery creates powerful agents that can scrape the modern web.

#1 · about 1 minute

Why web data is essential for training large language models

LLMs are trained on massive web datasets like Common Crawl, but static training data leads to knowledge cutoffs and to hallucinations when models are asked about anything newer.

#2 · about 2 minutes

How RAG provides LLMs with up-to-date context

Retrieval-Augmented Generation (RAG), a form of context engineering, feeds external, live data to LLMs so they can produce more accurate and timely answers.
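
As a minimal sketch of the RAG pattern (not from the talk): retrieve the chunks most relevant to the question and prepend them to the prompt. Here the retrieval is a toy keyword match over an in-memory list, and call_llm is a placeholder for a real chat-completion call.

```python
# Minimal RAG sketch: keep a small in-memory "knowledge base", pick the
# chunks that best match the question, and prepend them to the prompt.
# In a real system the keyword match would be a vector-database query and
# call_llm() would hit an actual chat-completion API (placeholder here).

DOCS = [
    "Apify Website Content Crawler extracts page text as markdown.",
    "Pinecone stores embeddings and returns the most similar vectors.",
    "robots.txt tells crawlers which paths they may fetch.",
]

def retrieve(question: str, top_k: int = 2) -> list[str]:
    words = set(question.lower().split())
    ranked = sorted(DOCS, key=lambda d: len(words & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your LLM provider's chat-completion call.
    return f"[LLM would answer based on this prompt]\n{prompt}"

def answer_with_rag(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"Use only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return call_llm(prompt)

print(answer_with_rag("What does robots.txt do?"))
```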

#3 · about 3 minutes

Navigating the complexities of modern web scraping

Modern websites use dynamic JavaScript rendering and anti-bot measures, requiring headless browsers, proxies, and CAPTCHA solvers to access data.
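
A common way to handle JavaScript rendering is a headless browser. The sketch below uses Playwright with a proxy, where the proxy address and target URL are placeholders; the talk does not prescribe a specific library.

```python
# Render a JavaScript-heavy page with headless Chromium via Playwright,
# routing traffic through a proxy. Proxy address and URL are placeholders.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        proxy={"server": "http://my-proxy.example.com:8000"},  # placeholder proxy
    )
    page = browser.new_page()
    page.goto("https://example.com", wait_until="networkidle")  # let client-side JS finish
    html = page.content()  # fully rendered HTML, not just the initial response
    browser.close()

print(len(html), "characters of rendered HTML")
```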

#4 · about 2 minutes

Cleaning messy HTML and scaling data extraction

To avoid the 'garbage in, garbage out' problem, you must clean HTML by removing cookie banners and ads, and manage complexities like sitemaps and robots.txt.
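
A rough sketch of that clean-up step, assuming BeautifulSoup for the HTML and the standard-library robots.txt parser; the CSS selectors for cookie banners and ads are illustrative, since the real ones are site-specific.

```python
# Strip navigation, scripts, ads and cookie banners from raw HTML, and
# check robots.txt before crawling a path. Selectors are illustrative.
import urllib.robotparser
from bs4 import BeautifulSoup

def clean_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that carry no useful content for an LLM.
    for tag in soup(["script", "style", "nav", "footer", "header", "aside"]):
        tag.decompose()
    # Example selectors for cookie banners and ads (site-specific in practice).
    for node in soup.select('[id*="cookie"], [class*="cookie"], [class*="ad-"]'):
        node.decompose()
    return soup.get_text(separator="\n", strip=True)

def allowed_to_crawl(base_url: str, path: str, user_agent: str = "*") -> bool:
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(base_url.rstrip("/") + "/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, base_url.rstrip("/") + path)
```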

#5 · about 3 minutes

Demo of scraping a website with Apify Actors

A demonstration shows how to use the Apify Website Content Crawler to perform a deep crawl of a website and extract its content into markdown.
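
The same crawl can be driven from code with the apify-client Python library, roughly as sketched below; the API token, start URL, and the exact input field names for the Website Content Crawler are assumptions to verify against the Actor's input schema.

```python
# Run the Website Content Crawler via the apify-client Python library.
# The token, start URL and input field names below are assumptions.
from apify_client import ApifyClient

client = ApifyClient("APIFY_TOKEN")  # your Apify API token

# Start the Actor and wait for the deep crawl to finish.
run = client.actor("apify/website-content-crawler").call(
    run_input={
        "startUrls": [{"url": "https://docs.example.com"}],  # placeholder site
        "maxCrawlPages": 50,  # assumed field name; limits the crawl size
    }
)

# Each dataset item holds one crawled page, including its extracted text/markdown.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("url"), len(item.get("markdown", "") or item.get("text", "")))
```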

#6 · about 2 minutes

Building a RAG chatbot with scraped data and Pinecone

The scraped website data is uploaded to a Pinecone vector database, enabling a chatbot to answer questions using the site's specific content.
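
A sketch of that pipeline using the pinecone and openai Python clients; the index name, embedding model, and metadata layout are assumptions rather than the talk's exact setup.

```python
# Embed each scraped page, upsert into Pinecone, then answer a question
# from the most similar chunks. Index name and model are assumptions.
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()                  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key="PINECONE_API_KEY")
index = pc.Index("website-content")       # assumed, pre-created index

def embed(text: str) -> list[float]:
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

# Upsert scraped pages (url -> markdown) as vectors, keeping the text as metadata.
pages = {"https://docs.example.com/intro": "Intro page markdown ..."}
index.upsert(vectors=[
    {"id": url, "values": embed(text), "metadata": {"text": text}}
    for url, text in pages.items()
])

# Retrieve context for a question and hand it to the chat model.
question = "What does the intro page cover?"
results = index.query(vector=embed(question), top_k=3, include_metadata=True)
context = "\n\n".join(m.metadata["text"] for m in results.matches)
answer = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
)
print(answer.choices[0].message.content)
```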

#7 · about 1 minute

Using the Model Context Protocol for AI agent integration

The Model Context Protocol (MCP) gives AI agents a dynamic interface for discovering and calling tools at runtime, in contrast to traditional APIs with static, predefined integrations.
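
From the client side, that dynamic interface looks roughly like the sketch below, built on the MCP Python SDK: connect to a server, list whatever tools it currently exposes, and call one. The server command and tool name are placeholders.

```python
# Connect to an MCP server, discover its tools at runtime, and call one.
# The server command and the tool name/arguments are placeholders.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    server = StdioServerParameters(command="my-mcp-server", args=[])  # placeholder server
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()   # discovered at runtime,
            for tool in tools.tools:             # not hard-coded like a typical API client
                print(tool.name, "-", tool.description)
            # Call whichever tool fits the task (name and arguments are illustrative).
            result = await session.call_tool("search", arguments={"query": "example"})
            print(result)

asyncio.run(main())
```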

#8 · about 3 minutes

Demo of dynamic tool discovery using MCP

An AI agent uses MCP to dynamically search the Apify store for a Twitter scraper, add it to its context, and then use it to fetch live data.
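
A sketch of that discovery flow against Apify's MCP server; the server package name, tool names, and arguments are assumptions, since in practice the agent reads the real names from list_tools() at runtime.

```python
# Discovery flow: connect to Apify's MCP server, search the store for a
# scraper, then call it. Package name, tool names and arguments are assumed.
import asyncio
import os

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    server = StdioServerParameters(
        command="npx",
        args=["-y", "@apify/actors-mcp-server"],              # assumed package name
        env={"APIFY_TOKEN": os.environ["APIFY_TOKEN"]},
    )
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # 1. Discover what the server can do right now.
            tools = await session.list_tools()
            print([t.name for t in tools.tools])

            # 2. Ask the server to find a suitable Actor (tool name assumed).
            found = await session.call_tool("search-actors", arguments={"search": "twitter scraper"})

            # 3. Call the discovered Actor to fetch live data (names illustrative).
            data = await session.call_tool("apify/twitter-scraper", arguments={"searchTerms": ["AI agents"]})
            print(found, data)

asyncio.run(main())
```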
