

Scrape any page and get formatted data

The Scrape API lets you extract the data you want from a web page in a single call. You can scrape page content and capture its data in various formats. For detailed usage, check out the Scrape API Reference.

Basic Markdown Scraping

The simplest way to scrape a webpage is to extract its content as markdown. This is useful when you want to preserve the page’s structure and formatting.
simple_scrape.py
from notte_sdk import NotteClient

client = NotteClient()
markdown = client.scrape(
    url="https://www.notte.cc",
    only_main_content=True,
)
print(markdown)

Structured Data Extraction

For more sophisticated use cases, you can extract structured data from web pages by defining a schema using Pydantic models. This is particularly useful when you need to extract specific information like product details, pricing plans, or article metadata.

Example: Extracting Pricing Plans from notte.cc

Let’s say you want to extract pricing information from a website. First, define your data models, then use them to extract the structured data:
structured_scrape.py
from notte_sdk import NotteClient
from pydantic import BaseModel


class PricingPlan(BaseModel):
    name: str
    price_per_month: int | None = None
    features: list[str]


class PricingPlans(BaseModel):
    plans: list[PricingPlan]


client = NotteClient()

# plans is a PricingPlans instance directly
# > note that scrape() can raise ScrapeFailedError if extraction fails
plans = client.scrape(
    url="https://www.notte.cc",
    instructions="Extract the pricing plans from the page",
    response_format=PricingPlans,
)

Agent Scraping

Agent Scraping is a more powerful way to scrape web pages. It allows you to navigate through the page, fill forms, and extract data from dynamic content.
agent_scrape.py
from notte_sdk import NotteClient
from pydantic import BaseModel


class LinkedInConversation(BaseModel):
    recipient: str
    messages: list[str]


client = NotteClient()
vault = client.Vault(vault_id="<your-vault-id>")

with client.Session() as session:
    agent = client.Agent(session=session, vault=vault, max_steps=15)
    response = agent.run(
        task="Go to linkedin.com, log in with the credentials and extract the last 10 messages from my most recent conversation",
        response_format=LinkedInConversation,
    )
print(response.answer)

Topics & Tips

Scrape API vs Agent Scrape

Scrape API

Perfect for:
  1. One-off scraping tasks
  2. Simple data extraction
  3. Static content

Agent Scrape

Perfect for:
  1. Authentication or login flows
  2. Form filling and submission
  3. Dynamic content

Response Format Best Practices

Use response_format whenever possible; it yields the best and most reliable results.
Tips for designing schemas:
  • Try a few different schemas to find what works best
  • If you ask for a company_name field but there is no company_name on the page, LLM scraping will fail
  • Design your schema carefully based on the actual content structure
  • Response format is available for both scrape and agent.run
Example of good schema design:
schema_design.py
from pydantic import BaseModel


class Product(BaseModel):
    product_url: str
    name: str
    price: float | None = None
    description: str | None = None
    image_url: str | None = None


class ProductList(BaseModel):
    products: list[Product]