Documentation - Data Extraction

Extract data with CSS selector

You can also discover this feature using our Postman collection, which covers all of ScrapingBee's features.

Basic usage

If you want to extract data from pages and don't want to parse the HTML on your side, you can add extraction rules to your API call.

The simplest way to use extraction rules is the following format:

{"key_name" : "css_selector"}

For example, if you wish to extract the title and subtitle of our blog, you will need to use these rules:

{
    "title" : "h1",
    "subtitle" : "#subtitle"
}

And this will be the JSON response

{
    "title" : "The ScrapingBee Blog",
    "subtitle" : "We help you get better at web-scraping: detailed tutorial, case studies and writing by industry experts"
}

Important: extraction rules are JSON formatted, and in order to pass them to a GET request, you need to stringify them.
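In Python, for example, the stringification can be sketched with the standard library alone; the endpoint URL and parameter names below mirror the examples on this page:

```python
import json
from urllib.parse import urlencode

# Extraction rules as a Python dict.
rules = {"title": "h1", "subtitle": "#subtitle"}

# json.dumps stringifies the rules; urlencode then makes the whole
# query string safe to append to the GET request URL.
query = urlencode({
    "api_key": "YOUR-API-KEY",
    "url": "https://www.scrapingbee.com/blog",
    "extract_rules": json.dumps(rules),
})

request_url = "https://app.scrapingbee.com/api/v1/?" + query
```

The official client libraries shown below handle this stringification for you.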

Here is how to extract the above information in your favorite language.

# Install the Python ScrapingBee library:
# pip install scrapingbee

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR-API-KEY')
response = client.get(
    'https://www.scrapingbee.com/blog',
    params={
        'extract_rules':{"title": "h1", "subtitle": "#subtitle"},
    },
)
print('Response HTTP Status Code: ', response.status_code)
print('Response HTTP Response Body: ', response.content)
// Node.js example using Axios
const axios = require('axios');

axios.get('https://app.scrapingbee.com/api/v1', {
    params: {
        'api_key': 'YOUR-API-KEY',
        'url': 'https://www.scrapingbee.com/blog',
        'extract_rules': '{"title":"h1","subtitle":"#subtitle"}',
    }
}).then(function (response) {
    // handle success
    console.log(response);
})
require 'net/http'
require 'net/https'
require 'uri'

# Classic (GET)
def send_request
    extract_rules = URI.encode_www_form_component('{"title": "h1", "subtitle": "#subtitle"}')
    uri = URI('https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=https://www.scrapingbee.com/blog&extract_rules=' + extract_rules)

    # Create client
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_PEER

    # Create Request
    req =  Net::HTTP::Get.new(uri)

    # Fetch Request
    res = http.request(req)
    puts "Response HTTP Status Code: #{ res.code }"
    puts "Response HTTP Response Body: #{ res.body }"
rescue StandardError => e
    puts "HTTP Request failed (#{ e.message })"
end

send_request()
<?php

// get cURL resource
$ch = curl_init();

// set url
$extract_rules = urlencode('{"title": "h1", "subtitle": "#subtitle"}');

curl_setopt($ch, CURLOPT_URL, 'https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=https://www.scrapingbee.com/blog&extract_rules=' . $extract_rules);

// set method
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');

// return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);



// send the request and save response to $response
$response = curl_exec($ch);

// stop if fails
if (!$response) {
  die('Error: "' . curl_error($ch) . '" - Code: ' . curl_errno($ch));
}

echo 'HTTP Status Code: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . PHP_EOL;
echo 'Response Body: ' . $response . PHP_EOL;

// close curl resource to free up system resources
curl_close($ch);

?>
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"net/url"
)

func sendClassic() {
	// Create client
	client := &http.Client{}


	// Stringify rules
	extract_rules := url.QueryEscape(`{"title": "h1", "subtitle": "#subtitle"}`)

	// Create request
	req, err := http.NewRequest("GET", "https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=https://www.scrapingbee.com/blog&extract_rules="+extract_rules, nil)
	if err != nil {
		fmt.Println("Failure : ", err)
		return
	}

	// Fetch Request
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("Failure : ", err)
		return
	}
	defer resp.Body.Close()

	// Read Response Body
	respBody, _ := ioutil.ReadAll(resp.Body)

	// Display Results
	fmt.Println("response Status : ", resp.Status)
	fmt.Println("response Headers : ", resp.Header)
	fmt.Println("response Body : ", string(respBody))
}

func main() {
	sendClassic()
}

Please note that using:

{
    "title" : "h1",
}

Is the same as using:

{
    "title" : {
        "selector": "h1",
        "output": "text",
        "type": "item"
    }
}
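
This defaulting can be sketched in Python; `normalize_rule` is a hypothetical helper written for illustration, not part of the API:

```python
# Hypothetical helper illustrating the defaults: a bare string selector
# expands to the full form with output="text" and type="item".
def normalize_rule(rule):
    if isinstance(rule, str):
        rule = {"selector": rule}
    rule.setdefault("output", "text")
    rule.setdefault("type", "item")
    return rule

print(normalize_rule("h1"))
# {'selector': 'h1', 'output': 'text', 'type': 'item'}
```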

Below are more details about all those different options.



Output Format

output [ text | html | @...] (default= text)

For a given selector, you can extract different kinds of data using the output option:

  • text: text content of selector (default)
  • html: HTML content of selector
  • @...: attribute of selector (prefixed by @)

Below is an example of different output options using the same selector.

{
    "title_text" : {
        "selector": "h1",
        "output": "text"
    },
    "title_html" : {
        "selector": "h1",
        "output": "html"
    },
    "title_id" : {
        "selector": "h1",
        "output": "@id"
    }
}

The information extracted by the above rules on ScrapingBee's blog page will be

{
    "title_text": "The ScrapingBee Blog",
    "title_html": "<h1 id=\"the-scrapingbee-blog\">The <a href=\"https://www.scrapingbee.com/\">ScrapingBee</a> Blog</h1>",
    "title_id": "the-scrapingbee-blog"
}


Single element or list

type [ item | list ] (default= item)

By default, we will return the first HTML element that matches the selector. If you want to get all elements matching the selector, you should use the type option. type can be:

  • item: return the first element matching the selector (default)
  • list: return a list of all elements matching the selector

Here is an example for extracting post titles from our blog.

{
    "first_post_title" : {
        "selector": ".post-title",
        "type": "item"
    },
    "all_post_title" : {
        "selector": ".post-title",
        "type": "list"
    }
}

The information extracted by the above rules on ScrapingBee's blog page would be

{
  "first_post_title": "  Block ressources with Puppeteer - (5min)",
  "all_post_title": [
    "  Block ressources with Puppeteer - (5min)",
    "  Web Scraping vs Web Crawling: Ultimate Guide - (10min)",
    ...
    "  Scraping E-Commerce Product Data - (6min)",
    "  Introduction to Chrome Headless with Java - (4min)"
  ]
}


Clean Text

clean [ true | false ] (default= true)

By default, the content returned will be cleaned: we remove trailing spaces and empty characters ('\n', '\t', etc.) from it. If you don't want to enable this behavior, you should use the clean: false option.

Here is an example for extracting post description from our blog using "clean": true.

{
    "first_post_description" : {
        "selector": ".card > div",
        "clean": true
    }
}

The information extracted by the above rules on ScrapingBee's blog page would be

{
    "first_post_description": "How to Use a Proxy with Python Requests? - (7min) By Maxine Meurer 13 October 2021 In this tutorial we will see how to use a proxy with the Requests package. We will also discuss on how to choose the right proxy provider.read more",
}

If you use "clean": false.

{
    "first_post_description" : {
        "selector": ".card > div",
        "clean": false
    }
}

You would get this result instead:

{
    "first_post_description": "\n                How to Use a Proxy with Python Requests? - (7min)\n        \n            \n            \n            By Maxine Meurer\n            \n            \n            13 October 2021\n            \n        \n        In this tutorial we will see how to use a proxy with the Requests package. We will also discuss on how to choose the right proxy provider.\n        read more\n        ",
}
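
The cleaning step can be approximated locally; this sketch collapses whitespace the way the clean option does in the example above, and is an approximation, not the exact server-side implementation:

```python
def clean_text(raw):
    # Collapse runs of whitespace ('\n', '\t', spaces) into single
    # spaces and strip leading/trailing whitespace.
    return " ".join(raw.split())

raw = "\n    How to Use a Proxy with Python Requests? - (7min)\n        \n    read more\n    "
print(clean_text(raw))
# How to Use a Proxy with Python Requests? - (7min) read more
```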

Extract nested items

It is also possible to add extraction rules inside the output option in order to create powerful extractors.

Here are the rules that would extract general information and all blog post details from ScrapingBee's blog.

{
    "title" : "h1",
    "subtitle" : "#subtitle",
    "articles": {
        "selector": ".card",
        "type": "list",
        "output": {
            "title": ".post-title",
            "link": {
                "selector": ".post-title",
                "output": "@href"
            },
            "description": ".post-description"
        }
    }
}

The information extracted by the above rules on ScrapingBee's blog page would be

{
  "title": "The ScrapingBee Blog",
  "subtitle": " We help you get better at web-scraping: detailed tutorial, case studies and \n                        writing by industry experts",
  "articles": [
    {
      "title": "  Block ressources with Puppeteer - (5min)",
      "link": "https://www.scrapingbee.com/blog/block-requests-puppeteer/",
      "description": "This article will show you how to intercept and block requests with Puppeteer using the request interception API and the puppeteer extra plugin."
    },
    ...
    {
      "title": "  Web Scraping vs Web Crawling: Ultimate Guide - (10min)",
      "link": "https://www.scrapingbee.com/blog/scraping-vs-crawling/",
      "description": "What is the difference between web scraping and web crawling? That's exactly what we will discover in this article, and the different tools you can use."
    }
  ]
}

Common use cases

Below you will find common extraction rules often used by our users.

Extract all links from a page

For SEO purposes, lead generation, or simply data harvesting, it can be useful to quickly extract all links from a single page.

The following extract_rules will allow you to do that with one simple API call:

{
    "all_links" : {
        "selector": "a",
        "type": "list",
        "output": "@href"
    }
}

The JSON response will be as follows:

{
    "all_links": [
        "https://www.scrapingbee.com/",
        ...,
        "https://www.scrapingbee.com/api-store/"
    ]
}

If you wish to extract both the href and the anchor text of links, you can use these rules instead:

{
    "all_links" : {
        "selector": "a",
        "type": "list",
        "output": {
            "anchor": "a",
            "href": {
                "selector": "a",
                "output": "@href"
            }
        }
    }
}

The JSON response will be as follows:

{
   "all_links":[
      {
         "anchor":"Blog",
         "href":"https://www.scrapingbee.com/blog/"
      },
      ...
      {
         "anchor":" Linkedin ",
         "href":"https://www.linkedin.com/company/26175275/admin/"
      }
   ]
}

Extract all text from a page

If you need to get all the text of a web page, and only the text, meaning no HTML tags or attributes, you can use those rules:

{
    "text": "body"
}

For example, using those rules with this ScrapingBee landing page returns this result:

{
    "text": "Login Sign Up Pricing FAQ Blog Other Features Screenshots Google search API Data extraction JavaScript scenario No code scraping with Integromat Documentation Tired of getting blocked while scraping the web? ScrapingBee API handles headless browsers and rotates proxies for you. Try ScrapingBee for Free based on 25+ reviews. Render your web page as if it were a real browser. We manage thousands of headless instances using the latest Chrome version. Focus on extracting the data you need, and not dealing with concurrent headless browsers that will eat up all your RAM and CPU. Latest Chrome version Fast, no matter what! ScrapingBee simplified our day-to-day marketing and engineering operations a lot . We no longer have to worry about managing our own fleet of headless browsers, and we no longer have to spend days sourcing the right proxy provider Mike Ritchie CEO @ SeekWell Javascript Rendering We render Javascript with a simple parameter so you can scrape every website, even Single Page Applications using React, AngularJS, Vue.js or any other libraries. Execute custom JS snippet Custom wait for all JS to be executed ScrapingBee is helping us scrape many job boards and company websites without having to deal with proxies or chrome browsers. It drastically simplified our data pipeline Russel Taylor CEO @ HelloOutbound Rotating Proxies Thanks to our large proxy pool, you can bypass rate limiting website, lower the chance to get blocked and hide your bots! Large proxy pool Geotargeting Automatic proxy rotation ScrapingBee clear documentation, easy-to-use API, and great success rate made it a no-brainer. Dominic Phillips Co-Founder @ CodeSubmit Three specific ways to use ScrapingBee How our customers use our API: 1. ..."
}

Extract all email addresses from a page

If you need to get all the email addresses of a web page you can use those rules:

{
    "email_addresses": {
        "selector": "a[href^='mailto']",
        "output": "@href",
        "type": "list"
    }
}

Using those rules with this ScrapingBee landing page returns this result:

{
    "email_addresses": [
        "mailto:contact@scrapingbee.com"
    ]
}

How does this work?

First, we target all anchors (a tags) that have an href attribute starting with the string mailto, then we extract only the href attribute. And since we want all email addresses on the page and not just one, we use the type list (on ScrapingBee's landing page there is just one email address anyway).
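
The same selection logic can be reproduced locally with Python's standard-library HTML parser; this is a sketch of what the rule does, not ScrapingBee's implementation:

```python
from html.parser import HTMLParser

class MailtoCollector(HTMLParser):
    """Collects href attributes of <a> tags whose href starts with 'mailto'."""
    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith("mailto"):
                self.addresses.append(href)

page = '<a href="mailto:contact@scrapingbee.com">Contact</a> <a href="/blog">Blog</a>'
parser = MailtoCollector()
parser.feed(page)
print(parser.addresses)  # ['mailto:contact@scrapingbee.com']
```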

Limitation

Those rules will only work for links whose href attribute contains mailto. If the email addresses on the page are just plain text or simple anchors, you should either extract all the text on the page and run a regular expression over it, or extract all links on the page and filter for email addresses on your side.
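
For the plain-text case, a regular expression over the extracted text is the usual fallback. This sketch uses a deliberately loose pattern that covers common addresses, not every RFC-valid form:

```python
import re

# Simple email pattern; intentionally loose, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

page_text = "Questions? Write to contact@scrapingbee.com or sales@example.org."
emails = EMAIL_RE.findall(page_text)
print(emails)  # ['contact@scrapingbee.com', 'sales@example.org']
```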