Getting Started

What is this Library for?

This library provides a framework and many ready-to-use so-called steps that you can use as building blocks for your own crawlers and scrapers.

To give you an overview, here's a list of things that it helps you with:

Crawler Politeness 😇 (respecting robots.txt, throttling,...)
Load pages/resources (from URLs) using
- a (PSR-18) HTTP client (default is of course Guzzle)
- or a headless browser (Chrome) to get the source after JavaScript execution
Get absolute links from HTML documents 🔗
Get sitemaps from robots.txt and get all URLs from those sitemaps
Crawl (load) all pages of a website 🕷
Use cookies (or don't) 🍪
Use any HTTP method (GET, POST, ...) and send any headers or body
Iterate over paginated list pages 🔁
Extract data from:
- HTML and also XML (using CSS selectors or XPath queries)
- JSON (using dot notation)
- CSV (map columns)
Extract schema.org structured data in JSON-LD format from HTML documents
Keep memory usage low by using PHP Generators 💪
Cache HTTP responses during development, so you don't have to load pages again and again after every code change
Get logs about what your crawler is doing (accepts any PSR-3 LoggerInterface)
And a lot more...

What is the Difference between Crawling and Scraping?

Before diving into the library, let's have a look at the terms crawling and scraping. In most real-world use cases the two go hand in hand, which is why this library supports and combines both.

What is a Crawler?

Animated visualization of a crawling procedure

A (web) crawler is a program that (down)loads documents and follows the links in them to load the linked documents as well. A crawler could simply load every link it finds (and is allowed to load according to the robots.txt file); unless the URLs it starts with are dead ends, it would eventually load the whole internet. Alternatively, it can be restricted to load only links matching certain criteria (same domain/host, URL path starting with "/foo", ...) or only up to a certain depth. A depth of 3 means 3 levels deep: links found on the initial URLs provided to the crawler are level 1, links found on those level 1 pages are level 2, and so on.
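The depth and host restrictions described above can be pictured with a small breadth-first sketch in plain PHP. It does not use this library and is not how the library is implemented: the `crawl()` function and the in-memory `$linkGraph` (standing in for real pages and their links) are hypothetical and exist only for illustration.

```php
<?php

declare(strict_types=1);

// Hypothetical "web": each URL maps to the links found on that page.
// A real crawler would load the page over HTTP and parse its links instead.
$linkGraph = [
    'https://www.example.com/'      => ['https://www.example.com/foo', 'https://www.example.com/bar'],
    'https://www.example.com/foo'   => ['https://www.example.com/foo/1', 'https://other.example.org/'],
    'https://www.example.com/bar'   => ['https://www.example.com/'],
    'https://www.example.com/foo/1' => ['https://www.example.com/foo/2'],
    'https://www.example.com/foo/2' => [],
    'https://other.example.org/'    => [],
];

/**
 * Breadth-first crawl that stops at $maxDepth and optionally stays on one host.
 *
 * @param array<string, string[]> $linkGraph
 * @return array<string, int> every loaded URL mapped to the level it was found on
 */
function crawl(array $linkGraph, string $startUrl, int $maxDepth, ?string $host = null): array
{
    $levels = [$startUrl => 0]; // the initial URL is level 0
    $queue = [$startUrl];

    while ($queue !== []) {
        $url = array_shift($queue);
        $level = $levels[$url];

        if ($level >= $maxDepth) {
            continue; // links on this page would exceed the depth limit
        }

        foreach ($linkGraph[$url] ?? [] as $link) {
            if (isset($levels[$link])) {
                continue; // already queued or loaded
            }

            if ($host !== null && parse_url($link, PHP_URL_HOST) !== $host) {
                continue; // link leaves the allowed host
            }

            $levels[$link] = $level + 1; // links found on a level n page are level n + 1
            $queue[] = $link;
        }
    }

    return $levels;
}

$levels = crawl($linkGraph, 'https://www.example.com/', 2, 'www.example.com');

foreach ($levels as $url => $level) {
    echo $level, ' ', $url, PHP_EOL;
}
```

With a depth of 2 and the host restriction, the sketch stops at `/foo/1`: `/foo/2` would be level 3, and `other.example.org` is on a different host.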

What is a Scraper?

Visualization of extracting data from a document

A scraper extracts data from a document. Crawling only gets you the documents you're looking for, but in most use cases you also want to extract certain data from those documents, which is called scraping.
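To make the idea of scraping concrete, here is a minimal sketch that uses only PHP's built-in DOM extension, not this library's steps. The markup in `$html` and the `scrapeArticle()` helper are hypothetical; the class names mirror the selectors used in the usage example further down.

```php
<?php

declare(strict_types=1);

// Hypothetical article markup, as a crawler might have loaded it.
$html = <<<'HTML'
<html><body>
<article>
  <h1>Some Title</h1>
  <span class="date">2022-09-30</span>
  <span class="articleAuthor">Christian Olear</span>
</article>
</body></html>
HTML;

/**
 * Extracts title, date and author from the first matching elements in <article>.
 *
 * @return array{title: ?string, date: ?string, author: ?string}
 */
function scrapeArticle(string $html): array
{
    $document = new DOMDocument();

    // Collect libxml warnings internally, e.g. about HTML5 tags such as <article>.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);

    // Returns the trimmed text of the first node matching the query, or null.
    $text = static function (string $query) use ($xpath): ?string {
        $nodes = $xpath->query($query);

        if ($nodes === false || $nodes->length === 0) {
            return null;
        }

        return trim($nodes->item(0)->textContent);
    };

    // XPath equivalent of the CSS selector "article .<class>".
    $byClass = static fn (string $class): string => sprintf(
        '//article//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    return [
        'title' => $text('//article//h1'),
        'date' => $text($byClass('date')),
        'author' => $text($byClass('articleAuthor')),
    ];
}

$data = scrapeArticle($html);

echo $data['title'], PHP_EOL;
```

The library's own extraction steps accept CSS selectors or XPath queries directly, so you would normally not write this by hand.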

That being said, this project prefers the term crawling, but most of the time it also includes scraping. The class that you need to extend is called Crawler, but it is used for both crawling and scraping.

Requirements

Requires PHP version 8.1 or above.

Installation

composer require crwlr/crawler

Usage

To build your first crawler you need two things:

First, you need a crawler class in which you define at least its user agent. For everything else, the library has good defaults.
Then you need to instantiate that class and define the crawling procedure by adding the steps it should perform.

Your crawler class must extend the Crawler class or, better, the HttpCrawler class. As mentioned, with the HttpCrawler class you only need to define the user agent it should identify as. You can either use a BotUserAgent to identify as a bot, or a normal UserAgent, which can be any user-agent string, for example that of a browser. If you use a BotUserAgent, the HttpCrawler will automatically load and respect the robots.txt file of every host you load URLs from. You can read more about the built-in politeness features here.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return BotUserAgent::make('MyBot');
    }
}

A very simple example of a crawling procedure that extracts data from articles linked on list pages looks like this:

use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler->input('https://www.example.com/articles')
    ->addStep(Http::get())                              // Load the listing page
    ->addStep(Html::getLinks('#artList .article a'))    // Get the links to the articles
    ->addStep(Http::get())                              // Load the article pages
    ->addStep(
        Html::first('article')                          // Extract the data
            ->extract([
                'title' => 'h1',
                'date' => '.date',
                'author' => '.articleAuthor'
            ])
    );

foreach ($crawler->run() as $result) {
    // Do something with the Result
}

As you can see, a central concept is the so-called "steps". The key to using this library is understanding how data flows through those steps.

Assuming the listing contains 3 articles, running this crawler from the command line produces output like this:

08:57:40:123456 [INFO] Loaded https://www.example.com/robots.txt
08:57:41:123456 [INFO] Loaded https://www.example.com/articles
08:57:42:123456 [INFO] Loaded https://www.example.com/articles/1
08:57:43:123456 [INFO] Loaded https://www.example.com/articles/2
08:57:44:123456 [INFO] Loaded https://www.example.com/articles/3
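The log above hints at how data flows through the steps: one listing page produces three links, and each link is then handled on its own by the following steps. The sketch below pictures that flow with plain PHP generators. It is a simplified mental model, not the library's implementation; `runStep()`, `pipeline()`, `$fakePages` and the two closures are hypothetical stand-ins for real steps.

```php
<?php

declare(strict_types=1);

/**
 * Calls the step once per input and passes on everything the step yields.
 *
 * @param iterable<mixed> $inputs
 * @param callable(mixed): iterable<mixed> $step
 */
function runStep(iterable $inputs, callable $step): Generator
{
    foreach ($inputs as $input) {
        yield from $step($input); // one input can produce zero, one or many outputs
    }
}

/**
 * Chains steps: the outputs of each step become the inputs of the next one.
 * Nothing is executed until the returned iterable is iterated.
 */
function pipeline(iterable $inputs, callable ...$steps): iterable
{
    foreach ($steps as $step) {
        $inputs = runStep($inputs, $step);
    }

    return $inputs;
}

// Hypothetical stand-in for a loaded listing page and the links found on it.
$fakePages = [
    'https://www.example.com/articles' => ['/articles/1', '/articles/2', '/articles/3'],
];

// Stand-in for "get the links to the articles": one listing URL in, many URLs out.
$getLinks = static function (string $listUrl) use ($fakePages): Generator {
    foreach ($fakePages[$listUrl] ?? [] as $path) {
        yield 'https://www.example.com' . $path;
    }
};

// Stand-in for "extract the data": one article URL in, one result array out.
$extract = static function (string $articleUrl): Generator {
    yield ['url' => $articleUrl];
};

$results = pipeline(['https://www.example.com/articles'], $getLinks, $extract);

foreach ($results as $result) {
    echo $result['url'], PHP_EOL;
}
```

Because every stage is a generator, each result is produced only when the `foreach` loop asks for it, which is the same idea that lets the library keep memory usage low.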

The final crawling results that the run() method returns are wrapped in the Result class. You can get a certain result property via its get() method, or just use the toArray() method to get the result as an array.

foreach ($crawler->run() as $result) {
    $title = $result->get('title');

    // or

    $resultArray = $result->toArray();

    // array(3) {
    //   ["title"]=>
    //   string(10) "Some Title"
    //   ["date"]=>
    //   string(10) "2022-09-30"
    //   ["author"]=>
    //   string(15) "Christian Olear"
    // }
}

For more convenient processing of results, have a look at the stores feature.

Documentation for crwlr / crawler (v2.1)