Documentation for crwlr / crawler (v1.6)

Attention: You're currently viewing the documentation for v1.6 of the crawler package.
This is not the latest version of the package.

HTTP Steps

The Http step implements the LoadingStepInterface and automatically receives the crawler's Loader when added to the crawler.

HTTP Requests

There are static methods to get steps for all the different HTTP methods:

use Crwlr\Crawler\Steps\Loading\Http;

Http::get();
Http::post();
Http::put();
Http::patch();
Http::delete();

They all take optional parameters for headers, body (for methods that support one) and HTTP version:

use Crwlr\Crawler\Steps\Loading\Http;

Http::get(array $headers = [], string $httpVersion = '1.1');

Http::post(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1',
);

Http::put(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1',
);

Http::patch(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1',
);

Http::delete(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1',
);

Getting Headers and/or Body from previous Step

By default, if the step receives array input, it looks for the key url or uri and uses its value as the request URL. To be as flexible as possible, the Http steps can receive not only the URL, but also headers and a body from the outputs of a previous step. Let's say you have a MyCustomStep that produces outputs like:

[
    'link' => 'https://www.example.com',
    'someHeaderValue' => '123abc',
    'queryString' => 'foo=bar&baz=quz',
]

You can get those values to be used as a certain HTTP request header and as the request body, like this:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('...')
    ->addStep(new MyCustomStep())
    ->addStep(
        Http::post()
            ->useInputKeyAsUrl('link')
            ->useInputKeyAsHeader('someHeaderValue', 'x-header-value')
            ->useInputKeyAsBody('queryString')
    );

As you can see, you can even map the output key to a specific header name.

You can also use an array from the output, containing multiple headers. Let's assume the output of MyCustomStep looks like:

[
    'link' => 'https://www.example.com',
    'customHeaders' => [
        'Accept' => 'text/html,application/xhtml+xml,application/xml',
        'Accept-Encoding' => 'gzip, deflate',
    ],
]

In this case you can add those headers to your request like this:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('...')
    ->addStep(new MyCustomStep())
    ->addStep(
        Http::post()
            ->useInputKeyAsUrl('link')
            ->useInputKeyAsHeaders('customHeaders')
    );

If you also define some headers statically when creating the step, the dynamic headers from the previous step's outputs are merged with them:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('...')
    ->addStep(new MyCustomStep())
    ->addStep(
        Http::get(headers: ['Accept-Language' => 'de-DE'])
            ->useInputKeyAsUrl('link')
            ->useInputKeyAsHeaders('customHeaders')
    );

Watch out: usually the Http steps receive the request URL as scalar input, or you define which key of an array input should be used by calling the step's useInputKey() method. When you also want to get headers and/or a body from the input, you have to use the useInputKeyAsUrl() method instead, because with useInputKey() all other values are thrown away before the step is invoked.

Error Responses

By default, error responses (HTTP status codes 4xx and 5xx) are not passed on to the next step in the crawler. If you also want to cascade error responses down to the next step, call the yieldErrorResponses() method:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/broken-link')
    ->addStep(
        Http::get()->yieldErrorResponses()
    )
    ->addStep(...);

Another default behavior is that crawlers keep on crawling after error responses (except for some special behavior in case of a 429 HTTP response, see the Politeness page). If it's important for your crawler that none of the requests fail, call the stopOnErrorResponse() method, and the step will throw a LoadingException when it receives an error response:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/broken-link')
    ->addStep(
        Http::get()->stopOnErrorResponse()
    )
    ->addStep(...);
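
To make the three behaviors easier to compare, here is a minimal plain PHP sketch that does not use the package at all. The function passesOn() and the class SketchLoadingException are hypothetical names invented for this illustration, and the precedence between the two flags is an assumption of the sketch, not something the package guarantees.

```php
<?php

// Hypothetical stand-in for the package's LoadingException, used only in this sketch.
class SketchLoadingException extends RuntimeException {}

/**
 * Returns true when a response with the given status would be passed on to the next step.
 */
function passesOn(int $status, bool $yieldErrorResponses = false, bool $stopOnErrorResponse = false): bool
{
    $isError = $status >= 400 && $status < 600; // 4xx and 5xx

    // Assumption of this sketch: stopping takes precedence over yielding.
    if ($isError && $stopOnErrorResponse) {
        throw new SketchLoadingException('Error response: ' . $status);
    }

    return !$isError || $yieldErrorResponses;
}

var_dump(passesOn(200));                            // bool(true)
var_dump(passesOn(404));                            // bool(false)
var_dump(passesOn(404, yieldErrorResponses: true)); // bool(true)
```

With stopOnErrorResponse set to true, the same 404 would throw instead of being dropped silently.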

Directly adding Response Data to the Result

After an HTTP request step, usually you'll have some step that extracts data from that response document. If you directly want to add some property from the response to the crawling result, you can use the output keys url, status, headers and body:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com')
    ->addStep(
        Http::get()
            ->addToResult(['url', 'status', 'headers', 'body'])
    );

Paginating List Pages

A typical challenge when crawling is listings with items spread over multiple pages. Paginators are a convenient way to solve this. Here's a simple example of how to use one:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/some/listing')
    ->addStep(
        Http::get()->paginate('#pages')
    );

As its first argument, the paginate() method takes either a CSS selector string or an instance of the Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator class. With a CSS selector, the method creates an instance of the SimpleWebsitePaginator class. The selector can either select the link to the next page, or just the element containing all the pagination links. The SimpleWebsitePaginator remembers all the URLs it has already loaded, so it won't load any link twice. Keep in mind, though, that pages may not be loaded in the correct order when you select a pagination wrapper element.

As its second argument, the paginate() method takes the maximum number of pages to load. If you don't provide a value, the default is 1000.
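
The two ideas described above (never loading a URL twice, and stopping at a maximum number of pages) can be sketched in plain PHP. This is only an illustration under stated assumptions, not the SimpleWebsitePaginator's actual code: the $paginationLinks array is a made-up in-memory site and paginateSketch() is a hypothetical helper.

```php
<?php

// Hypothetical in-memory "site": each page URL maps to the pagination links found on it.
$paginationLinks = [
    '/listing'        => ['/listing?page=2', '/listing?page=3'],
    '/listing?page=2' => ['/listing', '/listing?page=3'],
    '/listing?page=3' => ['/listing', '/listing?page=2', '/listing?page=4'],
    '/listing?page=4' => ['/listing?page=3'],
];

function paginateSketch(array $paginationLinks, string $start, int $maxPages = 1000): array
{
    $loaded = [];       // URLs already loaded, used as a set via array keys
    $queue  = [$start];

    while ($queue !== [] && count($loaded) < $maxPages) {
        $url = array_shift($queue);

        if (isset($loaded[$url])) {
            continue;   // never load the same URL twice
        }

        $loaded[$url] = true;

        foreach ($paginationLinks[$url] ?? [] as $link) {
            if (!isset($loaded[$link])) {
                $queue[] = $link;
            }
        }
    }

    return array_keys($loaded);
}

print_r(paginateSketch($paginationLinks, '/listing'));    // all four pages, each loaded once
print_r(paginateSketch($paginationLinks, '/listing', 2)); // stops after two pages
```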

Query Params Paginator

Another paginator implementation shipped with the package is the Crwlr\Crawler\Steps\Loading\Http\Paginators\QueryParamsPaginator. It automatically increases or decreases values of query parameters, either in the URL or in the request body (e.g. with POST requests).

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Loading\Http\Paginator;

$crawler = new MyCrawler();

$crawler
    ->input('https://www.example.com/list')
    ->addStep(
        Http::post(body: 'page=1&offset=0')
            ->paginate(
                Paginator::queryParams()
                    ->inBody()                  // or ->inUrl() when working with URL query params
                    ->increase('page')
                    ->increase('offset', 20)
            )
    );

In this example, the page query parameter is increased by one (the default increase value) after each request, and the offset parameter is increased by 20. You can also decrease parameter values using the decrease() method.
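
To see what this means for the request body from one request to the next, here is a small plain PHP sketch built on parse_str() and http_build_query(). The helper increaseQueryParams() is a hypothetical name for this illustration and is not part of the package.

```php
<?php

// Hypothetical helper: apply per-parameter increments to a URL-encoded query string.
// A negative increment decreases the value.
function increaseQueryParams(string $query, array $increments): string
{
    parse_str($query, $params);

    foreach ($increments as $name => $by) {
        $params[$name] = (int) ($params[$name] ?? 0) + $by;
    }

    return http_build_query($params);
}

$body = 'page=1&offset=0';

for ($i = 0; $i < 3; $i++) {
    echo $body . PHP_EOL;
    $body = increaseQueryParams($body, ['page' => 1, 'offset' => 20]);
}

// Prints:
// page=1&offset=0
// page=2&offset=20
// page=3&offset=40
```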

If you're dealing with a nested query string like pagination[page]=1&pagination[size]=25, you can use dot notation to define the query param to increase or decrease:

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Loading\Http\Paginator;

$crawler = new MyCrawler();

$crawler
    ->input('https://www.example.com/list?pagination[page]=1&pagination[size]=25')
    ->addStep(
        Http::get()
            ->paginate(
                Paginator::queryParams()
                    ->inUrl()
                    ->increase('pagination.page')
            )
    );
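
How a dot notation key maps onto a nested query string can be sketched in plain PHP. PHP's parse_str() turns pagination[page]=1 into a nested array, so the dotted key is just a path into that array. The helper increaseNestedParam() is hypothetical and only illustrates the idea; the package may resolve keys differently.

```php
<?php

// Hypothetical helper: increase a nested query parameter addressed in dot notation.
function increaseNestedParam(array $params, string $dotKey, int $by = 1): array
{
    $ref = &$params;

    // Walk down the nested array, one segment of the dotted key at a time.
    foreach (explode('.', $dotKey) as $segment) {
        if (!is_array($ref)) {
            $ref = [];
        }

        $ref = &$ref[$segment];
    }

    $ref = (int) $ref + $by;
    unset($ref); // drop the reference before returning the array

    return $params;
}

parse_str('pagination[page]=1&pagination[size]=25', $params);

$params = increaseNestedParam($params, 'pagination.page');

echo urldecode(http_build_query($params)) . PHP_EOL;
// Prints: pagination[page]=2&pagination[size]=25
```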

However, the problem with the paginator examples so far is that they keep sending requests until they reach the default limit of 1000 requests (you can customize this limit by passing it as an argument: Paginator::queryParams(300)). What we want instead is to give the paginator a rule that decides, based on the responses it receives, when to stop loading further pages. This is what stop rules are for.

Paginator Stop Rules

Suppose the example.com/list endpoint returns a JSON list of books, with book items stored in data.books. When we reach the end of the list, data.books becomes empty, and we want to stop loading any further pages. To achieve this, we can do:

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Loading\Http\Paginator;
use Crwlr\Crawler\Steps\Loading\Http\Paginators\StopRules\PaginatorStopRules;

$crawler = new MyCrawler();

$crawler
    ->input('https://www.example.com/list?page=1')
    ->addStep(
        Http::get()
            ->paginate(
                Paginator::queryParams()
                    ->inUrl()
                    ->increase('page')
                    ->stopWhen(PaginatorStopRules::isEmptyInJson('data.books'))
            )
    );

As you can see, you can define a so-called stop rule through the stopWhen() method. These stop rules are applicable to any paginator, as they are implemented in the AbstractPaginator class. The package includes several pre-defined stop rules, such as:

PaginatorStopRules::isEmptyResponse()
// Paginator stops when response body is empty.

PaginatorStopRules::isEmptyInJson('data.items')
// Paginator stops when response is empty, or data.items doesn't exist or is empty in JSON response.

PaginatorStopRules::isEmptyInHtml('#search .list .item')
// Paginator stops when response is empty, or the CSS selector `#search .list .item` does not select any nodes.

PaginatorStopRules::isEmptyInXml('channel item')
// Paginator stops when response is empty, or the CSS selector `channel item` does not select any nodes.
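
To make the JSON rule more concrete, here is a rough plain PHP sketch of the check described in the comment above (empty response, missing key, or empty value). It is a hypothetical re-implementation for illustration only, not the package's code, and isEmptyInJsonSketch() is an invented name.

```php
<?php

// Hypothetical re-implementation for illustration only.
function isEmptyInJsonSketch(string $responseBody, string $dotKey): bool
{
    if (trim($responseBody) === '') {
        return true; // empty response
    }

    $value = json_decode($responseBody, true);

    foreach (explode('.', $dotKey) as $segment) {
        if (!is_array($value) || !array_key_exists($segment, $value)) {
            return true; // key doesn't exist (or the body is not valid JSON)
        }

        $value = $value[$segment];
    }

    return empty($value); // key exists, stop only if its value is empty
}

var_dump(isEmptyInJsonSketch('{"data":{"books":[{"title":"A"}]}}', 'data.books')); // bool(false)
var_dump(isEmptyInJsonSketch('{"data":{"books":[]}}', 'data.books'));              // bool(true)
var_dump(isEmptyInJsonSketch('{"data":{}}', 'data.books'));                        // bool(true)
```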

If your use case requires unique criteria, you can also supply a custom Closure.

use Crwlr\Crawler\Loader\Http\Messages\RespondedRequest;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Loading\Http\Paginator;
use Psr\Http\Message\RequestInterface;

$crawler = new MyCrawler();

$crawler
    ->input('https://www.example.com/list?page=1')
    ->addStep(
        Http::get()
            ->paginate(
                Paginator::queryParams()
                    ->inUrl()
                    ->increase('page')
                    ->stopWhen(function (RequestInterface $request, ?RespondedRequest $respondedRequest) {
                        // Based on the $request and the $respondedRequest object provided to the callback
                        // you can decide if the paginator should stop. In this case, return true.

                        return true;
                    })
            )
    );

Custom Paginators

If the paginators shipped with the package don't fit your needs, you can write your own. A Paginator has to extend the Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator class (please make sure the namespace is correct, as there is another version of that class in a different namespace, which is already deprecated) and implement at least a custom getNextRequest() method. Actually, at the moment the AbstractPaginator still contains a default implementation of that method, but it will be removed in v2.0 of the library, so it's better to already provide your own implementation. Optionally you can also implement your custom versions of the methods processLoaded(), hasFinished() and logWhenFinished().

use Crwlr\Crawler\Loader\Http\Messages\RespondedRequest;
use Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\UriInterface;
use Psr\Log\LoggerInterface;

class CustomPaginator extends AbstractPaginator
{
    public function getNextRequest(): ?RequestInterface
    {
        // Let's say we paginate URLs with a path like this: /list-of-things/<pageNumber>

        $latestRequestUrlPath = $this->latestRequest->getUri()->getPath();

        $prevPageNumber = explode('/list-of-things/', $latestRequestUrlPath);

        if (count($prevPageNumber) < 2) {
            return null;
        }

        $nextPageNumber = ((int) $prevPageNumber[1]) + 1;

        return $this->latestRequest->withUri(
            $this->latestRequest->getUri()->withPath('/list-of-things/' . $nextPageNumber)
        );
    }
}

You can then use your Paginator class like this:

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Loading\Http\Paginators\StopRules\PaginatorStopRules;

$crawler
    ->input('https://www.example.com/list-of-things/1')
    ->addStep(
        Http::get()
            ->paginate(
                // stopWhen() is a paginator method, so it is called on the paginator instance.
                (new CustomPaginator())
                    ->stopWhen(PaginatorStopRules::isEmptyInHtml('#results .item'))
            )
    );

Crawling (whole Websites)

If you want to crawl a whole website, the Http::crawl() step is for you. By default, it follows all the links it finds until everything on the same host is loaded.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/')
    ->addStep(Http::crawl());

Depth

You can also tell it to only follow links to a certain depth.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->depth(2)
    );

This means it will load all the URLs it finds on the page from the initial input (in this case https://www.example.com/), then all the links it finds on those pages, and then it will stop. With a depth of 3, it will load another level of newly found links.
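
The depth limit can be pictured as a level-by-level traversal of the site's link graph. The following plain PHP sketch assumes a made-up in-memory graph ($links) and a hypothetical crawlToDepth() helper. It only illustrates the counting of levels, not how the step is implemented.

```php
<?php

// Hypothetical link graph: page URL => links found on that page.
$links = [
    '/'    => ['/a', '/b'],
    '/a'   => ['/a/1'],
    '/b'   => ['/'],
    '/a/1' => ['/a/1/x'],
];

function crawlToDepth(array $links, string $start, int $maxDepth): array
{
    $loaded  = [$start => 0]; // URL => depth at which it was discovered
    $current = [$start];

    for ($depth = 1; $depth <= $maxDepth; $depth++) {
        $next = [];

        foreach ($current as $url) {
            foreach ($links[$url] ?? [] as $found) {
                if (!isset($loaded[$found])) {
                    $loaded[$found] = $depth;
                    $next[] = $found;
                }
            }
        }

        $current = $next; // the next level starts from the newly found URLs
    }

    return $loaded;
}

print_r(crawlToDepth($links, '/', 2));
// '/' (input), '/a' and '/b' (depth 1), '/a/1' (depth 2); '/a/1/x' would need depth 3
```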

Start with a sitemap

By using the inputIsSitemap() method, you can start crawling with a sitemap.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/sitemap.xml')
    ->addStep(
        Http::crawl()
            ->inputIsSitemap()
    );

The crawl step usually assumes that all input URLs will deliver HTML documents, so if you want to start crawling with a sitemap, the call to this method is necessary.

Load URLs on the same domain (instead of host)

As mentioned, by default it loads all the pages on the same host, for example www.example.com. If there's a link to https://jobs.example.com/foo, it won't follow that link, as it is on jobs.example.com. But you can tell it to also load all URLs on the same domain, using the sameDomain() method:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->sameDomain()
    );
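
The difference between the two comparisons can be sketched in plain PHP with parse_url(). Note the loud assumption in naiveDomain(): it simply takes the last two labels of the host, which is wrong for suffixes like co.uk. A real implementation needs the Public Suffix List. All function names here are hypothetical.

```php
<?php

// Naive assumption: the registrable domain is the last two labels of the host.
// This breaks for suffixes such as "co.uk"; it is good enough only for this illustration.
function naiveDomain(string $host): string
{
    return implode('.', array_slice(explode('.', $host), -2));
}

function isSameHost(string $a, string $b): bool
{
    return parse_url($a, PHP_URL_HOST) === parse_url($b, PHP_URL_HOST);
}

function isSameDomain(string $a, string $b): bool
{
    return naiveDomain((string) parse_url($a, PHP_URL_HOST))
        === naiveDomain((string) parse_url($b, PHP_URL_HOST));
}

$start = 'https://www.example.com/';
$link  = 'https://jobs.example.com/foo';

var_dump(isSameHost($start, $link));   // bool(false): not followed by default
var_dump(isSameDomain($start, $link)); // bool(true): followed with sameDomain()
```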

Only load URLs matching path criteria

There are two methods you can use to tell it to only load URLs with certain paths:

pathStartsWith()

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->pathStartsWith('/foo/')
    );

In this case it will only load found URLs where the path starts with /foo/, so for example: https://www.example.com/foo/bar, but not https://www.example.com/other/bar.

pathMatches()

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->pathMatches('/\/bar\//')
    );

The pathMatches() method takes a regex to match the paths of found URLs. So in this case it will load all URLs containing /bar/ anywhere in the path.
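
If you're unsure which paths a prefix or a regex will let through, you can try the patterns on a few sample paths in plain PHP first. This sketch assumes the checks behave like str_starts_with() and preg_match() applied to the URL path. The two helper functions are hypothetical and not part of the package.

```php
<?php

// Hypothetical helpers mirroring the two path criteria on plain path strings.
function pathStartsWithSketch(string $path, string $prefix): bool
{
    return str_starts_with($path, $prefix); // requires PHP 8.0+
}

function pathMatchesSketch(string $path, string $regex): bool
{
    return preg_match($regex, $path) === 1;
}

foreach (['/foo/bar', '/other/bar', '/foo/bar/baz', '/bar/'] as $path) {
    printf(
        "%-14s startsWith /foo/: %-5s matches /bar/: %s\n",
        $path,
        var_export(pathStartsWithSketch($path, '/foo/'), true),
        var_export(pathMatchesSketch($path, '/\/bar\//'), true)
    );
}
```

Note that /foo/bar does not match the regex from the example above, because the pattern requires a slash after bar.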

Custom Filtering based on URL or Link Element

The customFilter() method allows you to define your own callback function that will be called with any found URL or link:

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Url\Url;

$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->customFilter(function (Url $url) {
                return $url->scheme() === 'https';
            })
    );

So, this example will only load URLs where the URL scheme is https.

In case the URL was found in an HTML document (not in a sitemap), the Closure also receives the link element as a Symfony DomCrawler instance as the second argument:

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Url\Url;
use Symfony\Component\DomCrawler\Crawler;

$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->customFilter(function (Url $url, ?Crawler $linkElement) {
                return $linkElement && str_contains($linkElement->innerText(), 'Foo');
            })
    );

So, this example will only load links when the link text contains Foo.

Load all URLs but yield only matching

When restricting crawling, e.g. to only paths starting with /foo/, it will only load matching URLs (after the initial input URL). So if some page /some/page contains a link to /foo/quz, that link won't be found, because /some/page itself is never loaded. If you want to find all links matching your criteria on the whole website, but yield only the responses of the matching URLs, you can use the loadAllButYieldOnlyMatching() method.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->pathStartsWith('/foo')
            ->loadAllButYieldOnlyMatching()
    );

This works for restrictions defined using the path methods (pathStartsWith() and pathMatches()) and also for the customFilter() method. Of course, it doesn't affect depth or staying on the same host or domain.
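
The effect is easiest to see on a tiny example. The following plain PHP sketch uses a made-up link graph and a hypothetical crawlSketch() function. It separates the decision to load a found URL from the decision to yield it. Whether the start page itself is yielded is not modeled here.

```php
<?php

// Hypothetical link graph: page URL => links found on that page.
$links = [
    '/'          => ['/foo/a', '/some/page'],
    '/foo/a'     => [],
    '/some/page' => ['/foo/quz'],
    '/foo/quz'   => [],
];

function crawlSketch(array $links, string $start, callable $matches, bool $loadAll): array
{
    $loaded  = [$start => true];
    $queue   = [$start];
    $yielded = [];

    while ($queue !== []) {
        $url = array_shift($queue);

        foreach ($links[$url] ?? [] as $found) {
            $isMatch = $matches($found);

            if (isset($loaded[$found]) || (!$loadAll && !$isMatch)) {
                continue; // already loaded, or filtered out before loading
            }

            $loaded[$found] = true;
            $queue[] = $found;

            if ($isMatch) {
                $yielded[] = $found; // only matching URLs are yielded
            }
        }
    }

    return $yielded;
}

$startsWithFoo = fn (string $path): bool => str_starts_with($path, '/foo/');

print_r(crawlSketch($links, '/', $startsWithFoo, false)); // only /foo/a
print_r(crawlSketch($links, '/', $startsWithFoo, true));  // /foo/a and /foo/quz
```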

Use Canonical Links

If a website delivers the same content via multiple URLs (for example example.com/products?productId=123 and example.com/products/123), it can use canonical links to tell crawlers whether a page is a duplicate of another one and which one is the main URL. If you want to avoid loading the same document multiple times, you can tell the Http::crawl() step to use canonical links by calling its useCanonicalLinks() method.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/foo')
    ->addStep(
        Http::crawl()
            ->useCanonicalLinks()
    );

When this method is called, the step will not yield a response if its canonical link URL was already yielded before. If it discovers a link, and some document pointing to that URL via canonical link was already loaded, the newly discovered link is treated as if it was already loaded. This feature also sets the canonical link URL as the effectiveUri of the response.
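
The deduplication idea can be sketched in plain PHP. The $documents array below is made-up data (requested URL mapped to the canonical link found in that document), and yieldUniqueByCanonical() is a hypothetical helper. It only illustrates the "yield each canonical URL once" rule, not the step's internals.

```php
<?php

// Hypothetical loaded documents: requested URL => canonical link URL found in the document.
$documents = [
    'https://example.com/products?productId=123' => 'https://example.com/products/123',
    'https://example.com/products/123'           => 'https://example.com/products/123',
    'https://example.com/about'                  => 'https://example.com/about',
];

function yieldUniqueByCanonical(array $documents): array
{
    $seen    = []; // canonical URLs already yielded, used as a set via array keys
    $yielded = [];

    foreach ($documents as $requestedUrl => $canonicalUrl) {
        if (isset($seen[$canonicalUrl])) {
            continue; // a document with this canonical URL was already yielded
        }

        $seen[$canonicalUrl] = true;
        $yielded[] = $canonicalUrl; // the canonical URL stands in for the requested one
    }

    return $yielded;
}

print_r(yieldUniqueByCanonical($documents));
// Two entries: https://example.com/products/123 and https://example.com/about
```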

Keep URL Fragments

By default, the Http::crawl() step throws away the fragment part of all discovered URLs (example.com/path#fragment => example.com/path), because websites only very rarely respond with different content based on the fragment part. If a site that you're crawling does so, you can tell the step to keep the URL fragment by calling the keepUrlFragment() method.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/something')
    ->addStep(
        Http::crawl()
            ->keepUrlFragment()
    );
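
To illustrate why dropping the fragment reduces the number of distinct URLs, here is a small plain PHP sketch. The helpers stripFragment() and distinctUrls() are hypothetical and only mimic the behavior described above.

```php
<?php

// Hypothetical helper: remove everything from the first "#" onwards.
function stripFragment(string $url): string
{
    $position = strpos($url, '#');

    return $position === false ? $url : substr($url, 0, $position);
}

// Hypothetical helper: the distinct URLs that would be loaded, with or without fragments.
function distinctUrls(array $discovered, bool $keepFragment): array
{
    $urls = $keepFragment ? $discovered : array_map('stripFragment', $discovered);

    return array_values(array_unique($urls));
}

$discovered = [
    'https://www.example.com/path#intro',
    'https://www.example.com/path#details',
    'https://www.example.com/path',
];

print_r(distinctUrls($discovered, false)); // one URL: https://www.example.com/path
print_r(distinctUrls($discovered, true));  // all three URLs are treated as different
```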