Documentation for crwlr / crawler (v1.2)

Attention: You're currently viewing the documentation for v1.2 of the crawler package.
This is not the latest version of the package.

HTTP Steps

The Http step implements the LoadingStepInterface and automatically receives the crawler's Loader when added to the crawler.

HTTP Requests

There are static methods to get steps for all the different HTTP methods:

use Crwlr\Crawler\Steps\Loading\Http;

Http::get();
Http::post();
Http::put();
Http::patch();
Http::delete();

They all accept optional parameters for headers, the body (for methods that support one), and the HTTP version:

use Crwlr\Crawler\Steps\Loading\Http;

Http::get(array $headers = [], string $httpVersion = '1.1');

Http::post(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1',
);

Http::put(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1',
);

Http::patch(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1',
);

Http::delete(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1',
);

Getting Headers and/or Body from previous Step

By default, if the step receives array input, it looks for the key url or uri and uses its value as the request URL. To be as flexible as possible, the Http steps can also take headers and a body, not just the URL, from the outputs of a previous step. Let's say you have a MyCustomStep that produces outputs like:

[
    'link' => 'https://www.example.com',
    'someHeaderValue' => '123abc',
    'queryString' => 'foo=bar&baz=quz',
]

You can use those values as a specific HTTP request header and as the request body, like this:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('...')
    ->addStep(new MyCustomStep())
    ->addStep(
        Http::post()
            ->useInputKeyAsUrl('link')
            ->useInputKeyAsHeader('someHeaderValue', 'x-header-value')
            ->useInputKeyAsBody('queryString')
    );

As you can see, you can even map the output key to a specific header name.

You can also use an array from the output, containing multiple headers. Let's assume the output of MyCustomStep looks like:

[
    'link' => 'https://www.example.com',
    'customHeaders' => [
        'Accept' => 'text/html,application/xhtml+xml,application/xml',
        'Accept-Encoding' => 'gzip, deflate',
    ],
]

In this case you can add those headers to your request like this:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('...')
    ->addStep(new MyCustomStep())
    ->addStep(
        Http::post()
            ->useInputKeyAsUrl('link')
            ->useInputKeyAsHeaders('customHeaders')
    );

If you also define some headers statically when creating the step, the dynamic headers from the previous step's outputs are merged with them:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('...')
    ->addStep(new MyCustomStep())
    ->addStep(
        Http::get(headers: ['Accept-Language' => 'de-DE'])
            ->useInputKeyAsUrl('link')
            ->useInputKeyAsHeaders('customHeaders')
    );
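
To make the merge more tangible, here is a minimal plain-PHP sketch that combines statically defined headers with headers taken from a previous step's output. It does not use the package's internals: mergeHeaders() is a hypothetical helper, and keeping both values when the same header name occurs on both sides is an assumption, because the text above does not say how such collisions are resolved.

```php
// Hypothetical helper, not part of crwlr/crawler: merges two header arrays.
// Assumption: if a header name exists on both sides, both values are kept.
function mergeHeaders(array $static, array $dynamic): array
{
    $merged = [];

    foreach ([$static, $dynamic] as $headers) {
        foreach ($headers as $name => $value) {
            // A header value may be a single string or a list of strings.
            foreach ((array) $value as $singleValue) {
                $merged[$name][] = $singleValue;
            }
        }
    }

    return $merged;
}

// Headers defined when creating the step.
$static = ['Accept-Language' => 'de-DE'];

// Headers from the 'customHeaders' key of the previous step's output.
$dynamic = [
    'Accept' => 'text/html,application/xhtml+xml,application/xml',
    'Accept-Encoding' => 'gzip, deflate',
];

$headers = mergeHeaders($static, $dynamic);
```

With the values from the examples above, $headers contains all three header names, each with a list of values.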

Watch out: usually the Http steps receive the request URL as scalar input, or you define which key from an array input should be used by calling the step's useInputKey() method. When you also want to get headers and/or a body from the input, you have to use the useInputKeyAsUrl() method instead, because useInputKey() discards all other values before the step is invoked.

Error Responses

By default, error responses (HTTP status codes 4xx and 5xx) are not passed on to the next step in the crawler. If you also want to pass error responses on to the next step, call the yieldErrorResponses() method:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/broken-link')
    ->addStep(
        Http::get()->yieldErrorResponses()
    )
    ->addStep(...);

By default, crawlers also keep crawling after error responses (except for some special behavior in case of a 429 HTTP response, see the Politeness page). If it's important for your crawler that none of the requests fail, call the stopOnErrorResponse() method, and the step will throw a LoadingException when it receives an error response:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/broken-link')
    ->addStep(
        Http::get()->stopOnErrorResponse()
    )
    ->addStep(...);
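
The three behaviors described in this section (drop error responses, pass them on, or stop) can be sketched in plain PHP as follows. This is an illustration under assumptions, not the package's code: passOnResponses() is a hypothetical function, responses are simple arrays, a RuntimeException stands in for the LoadingException, and checking the stop flag before the yield flag is an arbitrary choice of this sketch.

```php
// Hypothetical stand-in for the step's behavior, not the package's code.
// Each response is a plain array with the keys 'url' and 'status'.
function passOnResponses(
    iterable $responses,
    bool $yieldErrorResponses = false,
    bool $stopOnErrorResponse = false,
): Generator {
    foreach ($responses as $response) {
        $isError = $response['status'] >= 400 && $response['status'] < 600;

        if ($isError && $stopOnErrorResponse) {
            // The real step throws a LoadingException; RuntimeException stands in for it here.
            throw new RuntimeException('Error response from ' . $response['url']);
        }

        // Successful responses are always passed on, error responses only on request.
        if (!$isError || $yieldErrorResponses) {
            yield $response;
        }
    }
}

$responses = [
    ['url' => 'https://www.example.com/', 'status' => 200],
    ['url' => 'https://www.example.com/broken-link', 'status' => 404],
];

// Default: only the 200 response reaches the next step.
$default = iterator_to_array(passOnResponses($responses), false);

// With error responses passed on: both responses reach the next step.
$withErrors = iterator_to_array(passOnResponses($responses, yieldErrorResponses: true), false);
```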

Directly adding Response Data to the Result

After an HTTP request step, you'll usually have a step that extracts data from the response document. If you want to add properties of the response directly to the crawling result, you can use the output keys url, status, headers and body:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com')
    ->addStep(
        Http::get()
            ->addToResult(['url', 'status', 'headers', 'body'])
    );

Paginating List Pages

A typical challenge when crawling is listings that are spread over multiple pages with different URLs. Paginators are a convenient way to solve this. Here's a simple example of how to use one:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/some/listing')
    ->addStep(
        Http::get()->paginate('#pages')
    );

As its first argument, the paginate() method takes either a CSS selector string or an instance of the PaginatorInterface (more about this below). With a CSS selector, the method creates an instance of the SimpleWebsitePaginator class. The selector can either select the link to the next page or the element containing all the pagination links. The SimpleWebsitePaginator remembers all the URLs it has already loaded, so it won't load any link twice. Keep in mind, though, that pages may not be loaded in the correct order when you select a pagination wrapper element.

As its second argument, the paginate() method takes the maximum number of pages to load. If you don't provide a value, it defaults to 1000.
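
The two properties described above (never loading a URL twice, and stopping at a maximum number of pages) can be illustrated with a small plain-PHP sketch. It is not the SimpleWebsitePaginator implementation: the paginate() function and the in-memory $pages array are hypothetical stand-ins for real page loads and for the links found via the CSS selector.

```php
// Hypothetical in-memory "website": each URL maps to the pagination links found on that page.
$pages = [
    '/listing' => ['/listing?page=2', '/listing?page=3'],
    '/listing?page=2' => ['/listing', '/listing?page=3'],
    '/listing?page=3' => ['/listing', '/listing?page=2'],
];

// Sketch only: loads pages until no unseen link is left or $maxPages is reached.
function paginate(array $pages, string $startUrl, int $maxPages = 1000): array
{
    $loaded = [];
    $queue = [$startUrl];

    while ($queue !== [] && count($loaded) < $maxPages) {
        $url = array_shift($queue);

        if (in_array($url, $loaded, true)) {
            continue; // Already loaded, so it is never loaded a second time.
        }

        $loaded[] = $url;

        // Remember newly found pagination links for later.
        foreach ($pages[$url] ?? [] as $link) {
            if (!in_array($link, $loaded, true) && !in_array($link, $queue, true)) {
                $queue[] = $link;
            }
        }
    }

    return $loaded;
}

$all = paginate($pages, '/listing');        // all three pages, each loaded once
$limited = paginate($pages, '/listing', 2); // stops after two pages
```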

Custom Paginators

The SimpleWebsitePaginator is currently the only Paginator shipped with the package, but if it doesn't fit your needs, you can write your own. A Paginator has to implement the PaginatorInterface. You can also extend the AbstractPaginator class, which comes with a constructor taking a max pages argument and a default implementation of the PaginatorInterface::prepareRequest() method that just returns the incoming request unchanged.

use Crwlr\Crawler\Loader\Http\Messages\RespondedRequest;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\UriInterface;
use Psr\Log\LoggerInterface;

class CustomPaginator extends AbstractPaginator
{
    public function hasFinished(): bool
    {
        // This method is called after each page load to check if we're finished loading all pages.
        return true; // Placeholder, replace with your own check.
    }

    public function getNextUrl(): ?string
    {
        // Return the next URL that should be loaded, or null if there is no further page to load.
        return null; // Placeholder.
    }

    public function prepareRequest(
        RequestInterface $request,
        ?RespondedRequest $previousResponse = null,
    ): RequestInterface {
        // Here you can manipulate a request before it is sent.
        // This way you can, for example, also handle cases where different pages are served via POST requests.

        // Implementing this method is optional when you extend the AbstractPaginator class!
        return $request;
    }

    public function processLoaded(
        UriInterface $url,
        RequestInterface $request,
        ?RespondedRequest $respondedRequest,
    ): void {
        // This method is called after a page was loaded.
        // Here you can process the response and get further links to load.
    }

    public function logWhenFinished(LoggerInterface $logger): void
    {
        // This method is called when hasFinished() returned true. Here you can log some messages if you want to.
    }
}

You can then use your Paginator class like this:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/some/listing')
    ->addStep(
        Http::get()->paginate(new CustomPaginator())
    );

As another example, if you've built a Paginator for a use case where different pages are served based on POST parameters:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/some/listing')
    ->addStep(
        Http::post()->paginate(new MyPostParamPaginator())
    );

Crawling (whole Websites)

If you want to crawl a whole website, the Http::crawl() step is for you. By default, it follows all the links it finds until everything on the same host is loaded.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/')
    ->addStep(Http::crawl());

Depth

You can also tell it to only follow links to a certain depth.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->depth(2)
    );

This means it will load all the URLs it finds on the page from the initial input (in this case https://www.example.com/), then all the links it finds on those pages, and then it will stop. With a depth of 3, it will load one more level of newly found links.
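
A depth limit like this can be pictured as a breadth-first traversal that stops following links once a page sits at the maximum depth. The following plain-PHP sketch shows the idea on an in-memory link graph. The $site array and the crawlToDepth() function are hypothetical, and the package's own implementation may differ.

```php
// Hypothetical in-memory link graph: path => links found on that page.
$site = [
    '/' => ['/a', '/b'],
    '/a' => ['/a/1'],
    '/b' => ['/'],
    '/a/1' => ['/a/1/x'],
    '/a/1/x' => [],
];

// Sketch only: breadth-first crawl that follows links up to $maxDepth levels below the start URL.
function crawlToDepth(array $site, string $startUrl, int $maxDepth): array
{
    $found = [$startUrl => 0]; // path => depth at which it was found
    $queue = [$startUrl];

    while ($queue !== []) {
        $url = array_shift($queue);
        $depth = $found[$url];

        if ($depth >= $maxDepth) {
            continue; // Links on this page would exceed the depth limit.
        }

        foreach ($site[$url] ?? [] as $link) {
            if (!array_key_exists($link, $found)) {
                $found[$link] = $depth + 1;
                $queue[] = $link;
            }
        }
    }

    return $found;
}

$depthTwo = crawlToDepth($site, '/', 2);   // stops before '/a/1/x'
$depthThree = crawlToDepth($site, '/', 3); // also reaches '/a/1/x'
```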

Start with a sitemap

By using the inputIsSitemap() method, you can start crawling with a sitemap.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/sitemap.xml')
    ->addStep(
        Http::crawl()
            ->inputIsSitemap()
    );

The crawl step usually assumes that all input URLs will deliver HTML documents, so if you want to start crawling with a sitemap, the call to this method is necessary.
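
The reason a sitemap needs different handling is that its URLs sit in loc elements of an XML document rather than in links of an HTML page. As a rough illustration, the following plain-PHP sketch extracts those URLs with DOMDocument. The urlsFromSitemap() function is hypothetical and says nothing about how the package parses sitemaps internally.

```php
// Sketch only: collects the URLs listed in a sitemap's <loc> elements.
function urlsFromSitemap(string $xml): array
{
    $document = new DOMDocument();
    $document->loadXML($xml);

    $urls = [];

    foreach ($document->getElementsByTagName('loc') as $loc) {
        $urls[] = trim($loc->textContent);
    }

    return $urls;
}

// Minimal example sitemap with two URLs.
$sitemap = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://www.example.com/</loc></url>
    <url><loc>https://www.example.com/foo</loc></url>
</urlset>
XML;

$urls = urlsFromSitemap($sitemap);
```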

Load URLs on the same domain (instead of host)

As mentioned, by default it loads all the pages on the same host, for example www.example.com. If there's a link to https://jobs.example.com/foo, it won't follow that link, as it is on jobs.example.com. But you can tell it to also load all URLs on the same domain, using the sameDomain() method:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->sameDomain()
    );
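
The difference between the two checks can be sketched in plain PHP. All three functions below are hypothetical and not part of the package. Treating the last two labels of the host as the registrable domain is a deliberate simplification of this sketch: it is wrong for suffixes such as co.uk, where a correct check needs the Public Suffix List.

```php
// Sketch only. Assumption: the registrable domain is simply the last two labels of the host.
function naiveDomain(string $host): string
{
    return implode('.', array_slice(explode('.', $host), -2));
}

// Default behavior: the full host has to be identical.
function isOnSameHost(string $startUrl, string $foundUrl): bool
{
    return parse_url($startUrl, PHP_URL_HOST) === parse_url($foundUrl, PHP_URL_HOST);
}

// sameDomain() behavior: only the (naively determined) domain has to be identical.
function isOnSameDomain(string $startUrl, string $foundUrl): bool
{
    return naiveDomain((string) parse_url($startUrl, PHP_URL_HOST))
        === naiveDomain((string) parse_url($foundUrl, PHP_URL_HOST));
}

$start = 'https://www.example.com/';
$found = 'https://jobs.example.com/foo';

$sameHost = isOnSameHost($start, $found);     // false, the hosts differ
$sameDomain = isOnSameDomain($start, $found); // true, both are on example.com
```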

Only load URLs matching path criteria

There are two methods you can use to tell it to only load URLs with certain paths:

pathStartsWith()

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->pathStartsWith('/foo/')
    );

In this case, it will only load found URLs whose path starts with /foo/, for example https://www.example.com/foo/bar, but not https://www.example.com/other/bar.

pathMatches()

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->pathMatches('/\/bar\//')
    );

The pathMatches() method takes a regex to match the paths of found URLs. So in this case it will load all URLs containing /bar/ anywhere in the path.
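
To see what the two path checks accept and reject, they can be approximated with plain PHP functions. The functions below are hypothetical helpers that merely share their names with the step's methods. They only look at the path component of a URL, which is what the descriptions above suggest.

```php
// Sketch only: returns the path component of a URL, or '/' if there is none.
function pathOf(string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);

    return is_string($path) ? $path : '/';
}

// Approximates the pathStartsWith() check.
function pathStartsWith(string $url, string $prefix): bool
{
    return str_starts_with(pathOf($url), $prefix);
}

// Approximates the pathMatches() check with a regular expression.
function pathMatches(string $url, string $regex): bool
{
    return preg_match($regex, pathOf($url)) === 1;
}

$startsWithFoo = pathStartsWith('https://www.example.com/foo/bar', '/foo/');     // true
$startsWithOther = pathStartsWith('https://www.example.com/other/bar', '/foo/'); // false
$containsBar = pathMatches('https://www.example.com/foo/bar/baz', '/\/bar\//');  // true
```

Note that the regex /\/bar\// requires a slash after bar, so a URL whose path ends in /bar without a trailing slash does not match.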

Custom Filtering based on URL or Link Element

The customFilter() method allows you to define your own callback function that will be called with any found URL or link:

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Url\Url;

$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->customFilter(function (Url $url) {
                return $url->scheme() === 'https';
            })
    );

So, this example will only load URLs where the URL scheme is https.

In case the URL was found in an HTML document (not in a sitemap), the Closure also receives the link element as a Symfony DomCrawler instance as the second argument:

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Url\Url;
use Symfony\Component\DomCrawler\Crawler;

$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->customFilter(function (Url $url, ?Crawler $linkElement) {
                return $linkElement && str_contains($linkElement->innerText(), 'Foo');
            })
    );

So, this example will only load links when the link text contains Foo.

Load all URLs but yield only matching

When restricting crawling, for example to only paths starting with /foo/, it will only load matching URLs (after the initial input URL). So if some page /some/page contains a link to /foo/quz, the link won't be found, because /some/page won't be loaded. If you want to find all links matching your criteria on the whole website, but yield only the responses of the matching URLs, you can use the loadAllButYieldOnlyMatching() method.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->pathStartsWith('/foo/')
            ->loadAllButYieldOnlyMatching()
    );

This works for restrictions defined using the path methods (pathStartsWith() and pathMatches()) and also for the customFilter() method. Of course, it doesn't affect depth or staying on the same host or domain.
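
The difference between restricting which pages are loaded and restricting which responses are yielded can be shown with a plain-PHP sketch on an in-memory link graph. The $site array and the crawlPaths() function are hypothetical. The sketch only reports paths that match the prefix and does not model how the package treats the response of the initial input URL.

```php
// Hypothetical in-memory link graph: path => links found on that page.
$site = [
    '/' => ['/some/page', '/foo/bar'],
    '/some/page' => ['/foo/quz'],
    '/foo/bar' => [],
    '/foo/quz' => [],
];

// Sketch only: returns the paths whose responses would be yielded.
function crawlPaths(array $site, string $start, string $prefix, bool $loadAll): array
{
    $seen = [$start];
    $queue = [$start];
    $yielded = [];

    while ($queue !== []) {
        $path = array_shift($queue);

        // Only matching pages are yielded, regardless of what gets loaded.
        if (str_starts_with($path, $prefix)) {
            $yielded[] = $path;
        }

        foreach ($site[$path] ?? [] as $link) {
            // Without $loadAll, non-matching links are never loaded,
            // so links that only appear on those pages stay undiscovered.
            $shouldLoad = $loadAll || str_starts_with($link, $prefix);

            if ($shouldLoad && !in_array($link, $seen, true)) {
                $seen[] = $link;
                $queue[] = $link;
            }
        }
    }

    return $yielded;
}

$restricted = crawlPaths($site, '/', '/foo/', false); // misses '/foo/quz'
$loadAll = crawlPaths($site, '/', '/foo/', true);     // finds '/foo/quz' via '/some/page'
```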

Use Canonical Links

If a website delivers the same content via multiple URLs (for example example.com/products?productId=123 and example.com/products/123), it can use canonical links to tell crawlers whether a page is a duplicate of another one and which one is the main URL. If you want to avoid loading the same document multiple times, you can tell the Http::crawl() step to use canonical links by calling its useCanonicalLinks() method.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/foo')
    ->addStep(
        Http::crawl()
            ->useCanonicalLinks()
    );

When that method is called, the step will not yield a response if its canonical link URL was already yielded before. If it discovers a link and some document pointing to that URL via a canonical link was already loaded, the newly discovered link is treated as if it was already loaded. This feature also sets the canonical link URL as the effectiveUri of the response.
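
The deduplication part of this behavior can be sketched in plain PHP. The sketch is simplified and hypothetical: it starts from a list of already loaded documents with their canonical URLs, whereas the step described here can also avoid loading a duplicate in the first place. The yieldOncePerCanonical() function and the $documents array are not part of the package.

```php
// Hypothetical loaded documents: requested URL => canonical link URL found in the document.
$documents = [
    'https://example.com/products?productId=123' => 'https://example.com/products/123',
    'https://example.com/products/123' => 'https://example.com/products/123',
    'https://example.com/about' => 'https://example.com/about',
];

// Sketch only: yields each canonical URL once and skips duplicates.
function yieldOncePerCanonical(array $documents): array
{
    $seenCanonicals = [];
    $yielded = [];

    foreach ($documents as $requestedUrl => $canonicalUrl) {
        if (isset($seenCanonicals[$canonicalUrl])) {
            continue; // A document with this canonical URL was already yielded.
        }

        $seenCanonicals[$canonicalUrl] = true;

        // The canonical URL replaces the requested URL, similar to the effectiveUri behavior.
        $yielded[] = $canonicalUrl;
    }

    return $yielded;
}

$yielded = yieldOncePerCanonical($documents);
```

Here the product document is yielded only once, under its canonical URL, although it was reachable via two different URLs.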

Keep URL Fragments

By default, the Http::crawl() step throws away the fragment part of all discovered URLs (example.com/path#fragment => example.com/path), because websites only very rarely respond with different content based on the fragment part. If a site that you're crawling does so, you can tell the step to keep the URL fragment by calling the keepUrlFragment() method.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/something')
    ->addStep(
        Http::crawl()
            ->keepUrlFragment()
    );
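
As a small illustration of what throwing away the fragment means, the following plain-PHP sketch cuts a URL at the first # character unless told to keep it. The normalizeUrl() function is hypothetical and not part of the package.

```php
// Sketch only: removes the fragment part of a URL unless $keepFragment is true.
function normalizeUrl(string $url, bool $keepFragment = false): string
{
    if ($keepFragment) {
        return $url;
    }

    $position = strpos($url, '#');

    return $position === false ? $url : substr($url, 0, $position);
}

$withoutFragment = normalizeUrl('https://example.com/path#fragment');
$withFragment = normalizeUrl('https://example.com/path#fragment', true);
```

Without the fragment, example.com/path#one and example.com/path#two are the same URL, so the page is loaded only once.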