Documentation for crwlr / crawler (v0.6)

Attention: You're currently viewing the documentation for v0.6 of the crawler package.
This is not the latest version of the package.
If you didn't navigate to this version intentionally, you can click here to switch to the latest version.

HTTP Steps

The Http step implements the LoadingStepInterface and automatically receives the crawler's Loader when added to the crawler.

HTTP Requests

There are static methods to get steps for all the different HTTP methods:

use Crwlr\Crawler\Steps\Loading\Http;

Http::get();
Http::post();
Http::put();
Http::patch();
Http::delete();

They all have optional parameters for headers, body (if available for method) and HTTP version:

use Crwlr\Crawler\Steps\Loading\Http;

Http::get(array $headers = [], string $httpVersion = '1.1');

Http::post(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1'
)

Http::put(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1',
);

Http::patch(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1',
);

Http::delete(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1',
);

Crawling (whole Websites)

If you want to crawl a whole website the Http::crawl() step is for you. By default, it just follows all the links it finds until everything on the same host is loaded.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/')
    ->addStep(Http::crawl());

Depth

You can also tell it to only follow links to a certain depth.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->depth(2)
    );

This means, it will load all the URLs it finds on the page from the initial input (in this case https://www.example.com/), then all the links it finds on those found links, and then it'll stop. With a depth of 3 it will load another level of newly found links.

Start with a sitemap

By using the inputIsSitemap() method, you can start crawling with a sitemap.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/sitemap.xml')
    ->addStep(
        Http::crawl()
            ->inputIsSitemap()
    );

The crawl step usually assumes that all input URLs will deliver HTML documents, so if you want to start crawling with a sitemap, the call to this method is necessary.

Load URLs on the same domain (instead of host)

As mentioned, by default, it loads all the pages on the same host. So, for example www.example.com. If there's a link to https://jobs.example.com/foo, it won't follow that link, as it is on jobs.example.com. But you can tell it to also load all URLs on the same domain, using the sameDomain() method:

use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->sameDomain()
    );

Only load URLs matching path criteria

There's two methods that you can use to tell it, to only load URLs with certain paths:

pathStartsWith()

use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->pathStartsWith('/foo/')
    );

In this case it will only load found URLs where the path starts with /foo/, so for example: https://www.example.com/foo/bar, but not https://www.example.com/other/bar.

pathMatches()

use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->pathMatches('/\/bar\//')
    );

The pathMatches() method takes a regex to match the paths of found URLs. So in this case it will load all URLs containing /bar/ anywhere in the path.

Custom Filtering based on URL or Link Element

The customFilter() method allows you to define your own callback function that will be called with any found URL or link:

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Url\Url;

$crawler->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->customFilter(function (Url $url) {
                return $url->scheme() === 'https';
            })
    );

So, this example will only load URLs where the URL scheme is https.

In case the URL was found in an HTML document (not in a sitemap), the Closure also receives the link element as a Symfony DomCrawler instance as the second argument:

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Url\Url;
use Symfony\Component\DomCrawler\Crawler;

$crawler->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->customFilter(function (Url $url, ?Crawler $linkElement) {
                return $linkElement && str_contains($linkElement->innerText(), 'Foo');
            })
    );

So, this example will only load links when the link text contains Foo.

Load all URLs but yield only matching

When restricting crawling e.g. to only paths starting with /foo/, it will only load matching URLs (after the initial input URL). So if some page /some/page contains a link to /foo/quz, the link won't be found, because the /some/page won't be loaded. If you want to find all links matching your criteria, on the whole website, but yield only the responses of the matching URLs, you can use the loadAllButYieldOnlyMatching() method.

use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->pathStartsWith('/foo')
            ->loadAllButYieldOnlyMatching()
    );

This works for restrictions defined using the path methods (pathStartsWith() and pathMatches()) and also for the customFilter() method. Of course, it doesn't affect depth or staying on the same host or domain.