Documentation for crwlr / crawler (v1.8)

Attention: You're currently viewing the documentation for v1.8 of the crawler package.
This is not the latest version of the package.

Step Output Filters

Steps extending the abstract Crwlr\Crawler\Steps\Step class provided by the package include where() and orWhere() methods for filtering outputs. Here’s an example demonstrating their use:

use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Json;

$json = <<<JSON
{
    "queenAlbums": [
        { "title": "Queen", "year": 1973, "charts": { "uk": 24, "us": 83 } },
        { "title": "Queen II", "year": 1974, "charts": { "uk": 5, "us": 49 } },
        { "title": "A Night at the Opera", "year": 1975, "charts": { "uk": 1, "us": 4 } },
        { "title": "A Day at the Races", "year": 1976, "charts": { "uk": 1, "us": 5 } },
        { "title": "The Game", "year": 1980, "charts": { "uk": 1, "us": 1 } },
        { "title": "A Kind of Magic", "year": 1986, "charts": { "uk": 1, "us": 46 } }
    ]
}
JSON;

$crawler = new MyCrawler();

$crawler->input($json);

$crawler->addStep(
    Json::each('queenAlbums', ['title', 'year', 'chartsUK' => 'charts.uk', 'chartsUS' => 'charts.us'])
        ->where('year', Filter::greaterThan(1979))
        ->where('chartsUS', Filter::equal(1))
);

The where() and orWhere() methods take up to two parameters. The first is the key in the step’s output array (or object) that identifies the value you want to filter by. If the step produces a single scalar output (without any keys), you can omit the key and pass only the filter.

The second parameter is a filter object, which defines the filter condition. Filters are created via static methods on the Filter class, making them straightforward to use.

In the example above, the result includes only the album "The Game", as it is the only one in the list that was released after 1979 and reached #1 on the US charts.
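
To make the combined effect of the two chained where() calls concrete, here is a minimal sketch in plain PHP, without the package. The $albums array and the array_filter() call are illustrative stand-ins for the step's outputs, not the package's implementation, and the strict === comparison is this sketch's own choice.

```php
<?php

// Flattened rows, shaped like the outputs of the Json::each() step above.
$albums = [
    ['title' => 'Queen', 'year' => 1973, 'chartsUK' => 24, 'chartsUS' => 83],
    ['title' => 'Queen II', 'year' => 1974, 'chartsUK' => 5, 'chartsUS' => 49],
    ['title' => 'A Night at the Opera', 'year' => 1975, 'chartsUK' => 1, 'chartsUS' => 4],
    ['title' => 'A Day at the Races', 'year' => 1976, 'chartsUK' => 1, 'chartsUS' => 5],
    ['title' => 'The Game', 'year' => 1980, 'chartsUK' => 1, 'chartsUS' => 1],
    ['title' => 'A Kind of Magic', 'year' => 1986, 'chartsUK' => 1, 'chartsUS' => 46],
];

// Both conditions must hold, mirroring the two chained where() calls.
$matches = array_filter(
    $albums,
    fn (array $album): bool => $album['year'] > 1979 && $album['chartsUS'] === 1,
);

// array_column() returns a re-indexed list of the remaining titles.
$titles = array_column($matches, 'title');

print_r($titles); // contains only "The Game"
```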

Here’s another example, where a scalar value output is filtered, so no key is provided:

use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Html;

$crawler->addStep(
    Html::getLink('.linkClass')
        ->where(Filter::urlDomain('crwlr.software'))
);

As previously mentioned, there is also an orWhere() method. Extending the first example, you could add an orWhere() like this:

use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Json;

$crawler->addStep(
    Json::each('queenAlbums', ['title', 'year', 'chartsUK' => 'charts.uk', 'chartsUS' => 'charts.us'])
        ->where('year', Filter::greaterThan(1979))
        ->where('chartsUS', Filter::equal(1))
        ->orWhere('chartsUK', Filter::equal(1))
);

This also returns "A Kind of Magic", because it reached #1 in the UK.
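
The following plain-PHP sketch shows one way to read this chain. It assumes that orWhere() is combined with the where() call directly before it, which is the reading that matches the result described above. This page does not spell out the grouping, so treat it as an assumption and verify it against the package if it matters. The reduced $albums array is illustrative.

```php
<?php

// Three of the albums, flattened like the step's outputs.
$albums = [
    ['title' => 'A Day at the Races', 'year' => 1976, 'chartsUK' => 1, 'chartsUS' => 5],
    ['title' => 'The Game', 'year' => 1980, 'chartsUK' => 1, 'chartsUS' => 1],
    ['title' => 'A Kind of Magic', 'year' => 1986, 'chartsUK' => 1, 'chartsUS' => 46],
];

// Assumed grouping: year > 1979 AND (chartsUS = 1 OR chartsUK = 1).
$matches = array_filter(
    $albums,
    fn (array $album): bool => $album['year'] > 1979
        && ($album['chartsUS'] === 1 || $album['chartsUK'] === 1),
);

$titles = array_column($matches, 'title');

print_r($titles); // "The Game" and "A Kind of Magic"
```

Under this reading, "A Day at the Races" is still excluded: it was #1 in the UK, but it fails the year condition.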

Negating filters

Any filter can be inverted by using the negate() method, which is available on all filter objects.

use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Sitemap;

Sitemap::getUrlsFromSitemap()
    ->where(
        Filter::urlPathStartsWith('/foo')->negate()
    );

This step filters URLs from a sitemap to include only those whose paths do not start with /foo.
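
As a rough illustration of what the negated filter keeps, here is a plain-PHP sketch without the package. It assumes the path check is a plain string-prefix comparison (str_starts_with() on the URL path). The example.com URLs are made up for the example.

```php
<?php

$urls = [
    'https://www.example.com/foo/bar',
    'https://www.example.com/blog/post-1',
    'https://www.example.com/foobar',
];

// Keep a URL only if its path does NOT start with /foo.
$kept = array_values(array_filter(
    $urls,
    fn (string $url): bool => !str_starts_with((string) parse_url($url, PHP_URL_PATH), '/foo'),
));

print_r($kept); // only the /blog/post-1 URL remains
```

Note that under a plain prefix comparison, /foobar is dropped as well, because it also starts with /foo.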

Available Filters

Comparison Filters

use Crwlr\Crawler\Steps\Filters\Filter;

Filter::equal(mixed $toValue);
Filter::notEqual(mixed $value);
Filter::greaterThan(mixed $value);
Filter::greaterThanOrEqual(mixed $value);
Filter::lessThan(mixed $value);
Filter::lessThanOrEqual(mixed $value);

String Filters

use Crwlr\Crawler\Steps\Filters\Filter;

Filter::stringContains(string $string);                 // uses PHP's str_contains()
Filter::stringStartsWith(string $string);               // str_starts_with()
Filter::stringEndsWith(string $string);                 // str_ends_with()
Filter::stringLengthEqual(int $length);                 // strlen($outputValue) === $length
Filter::stringLengthNotEqual(int $length);              // strlen($outputValue) !== $length
Filter::stringLengthGreaterThan(int $length);           // strlen($outputValue) > $length
Filter::stringLengthGreaterThanOrEqual(int $length);    // strlen($outputValue) >= $length
Filter::stringLengthLessThan(int $length);              // strlen($outputValue) < $length
Filter::stringLengthLessThanOrEqual(int $length);       // strlen($outputValue) <= $length
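
Since the length filters are described in terms of strlen(), it may help to remember that PHP's strlen() counts bytes, not characters. A small sketch with made-up sample strings:

```php
<?php

$ascii = 'Queen II';
// "Café", written with a Unicode escape so the bytes are unambiguous.
$accented = "Caf\u{e9}";

var_dump(strlen($ascii));    // int(8)
var_dump(strlen($accented)); // int(5): 4 characters, but "é" takes 2 bytes in UTF-8
```

So for outputs containing multibyte characters, a length filter based on strlen() compares against the byte length.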

URL Filters

use Crwlr\Crawler\Steps\Filters\Filter;

Filter::urlScheme(string $scheme);              // e.g. http, https, ftp,...
Filter::urlHost(string $host);                  // www.crwlr.software
Filter::urlDomain(string $domain);              // crwlr.software
Filter::urlPath(string $path);                  // /exact/path
Filter::urlPathStartsWith(string $pathStart);   // /foo
Filter::urlPathMatches(string $regex);          // Regex (without delimiters) that the path has to match.
                                                // Like: ^/\d{1,5}/
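
To illustrate what "regex without delimiters" means for urlPathMatches(), here is a plain-PHP sketch. The pathMatches() helper is hypothetical and not part of the package, and the '~' delimiter is this sketch's own choice. How the package wraps the pattern internally is not stated here.

```php
<?php

// Hypothetical helper, not part of the package.
function pathMatches(string $url, string $regex): bool
{
    // Only the path component of the URL is tested.
    $path = (string) parse_url($url, PHP_URL_PATH);

    // '~' as delimiter is this sketch's choice, so slashes need no escaping.
    return preg_match('~' . $regex . '~', $path) === 1;
}

var_dump(pathMatches('https://www.crwlr.software/123/details', '^/\d{1,5}/')); // bool(true)
var_dump(pathMatches('https://www.crwlr.software/blog/123', '^/\d{1,5}/'));    // bool(false)
```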

Custom Filter Callback

use Crwlr\Crawler\Steps\Filters\Filter;

Filter::custom(function (mixed $outputValue) {
    if (/* $outputValue should be passed on */) {
        return true;
    }

    return false; // Throw this output away
});
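
As a concrete instance of such a callback, the sketch below defines a closure that could be passed to Filter::custom(). The closure name $mentionsQueen and its condition are made up for illustration. It is called directly here so the example runs without the package.

```php
<?php

// Hypothetical callback: keep only string outputs that mention "Queen".
$mentionsQueen = function (mixed $outputValue): bool {
    return is_string($outputValue) && str_contains($outputValue, 'Queen');
};

var_dump($mentionsQueen('Queen II')); // bool(true)
var_dump($mentionsQueen('The Game')); // bool(false)
var_dump($mentionsQueen(1980));       // bool(false)
```

With the package installed, it would be attached to a step like any other filter, for example ->where('title', Filter::custom($mentionsQueen)).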