Documentation for crwlr / crawler (v3.2)

Step Output Filters

Steps extending the abstract Crwlr\Crawler\Steps\Step class provided by the package include where() and orWhere() methods for filtering outputs. Here’s an example demonstrating their use:

use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Json;

$json = <<<JSON
{
    "queenAlbums": [
        { "title": "Queen", "year": 1973, "charts": { "uk": 24, "us": 83 } },
        { "title": "Queen II", "year": 1974, "charts": { "uk": 5, "us": 49 } },
        { "title": "A Night at the Opera", "year": 1975, "charts": { "uk": 1, "us": 4 } },
        { "title": "A Day at the Races", "year": 1976, "charts": { "uk": 1, "us": 5 } },
        { "title": "The Game", "year": 1980, "charts": { "uk": 1, "us": 1 } },
        { "title": "A Kind of Magic", "year": 1986, "charts": { "uk": 1, "us": 46 } }
    ]
}
JSON;

$crawler = new MyCrawler();

$crawler->input($json);

$crawler->addStep(
    Json::each('queenAlbums', ['title', 'year', 'chartsUK' => 'charts.uk', 'chartsUS' => 'charts.us'])
        ->where('year', Filter::greaterThan(1979))
        ->where('chartsUS', Filter::equal(1))
);

The where() and orWhere() methods take up to two arguments. The first is the key in the step’s output array (or object) that identifies the value you want to filter by. If the step produces a single scalar value (without any keys), you can omit the key and pass only the filter.

The second argument (or the only one, when no key is given) is a filter object that defines the condition. Filters are created via static methods on the Filter class, which makes them straightforward to use.

In the example above, the result contains only the album "The Game": it is the only one in the list that was released after 1979 and reached #1 on the US charts.
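
To make the combination logic concrete, here is a minimal plain-PHP sketch that mimics the two chained where() calls with array_filter(). It does not use the crawler package: the $albums array and the $matches closure are hypothetical stand-ins for the step outputs and the filters, and it assumes that all chained where() conditions have to pass.

```php
<?php

// Hypothetical stand-in for the outputs of Json::each() in the example above.
$albums = [
    ['title' => 'Queen', 'year' => 1973, 'chartsUK' => 24, 'chartsUS' => 83],
    ['title' => 'Queen II', 'year' => 1974, 'chartsUK' => 5, 'chartsUS' => 49],
    ['title' => 'A Night at the Opera', 'year' => 1975, 'chartsUK' => 1, 'chartsUS' => 4],
    ['title' => 'A Day at the Races', 'year' => 1976, 'chartsUK' => 1, 'chartsUS' => 5],
    ['title' => 'The Game', 'year' => 1980, 'chartsUK' => 1, 'chartsUS' => 1],
    ['title' => 'A Kind of Magic', 'year' => 1986, 'chartsUK' => 1, 'chartsUS' => 46],
];

// Both conditions have to hold, like two chained where() calls.
$matches = fn (array $album): bool => $album['year'] > 1979 && $album['chartsUS'] === 1;

// array_filter() keeps the matching albums, array_column() extracts their titles.
$titles = array_column(array_filter($albums, $matches), 'title');

print_r($titles); // only "The Game" remains
```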

Here’s another example, where a scalar value output is filtered, so no key is provided:

use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Html;

$crawler->addStep(
    Html::getLink('.linkClass')
        ->where(Filter::urlDomain('crwlr.software'))
);

As mentioned, there is also an orWhere() method. Taking the Queen albums example from the beginning, you could add an orWhere() like this:

use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Json;

$crawler->addStep(
    Json::each('queenAlbums', ['title', 'year', 'chartsUK' => 'charts.uk', 'chartsUS' => 'charts.us'])
        ->where('year', Filter::greaterThan(1979))
        ->where('chartsUS', Filter::equal(1))
        ->orWhere('chartsUK', Filter::equal(1))
);

This also returns "A Kind of Magic", since it was released after 1979 and reached #1 in the UK.
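
The stated result suggests that orWhere() acts as an alternative to the where() call directly before it, not to the whole chain. Otherwise the 1970s albums that reached #1 in the UK would match as well. The following plain-PHP sketch compares both readings. The grouping is an assumption derived from the result described here, not taken from the package's source, and the $albums, $grouped and $flat names are hypothetical.

```php
<?php

// Hypothetical stand-in for the step outputs: [title, year, chartsUK, chartsUS].
$albums = [
    ['A Night at the Opera', 1975, 1, 4],
    ['A Day at the Races', 1976, 1, 5],
    ['The Game', 1980, 1, 1],
    ['A Kind of Magic', 1986, 1, 46],
];

// Assumed grouping: year > 1979 AND (chartsUS === 1 OR chartsUK === 1).
$grouped = fn (array $a): bool => $a[1] > 1979 && ($a[3] === 1 || $a[2] === 1);

// Alternative reading, for comparison: (year > 1979 AND chartsUS === 1) OR chartsUK === 1.
$flat = fn (array $a): bool => ($a[1] > 1979 && $a[3] === 1) || $a[2] === 1;

$groupedTitles = array_column(array_filter($albums, $grouped), 0);
$flatTitles = array_column(array_filter($albums, $flat), 0);

print_r($groupedTitles); // The Game, A Kind of Magic
print_r($flatTitles);    // all four albums
```

Only the grouped reading matches the documented result, so it is the one to keep in mind when ordering where() and orWhere() calls.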

Negating filters

Any filter can be inverted by using the negate() method, which is available on all filter objects.

use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Sitemap;

Sitemap::getUrlsFromSitemap()
    ->where(
        Filter::urlPathStartsWith('/foo')->negate()
    );

This step filters URLs from a sitemap to include only those whose paths do not start with /foo.
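
As a rough illustration of what negation means, here is a plain-PHP sketch that wraps a check and inverts its result. The $pathStartsWith and $negate helpers and the example URLs are hypothetical, and the sketch assumes the path check is a simple string prefix comparison. It does not reflect how the package implements negate().

```php
<?php

// Hypothetical helper: builds a check that is true when the URL's path starts with the prefix.
$pathStartsWith = fn (string $prefix): callable =>
    fn (string $url): bool => str_starts_with((string) parse_url($url, PHP_URL_PATH), $prefix);

// Hypothetical helper: wraps a check and inverts its result.
$negate = fn (callable $check): callable => fn (string $url): bool => !$check($url);

$urls = [
    'https://www.example.com/foo/bar',
    'https://www.example.com/blog/post',
    'https://www.example.com/foobar',
];

$kept = array_values(array_filter($urls, $negate($pathStartsWith('/foo'))));

print_r($kept); // only the /blog/post URL remains
```

Note that with a plain prefix comparison, /foobar also counts as starting with /foo and is therefore dropped.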

Available Filters

Comparison Filters

use Crwlr\Crawler\Steps\Filters\Filter;

Filter::equal(mixed $toValue);
Filter::notEqual(mixed $value);
Filter::greaterThan(mixed $value);
Filter::greaterThanOrEqual(mixed $value);
Filter::lessThan(mixed $value);
Filter::lessThanOrEqual(mixed $value);

String Filters

use Crwlr\Crawler\Steps\Filters\Filter;

Filter::stringContains(string $string);                 // uses PHP's str_contains()
Filter::stringStartsWith(string $string);               // str_starts_with()
Filter::stringEndsWith(string $string);                 // str_ends_with()
Filter::stringLengthEqual(int $length);                 // strlen($outputValue) === $length
Filter::stringLengthNotEqual(int $length);              // strlen($outputValue) !== $length
Filter::stringLengthGreaterThan(int $length);           // strlen($outputValue) > $length
Filter::stringLengthGreaterThanOrEqual(int $length);    // strlen($outputValue) >= $length
Filter::stringLengthLessThan(int $length);              // strlen($outputValue) < $length
Filter::stringLengthLessThanOrEqual(int $length);       // strlen($outputValue) <= $length
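
Since the length filters are described in terms of strlen(), it helps to remember that strlen() counts bytes, not characters. A short plain-PHP illustration with hypothetical example values:

```php
<?php

$ascii = 'Queen';
$multibyte = "Mot\u{00F6}rhead"; // "Motörhead": the "ö" takes two bytes in UTF-8

echo strlen($ascii), PHP_EOL;     // 5
echo strlen($multibyte), PHP_EOL; // 10, although the string has 9 characters
```

If the filters behave as the comments above indicate, a value with multibyte characters would be compared by its byte length, so pick the expected length accordingly.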

URL Filters

use Crwlr\Crawler\Steps\Filters\Filter;

Filter::urlScheme(string $scheme);              // e.g. http, https, ftp,...
Filter::urlHost(string $host);                  // www.crwlr.software
Filter::urlDomain(string $domain);              // crwlr.software
Filter::urlPath(string $path);                  // /exact/path
Filter::urlPathStartsWith(string $pathStart);   // /foo
Filter::urlPathMatches(string $regex);          // Regex (without delimiters) that the path has to match.
                                                // Like: ^/\d{1,5}/
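
To see which part of a URL each of these filters refers to, here is a plain-PHP sketch using parse_url() and preg_match(). It only illustrates the URL components. The example URL and path are made up, and wrapping the regex in "~" delimiters is an assumption for this sketch, not a statement about how the package does it. The registrable domain (crwlr.software) cannot be read from parse_url() alone and is not covered here.

```php
<?php

$url = 'https://www.crwlr.software/packages/crawler?foo=bar';

$parts = parse_url($url);

echo $parts['scheme'], PHP_EOL; // https
echo $parts['host'], PHP_EOL;   // www.crwlr.software
echo $parts['path'], PHP_EOL;   // /packages/crawler

// A path regex written without delimiters, wrapped in "~" here (an assumption for this sketch).
$regex = '^/\d{1,5}/';
$matches = preg_match('~' . $regex . '~', '/123/some-article') === 1;

var_dump($matches); // bool(true)
```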

Array Filters (Nesting)

All filters listed above are applied to simple scalar values. But what if a step produces outputs where a property contains an array? Something like ['foo' => 'value', 'bar' => ['an', 'array', 'of', 'values']]. In such cases, you can use array filters to apply these same filters to the elements within array properties.

Filter::arrayHasElement()

This filter allows you to filter outputs based on whether at least one element of an array property matches specific filter criteria.

For example, imagine your step yields outputs like this:

[
    ['project' => 'foo', 'languages' => ['php', 'javascript']],
    ['project' => 'bar', 'languages' => ['python', 'go']],
    ['project' => 'baz', 'languages' => ['java', 'kotlin']],
]

If you want to include only outputs where the languages array contains php or java, you can write:

$step->where(
    'languages',
    Filter::arrayHasElement()
        ->where(Filter::equal('php'))
        ->orWhere(Filter::equal('java')),
);
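
The "at least one element matches" logic can be sketched in plain PHP as follows. The hasElement() function and the $isPhpOrJava closure are hypothetical helpers written for this illustration, not part of the package.

```php
<?php

$outputs = [
    ['project' => 'foo', 'languages' => ['php', 'javascript']],
    ['project' => 'bar', 'languages' => ['python', 'go']],
    ['project' => 'baz', 'languages' => ['java', 'kotlin']], // named 'baz' here to tell it apart
];

// Hypothetical helper: true if at least one array element passes the check.
function hasElement(array $elements, callable $check): bool
{
    foreach ($elements as $element) {
        if ($check($element)) {
            return true;
        }
    }

    return false;
}

$isPhpOrJava = fn (string $language): bool => $language === 'php' || $language === 'java';

$kept = array_filter(
    $outputs,
    fn (array $output): bool => hasElement($output['languages'], $isPhpOrJava)
);

print_r(array_column($kept, 'project')); // foo and baz
```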

Now let’s look at a more complex example, where the array elements are associative arrays with keys, like this:

[
    [
        'project' => 'foo',
        'categories' => [
            ['name' => 'foo', 'id' => '123'],
            ['name' => 'bar', 'id' => '234'],
        ],
    ],
    [
        'project' => 'bar',
        'categories' => [
            ['name' => 'bar', 'id' => '234'],
            ['name' => 'baz', 'id' => '345'],
        ],
    ],
]

You can filter outputs where the categories array contains an element with the name baz like this:

$step
    ->where(
        'categories',
        Filter::arrayHasElement()
            ->where('name', Filter::equal('baz')),
    );
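
For comparison, the same selection can be written in plain PHP without the package. This is only a sketch of the intended result, built from array_column() and in_array():

```php
<?php

$outputs = [
    ['project' => 'foo', 'categories' => [['name' => 'foo', 'id' => '123'], ['name' => 'bar', 'id' => '234']]],
    ['project' => 'bar', 'categories' => [['name' => 'bar', 'id' => '234'], ['name' => 'baz', 'id' => '345']]],
];

// Keep outputs where any category element has the name 'baz'.
$kept = array_filter(
    $outputs,
    fn (array $output): bool => in_array('baz', array_column($output['categories'], 'name'), true)
);

print_r(array_column($kept, 'project')); // bar
```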

For very complex outputs, you can even nest multiple levels. For instance, instead of Filter::equal('baz') in the example above, you could pass another Filter::arrayHasElement().

Custom Filter Callback

Finally, you can create your own custom filter by providing a callback function to the Filter::custom() method.

use Crwlr\Crawler\Steps\Filters\Filter;

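Filter::custom(function (mixed $outputValue) {
    // Placeholder condition, replace it with your own check.
    $shouldBePassedOn = $outputValue !== null;

    if ($shouldBePassedOn) {
        return true; // Pass this output on
    }

    return false; // Throw this output away
});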
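
To try out a callback before wiring it into a step, you can run it against sample values with array_filter(). The callback and the sample values below are hypothetical. The sketch only shows the true/false contract described above.

```php
<?php

// Hypothetical callback: pass on only non-empty string values.
$callback = function (mixed $outputValue): bool {
    return is_string($outputValue) && trim($outputValue) !== '';
};

$outputs = ['Queen', '', '   ', 42, 'The Game'];

// array_filter() keeps the values for which the callback returns true.
$kept = array_values(array_filter($outputs, $callback));

print_r($kept); // Queen, The Game
```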