Step Output Filters
Steps that extend the package’s abstract Crwlr\Crawler\Steps\Step class provide where() and orWhere() methods for filtering their outputs. Here’s an example demonstrating their use:
use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Json;

$json = <<<JSON
{
    "queenAlbums": [
        { "title": "Queen", "year": 1973, "charts": { "uk": 24, "us": 83 } },
        { "title": "Queen II", "year": 1974, "charts": { "uk": 5, "us": 49 } },
        { "title": "A Night at the Opera", "year": 1975, "charts": { "uk": 1, "us": 4 } },
        { "title": "A Day at the Races", "year": 1976, "charts": { "uk": 1, "us": 5 } },
        { "title": "The Game", "year": 1980, "charts": { "uk": 1, "us": 1 } },
        { "title": "A Kind of Magic", "year": 1986, "charts": { "uk": 1, "us": 46 } }
    ]
}
JSON;

$crawler = new MyCrawler();

$crawler->input($json);

$crawler->addStep(
    Json::each('queenAlbums', ['title', 'year', 'chartsUK' => 'charts.uk', 'chartsUS' => 'charts.us'])
        ->where('year', Filter::greaterThan(1979))
        ->where('chartsUS', Filter::equal(1))
);
The where() and orWhere() methods take up to two parameters. The first is the key in the step’s output array (or object) that identifies the value you want to filter by. If the step produces a single scalar output (without any keys), omit the key and pass only the filter.
The other parameter is a filter object, which defines the filter condition. Filters are created via static methods on the Filter class.
In the example above, the result includes only the album "The Game": it is the only one in the list that was released after 1979 and reached #1 on the US charts.
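To make the combined effect of the two where() calls concrete, here is a minimal plain-PHP sketch that applies the same two conditions with array_filter(). It does not use the library; the $albums array and the variable names are illustrative stand-ins for the outputs of Json::each(), and the sketch only mirrors the result described above.

```php
<?php

// Illustrative data: the outputs Json::each() would yield in the example above.
$albums = [
    ['title' => 'Queen', 'year' => 1973, 'chartsUK' => 24, 'chartsUS' => 83],
    ['title' => 'Queen II', 'year' => 1974, 'chartsUK' => 5, 'chartsUS' => 49],
    ['title' => 'A Night at the Opera', 'year' => 1975, 'chartsUK' => 1, 'chartsUS' => 4],
    ['title' => 'A Day at the Races', 'year' => 1976, 'chartsUK' => 1, 'chartsUS' => 5],
    ['title' => 'The Game', 'year' => 1980, 'chartsUK' => 1, 'chartsUS' => 1],
    ['title' => 'A Kind of Magic', 'year' => 1986, 'chartsUK' => 1, 'chartsUS' => 46],
];

// Two chained where() calls act like a logical AND:
// an output is kept only if it satisfies both conditions.
$kept = array_filter(
    $albums,
    fn (array $album): bool => $album['year'] > 1979 && $album['chartsUS'] === 1
);

// array_column() returns a re-indexed list of the remaining titles.
$titles = array_column($kept, 'title');

print_r($titles);
```

Only "The Game" passes both conditions; "A Kind of Magic" was released after 1979 but fails the US chart condition.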
Here’s another example, where a scalar value output is filtered, so no key is provided:
use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Html;

$crawler->addStep(
    Html::getLink('.linkClass')
        ->where(Filter::urlDomain('crwlr.software'))
);
As previously mentioned, there is also an orWhere() method. Extending the Queen albums example from above, you could add an orWhere() like this:
use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Json;

$crawler->addStep(
    Json::each('queenAlbums', ['title', 'year', 'chartsUK' => 'charts.uk', 'chartsUS' => 'charts.us'])
        ->where('year', Filter::greaterThan(1979))
        ->where('chartsUS', Filter::equal(1))
        ->orWhere('chartsUK', Filter::equal(1))
);
This additionally returns "A Kind of Magic", as it reached #1 in the UK.
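The stated result suggests that the orWhere() condition acts as an alternative to the where() directly before it, while the year condition still has to hold. The following plain-PHP sketch expresses that reading. It is an assumption inferred from the example's result, not code taken from the library, and the $albums array is illustrative.

```php
<?php

// Illustrative data: the outputs Json::each() would yield in the example above.
$albums = [
    ['title' => 'Queen', 'year' => 1973, 'chartsUK' => 24, 'chartsUS' => 83],
    ['title' => 'Queen II', 'year' => 1974, 'chartsUK' => 5, 'chartsUS' => 49],
    ['title' => 'A Night at the Opera', 'year' => 1975, 'chartsUK' => 1, 'chartsUS' => 4],
    ['title' => 'A Day at the Races', 'year' => 1976, 'chartsUK' => 1, 'chartsUS' => 5],
    ['title' => 'The Game', 'year' => 1980, 'chartsUK' => 1, 'chartsUS' => 1],
    ['title' => 'A Kind of Magic', 'year' => 1986, 'chartsUK' => 1, 'chartsUS' => 46],
];

// Assumed grouping: year > 1979 AND (chartsUS == 1 OR chartsUK == 1).
$kept = array_filter(
    $albums,
    fn (array $album): bool =>
        $album['year'] > 1979
        && ($album['chartsUS'] === 1 || $album['chartsUK'] === 1)
);

$titles = array_column($kept, 'title');

print_r($titles);
```

Under this grouping, "The Game" and "A Kind of Magic" remain. If the conditions were instead read as (year AND US) OR UK, "A Night at the Opera" and "A Day at the Races" would pass as well, which does not match the result described above.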
Negating filters
Any filter can be inverted by calling the negate() method, which is available on all filter objects.
use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\Steps\Sitemap;

Sitemap::getUrlsFromSitemap()
    ->where(
        Filter::urlPathStartsWith('/foo')->negate()
    );
This step keeps only those URLs from the sitemap whose paths do not start with /foo.
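As a rough illustration of what negation does, the following plain-PHP sketch models a filter as a closure that returns true or false and then flips its verdict. It does not use the library; the URLs and closure names are made up for the example, and the prefix check is only a stand-in for Filter::urlPathStartsWith('/foo').

```php
<?php

// Hypothetical sitemap URLs.
$urls = [
    'https://www.example.com/foo/bar',
    'https://www.example.com/blog/post-1',
    'https://www.example.com/food',
    'https://www.example.com/',
];

// Stand-in for the original filter: true if the URL's path begins with '/foo'.
$pathStartsWithFoo = fn (string $url): bool =>
    str_starts_with((string) parse_url($url, PHP_URL_PATH), '/foo');

// Negating flips the verdict: what used to pass is now rejected, and vice versa.
$negated = fn (string $url): bool => !$pathStartsWithFoo($url);

$kept = array_values(array_filter($urls, $negated));

print_r($kept);
```

In this sketch the check is a plain string prefix, so /food also counts as starting with /foo and is dropped along with /foo/bar.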
Available Filters
Comparison Filters
use Crwlr\Crawler\Steps\Filters\Filter;
Filter::equal(mixed $toValue);
Filter::notEqual(mixed $value);
Filter::greaterThan(mixed $value);
Filter::greaterThanOrEqual(mixed $value);
Filter::lessThan(mixed $value);
Filter::lessThanOrEqual(mixed $value);
String Filters
use Crwlr\Crawler\Steps\Filters\Filter;
Filter::stringContains(string $string); // uses PHP's str_contains()
Filter::stringStartsWith(string $string); // str_starts_with()
Filter::stringEndsWith(string $string); // str_ends_with()
Filter::stringLengthEqual(int $length); // strlen($outputValue) === $length
Filter::stringLengthNotEqual(int $length); // strlen($outputValue) !== $length
Filter::stringLengthGreaterThan(int $length); // strlen($outputValue) > $length
Filter::stringLengthGreaterThanOrEqual(int $length); // strlen($outputValue) >= $length
Filter::stringLengthLessThan(int $length); // strlen($outputValue) < $length
Filter::stringLengthLessThanOrEqual(int $length); // strlen($outputValue) <= $length
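Because the length filters are described in terms of strlen(), it is worth keeping in mind that strlen() counts bytes, not characters. The following plain-PHP snippet illustrates the difference with a made-up UTF-8 value; it does not call the library.

```php
<?php

// "Motörhead": 9 characters, but "ö" occupies 2 bytes in UTF-8.
// The \u{...} escape keeps the example independent of the file's encoding.
$value = "Mot\u{00F6}rhead";

$byteLength = strlen($value);

// A length comparison based on strlen() therefore works with 10, not 9.
$lengthEquals9 = strlen($value) === 9;
$lengthEquals10 = strlen($value) === 10;

var_dump($byteLength, $lengthEquals9, $lengthEquals10);
```

For values containing multi-byte characters, a length condition may therefore need a higher number than the visible character count.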
URL Filters
use Crwlr\Crawler\Steps\Filters\Filter;
Filter::urlScheme(string $scheme); // e.g. http, https, ftp,...
Filter::urlHost(string $host); // www.crwlr.software
Filter::urlDomain(string $domain); // crwlr.software
Filter::urlPath(string $path); // /exact/path
Filter::urlPathStartsWith(string $pathStart); // /foo
Filter::urlPathMatches(string $regex); // Regex (without delimiters) that the path has to match.
// Like: ^/\d{1,5}/
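To show which part of a URL each of these filters looks at, here is a small plain-PHP sketch based on parse_url(). The URL is hypothetical, and wrapping the delimiter-less regex in ~ delimiters is an assumption made for this sketch, not necessarily what the library does internally.

```php
<?php

// Hypothetical URL used only for illustration.
$url = 'https://www.crwlr.software/12345/some-article';

$parts = parse_url($url);

$scheme = $parts['scheme']; // the part urlScheme() compares against
$host = $parts['host'];     // the part urlHost() compares against
$path = $parts['path'];     // the part the urlPath*() filters look at

// A regex passed without delimiters has to be wrapped before it can be
// used with preg_match(). The '~' delimiter is an assumption of this sketch.
$regex = '^/\d{1,5}/';
$pathMatches = preg_match('~' . $regex . '~', $path) === 1;

// Plain prefix check, comparable to urlPathStartsWith('/foo').
$pathStartsWithFoo = str_starts_with($path, '/foo');

var_dump($scheme, $host, $path, $pathMatches, $pathStartsWithFoo);
```

The registrable domain (crwlr.software for the host www.crwlr.software) is not computed here, since deriving it reliably requires knowledge of public suffixes.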
Custom Filter Callback
use Crwlr\Crawler\Steps\Filters\Filter;
Filter::custom(function (mixed $outputValue) {
    if (/* $outputValue should be passed on */) {
        return true;
    }

    return false; // Throw this output away
});
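As a concrete version of the template above, the following sketch defines a callback with a made-up condition: keep only strings that are not blank. The closure is called directly so the snippet runs without the library; the commented line shows how it would presumably be passed to Filter::custom(), with $step being a hypothetical step instance.

```php
<?php

// Hypothetical condition: keep only strings that contain more than whitespace.
$keepNonBlankStrings = function (mixed $outputValue): bool {
    return is_string($outputValue) && trim($outputValue) !== '';
};

// With the library installed, the closure would be used like this
// ($step is a placeholder for any step instance):
// $step->where(Filter::custom($keepNonBlankStrings));

var_dump($keepNonBlankStrings('Queen II')); // bool(true)
var_dump($keepNonBlankStrings('   '));      // bool(false)
var_dump($keepNonBlankStrings(1974));       // bool(false)
```

Returning a strict boolean, as in the template, keeps it unambiguous whether an output is passed on or thrown away.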