Documentation for crwlr / crawler (v2.0)

Attention: You're currently viewing the documentation for v2.0 of the crawler package.
This is not the latest version of the package.

Unique Inputs and Outputs

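Sometimes a data source contains the same items multiple times, but you don't want duplicates in your results. To filter them out, call the uniqueOutputs or uniqueInputs method on any step. uniqueOutputs drops outputs the step has already yielded, and uniqueInputs makes the step skip inputs it has already received: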

use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler
    ->input('https://example.com/listing')
    ->addStep(Http::get())
    ->addStep(
        Html::getLinks('.item a')
            ->uniqueOutputs()
    );

// Run crawler and process results

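With uniqueInputs, the second Http::get() step loads each link only once, even when the listing contains it several times: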

use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler
    ->input('https://example.com/listing')
    ->addStep(Http::get())
    ->addStep(Html::getLinks('.item a'))
    ->addStep(
        Http::get()
            ->uniqueInputs()
    );

// Run crawler and process results

Using a key to check for array/object uniqueness

When the step output is an array (or object), you can improve performance by defining the key that should be used to check for uniqueness:

use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler
    ->input('https://example.com/listing')
    ->addStep(Http::get())
    ->addStep(
        Html::each('.item')
            ->extract([
                'title' => 'h3',
                'price' => '.productPrice',
                'description' => '.text'
            ])
            ->uniqueOutputs('title')
    );

// Run crawler and process results

This is faster because, without a key, the crawler has to build a simple string key for every array (or object) output by serializing and hashing the whole array/object, and then uses that string to check for uniqueness.
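To make the difference concrete, here is a minimal sketch of how such a uniqueness key could be derived. The function uniquenessKey is a hypothetical helper written for this illustration, not part of the package, and the use of md5 is an assumption; the library's internal implementation may differ.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the crwlr API): derive a string that
 * identifies an output for uniqueness checks.
 */
function uniquenessKey(mixed $output, ?string $key = null): string
{
    // With a key: only that single value is compared, e.g. the 'title'.
    if ($key !== null && (is_array($output) || is_object($output))) {
        $values = is_object($output) ? get_object_vars($output) : $output;

        if (isset($values[$key]) && is_scalar($values[$key])) {
            return (string) $values[$key];
        }
    }

    // Scalar outputs (e.g. URLs) can be used more or less directly.
    if (is_scalar($output)) {
        return (string) $output;
    }

    // Without a key: serialize the whole array/object and hash it.
    // (md5 is an assumption made for this sketch.)
    return md5(serialize($output));
}

$a = ['title' => 'Foo', 'price' => '9.99'];
$b = ['title' => 'Foo', 'price' => '12.00'];

var_dump(uniquenessKey($a) === uniquenessKey($b));                   // bool(false)
var_dump(uniquenessKey($a, 'title') === uniquenessKey($b, 'title')); // bool(true)
```

Under these assumptions, a key also changes what counts as a duplicate: two items with the same title but different prices are treated as the same item, so pick a key that really identifies an item.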

That's also how this works without bloating memory consumption: the step is still a Generator function, but it internally remembers only the string keys of the outputs it has already yielded.
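The idea can be sketched in plain PHP as follows. The function uniqueItems is a hypothetical name used only for this illustration and is not the library's actual code; it assumes that scalar items can serve as their own key and that everything else is serialized and hashed with md5.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch (not the library's actual code): wrap any iterable
 * and yield each distinct item only once.
 *
 * @param iterable<mixed> $items
 */
function uniqueItems(iterable $items): Generator
{
    // Only short string keys are remembered, never the items themselves.
    $seen = [];

    foreach ($items as $item) {
        $key = is_scalar($item) ? (string) $item : md5(serialize($item));

        if (isset($seen[$key])) {
            continue; // An equal item was already yielded, so skip it.
        }

        $seen[$key] = true;

        yield $item;
    }
}

$links = [
    'https://example.com/a',
    'https://example.com/b',
    'https://example.com/a',
];

foreach (uniqueItems($links) as $link) {
    echo $link, PHP_EOL;
}
// Prints https://example.com/a and https://example.com/b, each once.
```

Because items are yielded one at a time and only their keys are stored, memory grows with the number of distinct items seen, not with the size of each item.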