Unique Inputs and Outputs
Sometimes a data source contains the same items multiple times, but you don't want duplicates in your results. Just call the `uniqueOutputs()` or `uniqueInputs()` method on any step:
```php
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler
    ->input('https://example.com/listing')
    ->addStep(Http::get())
    ->addStep(
        Html::getLinks('.item a')
            ->uniqueOutputs()
    );

// Run crawler and process results
```
With `uniqueInputs()`:
```php
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler
    ->input('https://example.com/listing')
    ->addStep(Http::get())
    ->addStep(Html::getLinks('.item a'))
    ->addStep(
        Http::get()
            ->uniqueInputs()
    );

// Run crawler and process results
```
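In this pipeline the check happens on the receiving side: the second `Http::get()` step skips URLs it has already received. Presumably you could achieve the same effect here by calling `uniqueOutputs()` on the preceding link-extraction step instead, as in this sketch:

```php
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler
    ->input('https://example.com/listing')
    ->addStep(Http::get())
    ->addStep(
        Html::getLinks('.item a')
            ->uniqueOutputs() // deduplicate the URLs before they are passed on
    )
    ->addStep(Http::get());
```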
Using a key to check for array/object uniqueness
When the step output is an array (or object), you can improve performance by defining a key to use for the uniqueness check:
```php
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler
    ->input('https://example.com/listing')
    ->addStep(Http::get())
    ->addStep(
        Html::each('.item')
            ->extract([
                'title' => 'h3',
                'price' => '.productPrice',
                'description' => '.text',
            ])
            ->uniqueOutputs('title')
    );

// Run crawler and process results
```
This helps because, without a key, the crawler internally has to build a simple string key for each array (or object) by serializing and hashing the whole thing, which gets more expensive the bigger the output is. That's also the secret to how this works without bloating memory consumption: the step is still a generator function, but it internally remembers the string keys of the outputs it has already yielded.
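To illustrate the idea, here is a minimal, hypothetical sketch of such a filtering generator (not the library's actual implementation; the function name and details are made up):

```php
/**
 * Hypothetical sketch, not the library's actual code: yield each output
 * only once, remembering a short hash per item instead of the item itself.
 */
function uniqueOnly(iterable $outputs, ?string $key = null): \Generator
{
    $seen = [];

    foreach ($outputs as $output) {
        // With a key, only that field is hashed; otherwise the whole
        // array/object is serialized first.
        $value = ($key !== null && is_array($output)) ? ($output[$key] ?? null) : $output;

        $stringKey = md5(is_scalar($value) ? (string) $value : serialize($value));

        if (!isset($seen[$stringKey])) {
            $seen[$stringKey] = true;

            yield $output;
        }
    }
}
```

Because only the short hash strings are kept, memory grows with the number of distinct items, not with their size.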