Unique Output
Sometimes a data source contains the same items multiple times, but you don't want duplicates in your results. In that case, call the uniqueOutputs method on any step:
$crawler = new MyCrawler();

$crawler
    ->input('https://example.com/listing')
    ->addStep(Http::get())
    ->addStep(
        Html::getLinks('.item a')
            ->uniqueOutputs()
    );
// Run crawler and process results
Using a key to check for array/object uniqueness
When the step output is an array (or object), you can improve performance by defining a key that should be used to check for uniqueness:
$crawler = new MyCrawler();

$crawler
    ->input('https://example.com/listing')
    ->addStep(Http::get())
    ->addStep(
        Html::each('.item')
            ->extract([
                'title' => 'h3',
                'price' => '.productPrice',
                'description' => '.text',
            ])
            ->uniqueOutputs('title')
    );
// Run crawler and process results
This is faster because, without a key, the crawler has to build a simple string key for each array (or object) output by serializing and hashing the whole array/object, and then uses that string key to check for uniqueness.
That string key is also the secret to how this works without bloating memory consumption. The step is still a Generator function, but internally it remembers only the string keys of the outputs it has already yielded, not the outputs themselves.
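The idea can be illustrated with a small, self-contained sketch. This is not the library's actual implementation: the function names uniqueOutputsSketch and buildUniqueKey are hypothetical, and using serialize() together with md5() is only one assumed way to do the "serializing and hashing" described above.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): yields every output only once.
 * Only the string keys of already yielded outputs are kept in memory.
 */
function uniqueOutputsSketch(iterable $outputs, ?string $key = null): Generator
{
    $seenKeys = []; // string key => true, for every output yielded so far

    foreach ($outputs as $output) {
        $uniqueKey = buildUniqueKey($output, $key);

        if (isset($seenKeys[$uniqueKey])) {
            continue; // an output with the same key was already yielded
        }

        $seenKeys[$uniqueKey] = true;

        yield $output;
    }
}

/**
 * Hypothetical helper: builds the string key used for the uniqueness check.
 * With a $key, only that single value is serialized and hashed instead of
 * the whole array/object (assumption: this is where the speed-up comes from).
 */
function buildUniqueKey(mixed $output, ?string $key): string
{
    if ($key !== null && is_array($output) && array_key_exists($key, $output)) {
        $output = $output[$key];
    } elseif ($key !== null && is_object($output) && property_exists($output, $key)) {
        $output = $output->{$key};
    }

    return md5(serialize($output));
}

// Usage with made-up example data: the third item repeats the first title.
$items = [
    ['title' => 'Foo', 'price' => '10.00'],
    ['title' => 'Bar', 'price' => '20.00'],
    ['title' => 'Foo', 'price' => '10.00'],
];

foreach (uniqueOutputsSketch($items, 'title') as $item) {
    echo $item['title'], PHP_EOL;
}
```

Because the function is a generator, outputs are still passed on one by one; the only state that grows is the list of short hash strings.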