What's new in crwlr / crawler v0.2 and v0.3
There are already two new 0.x versions of the crawler package. Here's a quick summary of what's new in versions 0.2 and 0.3.
v0.2.0
uniqueOutputs() Step Method
Sometimes you'll have data sources containing the same items multiple times, but you don't want to have duplicates in your results. By calling uniqueOutputs() on any step, you can now very easily prevent duplicate outputs, even though the steps are still generator functions.
$crawler = new MyCrawler();

$crawler
    ->input('https://example.com/listing')
    ->addStep(Http::get())
    ->addStep(
        Html::getLinks('.item a')
            ->uniqueOutputs()
    );
// Run crawler and process results
When the output of a step is an array or even an object, you can also define a key on that array/object that can be used to check for uniqueness.
$crawler = new MyCrawler();

$crawler
    ->input('https://example.com/listing')
    ->addStep(Http::get())
    ->addStep(
        Html::each('.item')
            ->extract([
                'title' => 'h3',
                'price' => '.productPrice',
                'description' => '.text'
            ])
            ->uniqueOutputs('title')
    );
// Run crawler and process results
Defining a key can also improve performance, because otherwise the step has to serialize and hash every array/object to create a simple string key for the uniqueness check.
This is also how it works despite the steps being generator functions: the step internally remembers the keys it has already yielded. This memory is reset when the crawler run is finished.
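As an illustration, here is a minimal, simplified sketch of that idea (not the library's actual implementation; the function name and parameters are made up for this example):

function yieldUniqueOutputs(iterable $outputs, ?string $key = null): \Generator
{
    $yieldedKeys = []; // reset with every new run

    foreach ($outputs as $output) {
        // Use the defined key if the output is an array containing it, otherwise
        // serialize and hash the whole output to get a simple string key.
        $uniqueKey = ($key !== null && is_array($output) && isset($output[$key]))
            ? (string) $output[$key]
            : md5(serialize($output));

        if (!isset($yieldedKeys[$uniqueKey])) {
            $yieldedKeys[$uniqueKey] = true;

            yield $output;
        }
    }
}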
runAndTraverse() on the Crawler
As a result of using generators, you need to iterate the results that the run() method returns, otherwise nothing will happen when calling that method.
But often you won't actually need to do anything with the results at the point where you call the crawler, because you've set a store that saves them, or maybe the crawler just needs to call some URLs and you don't need any results at all.
So, to avoid loops with an empty body, calls to PHP's iterator_to_array(), or things like that, you can now use runAndTraverse():
$myCrawler->runAndTraverse();
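For comparison, a quick sketch of both approaches, assuming your crawler already persists the results via a store, so they aren't needed at the call site:

// Before: iterate the generator returned by run(), even though the loop body is empty.
foreach ($myCrawler->run() as $result) {
    // nothing to do here
}

// Now: simply traverse all results without doing anything with them.
$myCrawler->runAndTraverse();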
v0.3.0
Monitoring memory usage
The library is built to be as memory-efficient as possible, but as crawlers typically deal with vast amounts of data, you can still hit memory limits. When that happens and you're not really sure why, you can now tell the crawler to log messages about its current memory usage after every step invocation, to get a hint at what's causing it:
$crawler->monitorMemoryUsage();
Or if it should only log messages once the memory usage exceeds X bytes:
$crawler->monitorMemoryUsage(1000000000);
This way it won't pollute your logs when it's not really necessary.
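If you prefer a more readable way to write the threshold, PHP's numeric literal separator (available since PHP 7.4) works here too; the argument is still just the number of bytes mentioned above:

$crawler->monitorMemoryUsage(1_000_000_000); // same threshold as above, roughly 1 GB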
Fixes
Both new versions also contain a few fixes and improvements; in particular, v0.3 fixes how generators are used internally, so they are as memory-efficient as possible.