Documentation for crwlr / crawler (v2.1)

The library comes with some pretty handy features to make your life easier while working on a crawler:

Step::maxOutputs()

The maxOutputs() method of the abstract Step class allows you to limit how many outputs a step will produce at most. When the limit is reached, any further call to invoke the step is stopped immediately, so it doesn't do unnecessary work.

use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler->input('https://www.crwlr.software/packages/crawler/v0.4/getting-started')
    ->addStep(Http::get())
    ->addStep(Html::getLinks('main nav a'))
    ->addStep(Http::get()->maxOutputs(10))
    ->addStep(Html::root()->extract(['title' => 'h1']));

So if you're building a crawler for a big source that will produce a lot of outputs, you can easily reduce the amount of data for test runs during development.
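
For example, you might gate the limit behind an environment check so it only applies during development. Here's a minimal sketch, assuming a hypothetical APP_ENV environment variable (the environment check itself is not part of the library):

use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

// Hypothetical: only cap the outputs when running locally during development.
$isDev = getenv('APP_ENV') === 'dev';

$httpStep = Http::get();

if ($isDev) {
    $httpStep->maxOutputs(10); // keep test runs small
}

$crawler->input('https://www.crwlr.software/packages/crawler/v0.4/getting-started')
    ->addStep(Http::get())
    ->addStep(Html::getLinks('main nav a'))
    ->addStep($httpStep)
    ->addStep(Html::root()->extract(['title' => 'h1']));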

Crawler::outputHook()

When running a crawler, you get either the composed result or the outputs of the last step as results. So if the results are not what you expect, you may want to check the outputs of the previous steps. To make debugging easier, you can use the outputHook() method to set a Closure that is called with every output of every step. To know which output comes from which step, the Closure also receives the step's index and the step object itself.

use Crwlr\Crawler\Output;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\StepInterface;

$crawler = new MyCrawler();

$crawler->input('https://www.crwlr.software/packages/crawler/v0.4/getting-started')
    ->addStep(Http::get())                      // stepIndex 0
    ->addStep(Html::getLinks('main nav a'))     // stepIndex 1
    ->addStep(Http::get())                      // stepIndex 2
    ->outputHook(function (Output $output, int $stepIndex, StepInterface $step) {
        if ($stepIndex === 1) {
            var_dump($output->get());
        }
    });
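
Inside the hook you're free to do whatever helps you debug. For instance, instead of dumping values you could collect every output grouped by step index and inspect them after the run. A minimal sketch (the $outputsByStep array is only an illustration, not part of the library):

use Crwlr\Crawler\Output;
use Crwlr\Crawler\Steps\StepInterface;

$outputsByStep = [];

$crawler->outputHook(function (Output $output, int $stepIndex, StepInterface $step) use (&$outputsByStep) {
    // Group the raw output values by step index for later inspection.
    $outputsByStep[$stepIndex][] = $output->get();
});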