Documentation for crwlr / crawler (v3.2)

Steps and Data Flow

Steps are the fundamental building blocks for your crawlers. There are many ready-to-use steps available, and you can also create custom ones. Crawlers accept any class that implements the StepInterface as a step.

When a crawler runs, it calls its steps one after another, passing each step its input. You typically define the initial inputs for the first step yourself. Most often, these are one or more URLs to fetch.

$crawler = new MyCrawler();

$crawler->inputs([
    'https://www.crwlr.software/packages/url',
    'https://www.crwlr.software/packages/robots-txt'
]);

From a single input it receives, any step can produce zero, one, or multiple outputs.

[Figure: A step yielding one output]
[Figure: A step without output]
[Figure: A step yielding multiple outputs]
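
The three cases can be pictured with plain PHP generators. This is a minimal sketch that does not use the library: the function names `oneOutput`, `noOutput`, and `multipleOutputs` are hypothetical and only illustrate how a single input can lead to one, zero, or several outputs.

```php
<?php

// Hypothetical step-like functions, not part of crwlr/crawler.
// Each one receives a single input and yields its outputs.

function oneOutput(string $input): Generator
{
    // Exactly one output per input.
    yield strtoupper($input);
}

function noOutput(string $input): Generator
{
    // Yields only for non-empty input, so an empty string produces no output.
    if ($input !== '') {
        yield $input;
    }
}

function multipleOutputs(string $input): Generator
{
    // One output per comma-separated part of the input.
    foreach (explode(',', $input) as $part) {
        yield trim($part);
    }
}

$one  = iterator_to_array(oneOutput('a'), false);             // ['A']
$none = iterator_to_array(noOutput(''), false);               // []
$many = iterator_to_array(multipleOutputs('a, b, c'), false); // ['a', 'b', 'c']
```

A generator that never reaches a `yield` simply produces nothing, which is one way to picture a step that ends the flow for a given input.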

Subsequent steps added to the crawler are called with the outputs of the previous step as their inputs.

[Animation: how and when output is converted to input again for the next step]

So, the data (inputs and outputs) flows through the crawling procedure, cascading from one step to the next.
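
The cascade can be sketched in plain PHP without the library. The `cascade` helper and the three step functions below are hypothetical names chosen for illustration. The sketch assumes each step is a function that takes one input and yields its outputs, and it is not a description of how the crawler is implemented internally.

```php
<?php

// Hypothetical helper, not the library's implementation: calls the first step
// with every input, then feeds each of its outputs to the remaining steps.
function cascade(iterable $inputs, callable ...$steps): Generator
{
    if ($steps === []) {
        // No steps left: whatever arrives here is a final output.
        foreach ($inputs as $output) {
            yield $output;
        }
        return;
    }

    $first = array_shift($steps);

    foreach ($inputs as $input) {
        // The outputs of this step become the inputs of the next one.
        yield from cascade($first($input), ...$steps);
    }
}

// Yields multiple outputs: one per word.
function splitWords(string $input): Generator
{
    foreach (explode(' ', $input) as $word) {
        yield $word;
    }
}

// Yields zero or one output: words of two characters or fewer are dropped.
function dropShort(string $input): Generator
{
    if (strlen($input) > 2) {
        yield $input;
    }
}

// Yields exactly one output.
function upper(string $input): Generator
{
    yield strtoupper($input);
}

$results = iterator_to_array(
    cascade(['to the url', 'robots txt'], 'splitWords', 'dropShort', 'upper'),
    false // discard generator keys, which repeat across nested generators
);
// $results is ['THE', 'URL', 'ROBOTS', 'TXT']
```

Note that `to` never reaches `upper`: a step that yields nothing for an input stops the flow for that input, while the other inputs continue through the remaining steps.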

Example


use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler->inputs([
    'https://www.crwlr.software/packages/url',
    'https://www.crwlr.software/packages/robots-txt'
]);

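$crawler->addStep(Http::get())                  // load each initial input URL
    ->addStep(Html::getLinks('#versions a'))    // output the links matching the selector
    ->addStep(Http::get())                      // load each of those links
    ->addStep(
        // from the first <article> element, extract the h1 text as 'title'
        Html::first('article')
            ->extract(['title' => 'h1'])
    );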

foreach ($crawler->run() as $result) {
    // do something with result
}

[Figure: Visualization showing the complete data flow through a whole crawling procedure with multiple steps]