Documentation for crwlr / crawler (v2.0)

Attention: You're currently viewing the documentation for v2.0 of the crawler package.
This is not the latest version of the package.

Grouping Steps

[Figure: Visualization of a step group. Combine multiple steps to act like one in a group step.]

Groups let you call two or more different steps with the same input. When invoked, a group step calls each of its steps one by one and combines their outputs into a single group output array.

Example: You may want to extract data from an HTML document using CSS selectors, and also get some data from JSON-LD structured data in a <script> block within the same document. No problem, just make a group:

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// $crawler is an already created crawler instance.
$crawler->input('https://www.example.com/blog-post-with-json-ld');

$crawler
    ->addStep(Http::get())
    ->addStep(
        Crawler::group()
            ->addStep(
                Html::first('#content article.blog-post')
                    ->extract([
                        'title' => 'h1',
                        'date' => '.date',
                    ])
            )
            ->addStep(
                Html::schemaOrg()
                    ->onlyType('BlogPosting')
                    ->extract([
                        'description',
                        'author' => 'author.name',
                    ])
            )
    );

Crawler::group() creates a Group object that you can add steps to, just as you do with the crawler itself. The Group object also implements the internal StepInterface, so it can be added to the crawler like any other step.

In the example above, both steps produce array output, which the group merges into one combined output array like:

[
    'title' => 'Blog post title',
    'date' => '2022-01-12',
    'description' => 'This is a very sophisticated blog post about rocket science.',
    'author' => 'Christian Olear',
]
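
The merge can be pictured with a few lines of plain PHP. This is only a simplified model under stated assumptions, not the library's implementation: runGroup() and the two closures standing in for the steps are hypothetical, and the sketch assumes that on duplicate keys the later step wins, which may differ from what the package does.

```php
<?php

/**
 * Simplified model of a group: every step receives the same input and
 * the array outputs are merged into one combined array.
 * runGroup() is a hypothetical helper, not part of crwlr/crawler.
 *
 * @param callable[] $steps
 */
function runGroup(array $steps, mixed $input): array
{
    $combined = [];

    foreach ($steps as $step) {
        // Each step is called with the original, unchanged input.
        // Assumption: on duplicate keys the later step wins (array_merge).
        $combined = array_merge($combined, $step($input));
    }

    return $combined;
}

// Stand-ins for the Html::first() and Html::schemaOrg() steps.
$cssStep = fn (string $html): array => [
    'title' => 'Blog post title',
    'date' => '2022-01-12',
];

$schemaOrgStep = fn (string $html): array => [
    'description' => 'This is a very sophisticated blog post about rocket science.',
    'author' => 'Christian Olear',
];

$combined = runGroup([$cssStep, $schemaOrgStep], '<html>...</html>');

print_r($combined);
```

Both closures ignore their input here; in a real group, each step would derive its output from the same loaded document.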

Assigning an Output Key to Scalar Output Steps

If you want to use a step that produces scalar (non-array) output in a group, you need to assign the key that its output value will have in the combined output array. You can do so by calling the outputKey() method on the step.

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Html;

Crawler::group()
    ->addStep(
        Html::first('article.jobAd')
            ->extract([
                'title' => 'h1',
                'location' => '.',
            ])
    )
    ->addStep(
        Html::getLink('#applyButton')
            ->outputKey('applyLink')    /* Assign key to the output value */
    );
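
To illustrate why a scalar output needs a key, here is a small plain PHP sketch. It is a conceptual model only: toGroupOutput() is a hypothetical helper, the values are made up for illustration, and throwing on a missing key is this sketch's own choice; the library may handle that case differently.

```php
<?php

/**
 * Hypothetical helper: wraps a scalar output under the given key so it
 * can be merged into the combined array. Array outputs pass through.
 */
function toGroupOutput(mixed $output, ?string $outputKey = null): array
{
    if (is_array($output)) {
        return $output;
    }

    if ($outputKey === null) {
        // Without a key there is no place for the value in the combined array.
        throw new InvalidArgumentException('Scalar output needs an output key.');
    }

    return [$outputKey => $output];
}

// Illustrative values only.
$combined = array_merge(
    toGroupOutput(['title' => 'PHP Developer', 'location' => 'Linz']),
    toGroupOutput('https://www.example.com/apply', 'applyLink'),
);

print_r($combined);
```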

Prevent Steps from Adding to the Combined Output

You can call the excludeFromGroupOutput() method on a step. The step still runs as usual, but any output it yields is not added to the combined group output.

This is useful when you don't need a step's output itself, but running the step is necessary preparation for the step that produces the relevant output.

use Crwlr\Crawler\Crawler;

Crawler::group()
    ->addStep(
        (new StepToPrepareSomething())->excludeFromGroupOutput()
    )
    ->addStep(
        (new StepWithRelevantOutput())
    );

If you don't need a step's output after the group, but a later step within the group needs it in some form, the following method has you covered:

Manipulate/Prepare the Original Input for Further Steps

Another method that is only useful within the context of a group is updateInputUsingOutput(). You will likely use it in combination with excludeFromGroupOutput(). Let's have a look at this:

use Crwlr\Crawler\Crawler;

Crawler::group()
    ->addStep(
        (new StepToPrepareSomething())
            ->excludeFromGroupOutput()
            ->updateInputUsingOutput(function (mixed $input, mixed $output) {
                // Use $output to adjust $input for the following steps in the group.

                return $input;
            })
    )
    ->addStep(
        (new StepWithRelevantOutput())
    );

As mentioned, by default all steps in a group receive the same input, which comes from the step before the group. By calling updateInputUsingOutput() on a step that exists only to prepare something, you can modify that original input for the following step(s) in the group: the callback receives the original input and the step's output, and the value it returns is used as the input for the steps that follow.
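
The interplay of excludeFromGroupOutput() and updateInputUsingOutput() can be sketched in plain PHP. Again, this is a simplified model and not the library's code: runGroupWithInputUpdates(), the array shape used to describe a step, and the token example are all hypothetical.

```php
<?php

/**
 * Simplified model of a group whose steps can update the shared input.
 * Each step is described by an array with:
 *  - 'run':         callable(mixed $input): array
 *  - 'exclude':     optional bool, keep the output out of the result
 *  - 'updateInput': optional callable(mixed $input, mixed $output): mixed
 */
function runGroupWithInputUpdates(array $steps, mixed $input): array
{
    $combined = [];

    foreach ($steps as $step) {
        $output = $step['run']($input);

        if (isset($step['updateInput'])) {
            // The returned value becomes the input of the following steps.
            $input = $step['updateInput']($input, $output);
        }

        if (empty($step['exclude'])) {
            $combined = array_merge($combined, $output);
        }
    }

    return $combined;
}

$steps = [
    [
        // Preparation step: its output is only used to extend the input.
        'run' => fn (array $input): array => ['token' => 'abc123'],
        'exclude' => true,
        'updateInput' => fn (array $input, array $output): array => $input + $output,
    ],
    [
        // Relevant step: sees the input extended by the preparation step.
        'run' => fn (array $input): array => ['requestedWith' => $input['token']],
    ],
];

$result = runGroupWithInputUpdates($steps, ['url' => 'https://www.example.com']);

print_r($result); // contains only 'requestedWith' => 'abc123'
```

The preparation step's 'token' never appears in the result, yet the second step can read it, because the callback merged it into the input.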