Documentation for crwlr / crawler (v0.4)

Attention: You're currently viewing the documentation for v0.4 of the crawler package.
This is not the latest version of the package.

Grouping Steps

Groups let you run two or more different steps on the same input. For example, you may want to extract data from an HTML document using CSS selectors and also get some data from JSON-LD structured data within the same document, using a custom step that you've built. No problem, just make a group:

$crawler->input('https://www.example.com/blog-post-with-json-ld');

$crawler->addStep(Http::get())
    ->addStep(
        Crawler::group()
            ->addStep(
                Html::first('#content article.blog-post')
                    ->extract(['title' => 'h1', 'date' => '.date'])
            )
            ->addStep(new StructuredDataBlogPost())
            ->combineToSingleOutput()
            ->addKeysToResult()
    );

$result = iterator_to_array($crawler->run());

Crawler::group() creates a Group object that you can add steps to, just as you would to the crawler itself. The Group object also implements the StepInterface, so it can be added to the crawler like any other step.

combineToSingleOutput()

The group step calls its steps one after the other, each with the same input that the group received from the previous step. By default, the steps in the group pass their outputs on separately to the next step after the group. For our example, this means we would get two separate outputs: one containing the title and date, and the other containing all the data we're extracting from the JSON-LD structured data in the document. Most of the time you'll want a single output combining all the data, which we can achieve by using combineToSingleOutput() on the group.

[Figure: Group default behaviour]
[Figure: When using combineToSingleOutput()]
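To make the difference concrete, here is a minimal sketch in plain PHP that models the two behaviours without using the library. The closures, the functions runGroup() and runGroupCombined(), and the placeholder values are all hypothetical. It also assumes that each step outputs an associative array and that combining works like array_merge; the library's actual merge rules may differ.

```php
<?php

declare(strict_types=1);

// Simplified model, NOT the library's code: each "step" is a closure that
// receives the group's input and returns an array of extracted data.
// The returned values are placeholders for illustration only.
$htmlStep = fn (string $input): array => ['title' => 'Example title', 'date' => '2022-05-01'];
$jsonLdStep = fn (string $input): array => ['author' => 'Jane Doe'];

/**
 * Models the default behaviour: every step gets the same input and
 * each output is passed on separately.
 */
function runGroup(array $steps, string $input): array
{
    $outputs = [];

    foreach ($steps as $step) {
        $outputs[] = $step($input);
    }

    return $outputs;
}

/**
 * Models combineToSingleOutput(): the separate outputs are merged into one.
 * Assumption: merging behaves like array_merge.
 */
function runGroupCombined(array $steps, string $input): array
{
    return [array_merge(...runGroup($steps, $input))];
}

$input = '<html>...</html>';

$separate = runGroup([$htmlStep, $jsonLdStep], $input);         // two outputs
$combined = runGroupCombined([$htmlStep, $jsonLdStep], $input); // one merged output
```

In this model, $separate holds two arrays (one per step), while $combined holds a single array with the keys title, date and author.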

Prevent steps from cascading their output to the next step

You can call the dontCascade() method on any step, although it probably only makes sense within the context of a group. The step still does what it usually does, but if it yields output, that output is not handed over to the next step.

This makes sense when you need to call something whose output you don't really need, but which is necessary as preparation for the actual step that should cascade its output.

Crawler::group()
    ->addStep(
        (new StepToPrepareSomething())
            ->dontCascade()
    )
    ->addStep(
        new StepThatCascadesItOutput()
    );
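The effect of dontCascade() can be sketched with a small plain-PHP model. This is not the library's implementation: the class ModelStep and the function runModelGroup() are hypothetical names invented for this illustration. The point is only that a non-cascading step still runs, but its output never leaves the group.

```php
<?php

declare(strict_types=1);

// Simplified model, NOT the library's implementation.
final class ModelStep
{
    private bool $cascades = true;

    public function __construct(private Closure $logic)
    {
    }

    public function dontCascade(): static
    {
        $this->cascades = false;

        return $this;
    }

    public function cascades(): bool
    {
        return $this->cascades;
    }

    public function run(mixed $input): mixed
    {
        return ($this->logic)($input);
    }
}

/**
 * Runs every step with the same input and collects only the outputs
 * of the steps that are allowed to cascade.
 */
function runModelGroup(array $steps, mixed $input): array
{
    $passedOn = [];

    foreach ($steps as $step) {
        $output = $step->run($input); // the step always runs

        if ($step->cascades()) {
            $passedOn[] = $output;    // only cascading outputs leave the group
        }
    }

    return $passedOn;
}

$log = [];

// Preparation step: has a side effect, but its output is not passed on.
$prepare = (new ModelStep(function (mixed $input) use (&$log) {
    $log[] = 'prepared';

    return 'not needed after the group';
}))->dontCascade();

$actual = new ModelStep(fn (mixed $input) => strtoupper($input));

$outputs = runModelGroup([$prepare, $actual], 'data');
```

Here $log shows that the preparation step was executed, while $outputs only contains the output of the second step.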

If you don't need a step's output for the next step after the group, but you do need it in some form for the next step within the group, we've got you covered:

Manipulate/Prepare the original input for further steps

Another method that is only useful within the context of a group is updateInputUsingOutput(). It is most likely to be useful in combination with dontCascade(). Let's have a look at this:

Crawler::group()
    ->addStep(
        (new StepToPrepareSomething())
            ->dontCascade()
            ->updateInputUsingOutput(function (mixed $input, mixed $output) {
                // use the $output to adjust the $input data for the following step(s)

                return $input;
            })
    )
    ->addStep(
        new StepThatCascadesItOutput()
    );

As mentioned, by default all the steps in a group receive the same input, which comes from the step before the group. By calling updateInputUsingOutput() on a step that is only there to prepare something for the step that actually delivers the needed data, you can further prepare the original input data for the following step(s).
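The interplay of dontCascade() and updateInputUsingOutput() can be sketched in plain PHP as follows. This is a simplified model under stated assumptions, not the library's code: the class PreparableStep, the function runPreparableGroup(), the token value and the input array shape are all hypothetical. The model assumes the callback's return value replaces the input for all following steps in the group.

```php
<?php

declare(strict_types=1);

// Simplified model, NOT the library's implementation.
final class PreparableStep
{
    private bool $cascades = true;

    private ?Closure $inputUpdater = null;

    public function __construct(private Closure $logic)
    {
    }

    public function dontCascade(): static
    {
        $this->cascades = false;

        return $this;
    }

    public function updateInputUsingOutput(Closure $updater): static
    {
        $this->inputUpdater = $updater;

        return $this;
    }

    public function cascades(): bool
    {
        return $this->cascades;
    }

    public function run(mixed $input): mixed
    {
        return ($this->logic)($input);
    }

    // Returns the input for the following steps: unchanged if no updater is set.
    public function updateInput(mixed $input, mixed $output): mixed
    {
        return $this->inputUpdater ? ($this->inputUpdater)($input, $output) : $input;
    }
}

function runPreparableGroup(array $steps, mixed $input): array
{
    $passedOn = [];

    foreach ($steps as $step) {
        $output = $step->run($input);

        if ($step->cascades()) {
            $passedOn[] = $output;
        }

        // Assumption: the updated input is used by all following steps.
        $input = $step->updateInput($input, $output);
    }

    return $passedOn;
}

// Preparation step: produces a (made-up) token that only the next step needs.
$prepare = (new PreparableStep(fn (mixed $input) => 'abc123'))
    ->dontCascade()
    ->updateInputUsingOutput(function (mixed $input, mixed $output) {
        $input['token'] = $output;

        return $input;
    });

$fetch = new PreparableStep(
    fn (mixed $input) => $input['url'] . '?token=' . $input['token']
);

$outputs = runPreparableGroup([$prepare, $fetch], ['url' => 'https://www.example.com/api']);
// $outputs contains a single entry: 'https://www.example.com/api?token=abc123'
```

The preparation step's own output never reaches the step after the group; it only shapes the input that the second step in the group receives.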