Grouping Steps
Groups let you run two or more different steps on the same input. For example, you may want to extract data from an HTML document using CSS selectors and also get some data from JSON-LD structured data within the same document, using a custom step that you've built. No problem, just make a group:
use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// $crawler is your crawler instance; StructuredDataBlogPost is a custom step you've built.
$crawler->input('https://www.example.com/blog-post-with-json-ld');

$crawler->addStep(Http::get())
    ->addStep(
        Crawler::group()
            ->addStep(
                Html::first('#content article.blog-post')
                    ->extract(['title' => 'h1', 'date' => '.date'])
            )
            ->addStep(new StructuredDataBlogPost())
            ->combineToSingleOutput()
            ->addKeysToResult()
    );

$result = iterator_to_array($crawler->run());
Crawler::group() creates a Group object that you can add steps to, just as you would to the crawler itself. The Group object also implements the StepInterface, so it can be added to the crawler like any other step.
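The StructuredDataBlogPost step in the example above is a custom step that isn't shown here. As a rough idea of the extraction logic such a step might wrap, here is a minimal sketch in plain PHP. The function name extractJsonLdByType() and the sample HTML are hypothetical and not part of the library. The sketch only handles top-level objects with a single string @type; it ignores @graph containers and arrays of types.

```php
<?php

/**
 * Hypothetical helper (not part of the library): returns all JSON-LD
 * objects of the given @type found in an HTML string.
 *
 * @return array<int, array<string, mixed>>
 */
function extractJsonLdByType(string $html, string $type = 'BlogPosting'): array
{
    // Capture the content of every <script type="application/ld+json"> element.
    $pattern = '~<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>~is';

    preg_match_all($pattern, $html, $matches);

    $found = [];

    foreach ($matches[1] as $json) {
        $data = json_decode(trim($json), true);

        // Skip blocks that are invalid JSON or not an object of the wanted type.
        if (is_array($data) && ($data['@type'] ?? null) === $type) {
            $found[] = $data;
        }
    }

    return $found;
}

// Made-up sample document for illustration.
$html = <<<'HTML'
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "BlogPosting", "headline": "Hello", "author": {"name": "Jane"}}
</script>
</head><body></body></html>
HTML;

$posts = extractJsonLdByType($html);
```

A custom step could call a helper like this on the response body it receives and yield the values it needs as its output.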
combineToSingleOutput()
The group calls its steps one after the other, each with the same input: the output of the step before the group. By default, the steps in the group pass on their outputs separately to the next step after the group. In our example this means we would get two separate outputs: one containing the title and date, and the other containing all the data we're extracting from the JSON-LD structured data in the document. Most of the time you'll want a single output combining all the data, which you can achieve by calling combineToSingleOutput() on the group.
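To make the difference concrete, here is a small illustration in plain PHP arrays. The values and the JSON-LD keys (author, keywords) are made up, and array_merge() only models the idea of combining. It is not meant to describe how the library merges outputs internally, for example when both steps yield the same key.

```php
<?php

// Hypothetical outputs of the two steps in the group for one input document.
$htmlOutput = ['title' => 'Some Blog Post', 'date' => '2022-05-01'];
$jsonLdOutput = ['author' => 'Jane Doe', 'keywords' => 'crawling, php'];

// Without combineToSingleOutput(): the next step after the group
// receives two separate outputs.
$separateOutputs = [$htmlOutput, $jsonLdOutput];

// With combineToSingleOutput(): one output carrying the data of both steps.
// array_merge() is only a stand-in for the combining here.
$combinedOutput = array_merge($htmlOutput, $jsonLdOutput);
```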
Prevent steps from cascading their output to the next step
You can actually use this on any step, but it probably only makes sense within the context of a group. When you call the dontCascade() method on a step, it does what it usually does, but any output it yields is not handed over to the next step.
This makes sense when you need to call something whose output you don't really need, but which is necessary as preparation for the actual step that should cascade its output.
use Crwlr\Crawler\Crawler;

Crawler::group()
    ->addStep(
        // Runs as usual, but its output is not handed over to the next step.
        (new StepToPrepareSomething())
            ->dontCascade()
    )
    ->addStep(
        new StepThatCascadesItOutput()
    );
If you don't need a step's output for the next step after the group, but you do need it in some form for the next step within the group, we've got you covered:
Manipulate/Prepare the original input for further steps
Another method that is only useful within the context of a group is updateInputUsingOutput(). It is most likely to be useful in combination with dontCascade(). Let's have a look at this:
use Crwlr\Crawler\Crawler;

Crawler::group()
    ->addStep(
        (new StepToPrepareSomething())
            ->dontCascade()
            ->updateInputUsingOutput(function (mixed $input, mixed $output) {
                // Use $output to adjust $input. The returned value becomes
                // the input for the following step(s) in the group.
                return $input;
            })
    )
    ->addStep(
        new StepThatCascadesItOutput()
    );
As mentioned, by default all the steps in a group receive the same input, coming from the step before the group. By calling updateInputUsingOutput() on a step that is only there to prepare something for the step that actually delivers the needed data, you can further prepare the original input data for the following step(s) in the group.
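The interplay of dontCascade() and updateInputUsingOutput() can be sketched as a small conceptual model in plain PHP. This is not the library's implementation: runGroupModel(), the array-based step definitions, and the token example are all hypothetical, and they only mirror the behavior described above (same input for every step, optional cascading, optional input update).

```php
<?php

/**
 * Conceptual model of a group, not the library's code.
 *
 * Each step is an array with:
 *  - 'run': callable(mixed $input): array of outputs
 *  - 'cascades': bool, whether outputs are passed on after the group
 *  - 'updateInput': optional callable(mixed $input, mixed $output): mixed
 */
function runGroupModel(array $steps, mixed $input): array
{
    $cascaded = [];

    foreach ($steps as $step) {
        foreach ($step['run']($input) as $output) {
            // Models dontCascade(): only cascading steps pass output on.
            if ($step['cascades']) {
                $cascaded[] = $output;
            }

            // Models updateInputUsingOutput(): later steps see the new input.
            if (isset($step['updateInput'])) {
                $input = $step['updateInput']($input, $output);
            }
        }
    }

    return $cascaded;
}

$steps = [
    [
        // Preparation step: provides a (made-up) token and does not cascade.
        'run' => fn (mixed $input): array => [['token' => 'abc123']],
        'cascades' => false,
        'updateInput' => fn (mixed $input, mixed $output): array => $input + $output,
    ],
    [
        // Actual step: uses the prepared input and cascades its output.
        'run' => fn (mixed $input): array => [
            ['url' => $input['url'] . '?token=' . $input['token']],
        ],
        'cascades' => true,
    ],
];

$outputs = runGroupModel($steps, ['url' => 'https://www.example.com/data']);
// $outputs holds only the second step's output; the token never leaves the group.
```

In this model the preparation step's output never reaches the step after the group, yet the second step in the group can still use it, because the callback merged it into the shared input.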