Composing Results
Sometimes the output of the last crawler step alone is not all the data you want to get from your crawler, and the final result has to be composed from different steps (or pages). For example, you may want to get jobs from a job listing where most of the data about the jobs is found on the job posting detail page, but the job location is only mentioned in the listing. This is why it's possible to compose results over multiple steps.
First, you should know that the Crawler internally wraps input and output data in Input and Output objects between the steps. But what you finally receive from the Crawler::run() method is a Result object.
If you don't define what should go into the result, the crawler simply converts the outputs of the last step to results.
If you explicitly define what a step should add to the final result, the crawler creates a Result object at the first step that adds something and carries it along with the Input and Output objects. The following steps can then add properties to that existing Result object.
Behaviour of Result objects in the data flow
If some step along the way yields multiple outputs, the Result object is passed on to all of those outputs, but only as a reference, so it remains a single Result object. At the end, the crawler gives you only that one Result object.
If a step adds data while the Result object is attached to multiple outputs, the values are added to the result property as an array.
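The behaviour described above can be modelled in a few lines of plain PHP. This is only a sketch: SketchResult is a hypothetical stand-in, not the library's actual Result class, and the values ('Vienna', 'PHP', 'SQL') as well as the 'skill' key are made-up sample data. It shows one result object being shared by several outputs, and a key that is set more than once turning into an array.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for the crawler's result object. This is NOT the
// library's Result class, only a model of the behaviour described above.
final class SketchResult
{
    /** @var array<string, mixed> */
    private array $data = [];

    /** @param mixed $value */
    public function set(string $key, $value): void
    {
        if (!array_key_exists($key, $this->data)) {
            $this->data[$key] = $value;                      // first value is stored as is
        } elseif (is_array($this->data[$key])) {
            $this->data[$key][] = $value;                    // later values are appended
        } else {
            $this->data[$key] = [$this->data[$key], $value]; // second value starts the array
        }
    }

    /** @return array<string, mixed> */
    public function toArray(): array
    {
        return $this->data;
    }
}

// The first step that adds something creates the result.
$result = new SketchResult();
$result->set('location', 'Vienna');

// A later step yields two outputs. Both carry the same object, because PHP
// copies object handles rather than the objects themselves.
$outputs = [
    ['value' => 'PHP', 'result' => $result],
    ['value' => 'SQL', 'result' => $result],
];

// The following step adds one value per output to the shared result.
foreach ($outputs as $output) {
    $output['result']->set('skill', $output['value']);
}

print_r($result->toArray());
```

In this model there is still only one result at the end: 'location' holds the single string, while 'skill' holds an array with both values, because it was added once per output.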
How to define results
There are two different ways to tell a step that it should add data to the final Result object.
For Steps with Array Output
Most steps that extract data yield arrays as output, so in most cases the way to go is the step's addKeysToResult() method.
use Crwlr\Crawler\Steps\Html;
$myCrawler->addStep(
    Html::each('.jobAd')
        ->extract(['title' => 'a', 'location' => '.location'])
        ->addKeysToResult()
);
This adds all keys of the step's output array to the final result.
If you need to extract some data only for the next step, but don't want to add it to the final result, you can pass only the keys you want to add:
use Crwlr\Crawler\Steps\Html;
$myCrawler->addStep(
    Html::each('.jobAd')
        ->extract([
            'title' => 'a',
            'location' => '.location',
            'salary' => '.salary',
        ])
        ->addKeysToResult(['title', 'location'])
);
For Steps with Scalar Output
If a step yields only a single value that is not an array, you add it to the final result via the setResultKey() method of the step.
use Crwlr\Crawler\Steps\Html;
$myCrawler->addStep(
    Html::getLink('#someLink')
        ->setResultKey('url')
);
Alternatively, you can pass the result key as the first argument to addStep():
use Crwlr\Crawler\Steps\Html;
$myCrawler->addStep('url', Html::getLink('#someLink'));