Looping Steps
There is one very typical use case where simple cascading steps would force you to build a custom step: pagination. Don't worry, there's a simpler solution to crawl paginated listings, namely loops.
```php
$crawler->input('https://www.example.com/listing');

// Loop through the result pages as long as there is a link
// with id nextPage
$crawler->addStep(
    Crawler::loop(Http::get())
        ->withInput(Html::getLink('#nextPage'))
);

// Further parsing of the listing pages and extracting items
$crawler->addStep('url', Html::getLinks('#listing .item a'))
    ->addStep(Http::get())
    ->addStep(
        Html::first('.someElement')
            ->extract([
                'title' => 'h1',
                'id' => '#itemId',
            ])
            ->addKeysToResult()
    );

foreach ($crawler->run() as $result) {
    // Do something with the results.
}
```
Let's take a closer look at the loop step in this example:
Wrapping any step in a loop step using Crawler::loop() will loop it with its own output until the step doesn't yield any output anymore.
```php
Crawler::loop(Http::get());
```
For the pagination case, looping the step with its own output alone is still not enough, because the Http step needs a URL as input but yields response objects as output.
The withInput() hook
Here the withInput() hook comes to the rescue. It takes a Closure or even another step as callback. Whenever the loop step yields output, it first calls that callback/step with the output and passes the callback's result back to the loop step as the input for the next iteration. When the callback returns null, the loop stops.
```php
Crawler::loop(Http::get())
    ->withInput(Html::getLink('#nextPage'))
```
So, the example uses a step to get the link with id nextPage from the loaded page. If there is no such link, the loop stops.
As mentioned, you can also just use a Closure:
```php
Crawler::loop(Http::get())
    ->withInput(function (mixed $input, mixed $output) {
        // $input is the original input of the loop step,
        // and $output is, of course, the output that it
        // yielded. The callback is also bound to the
        // underlying step that is being looped, so you can
        // use the logger via $this->logger.

        // Return whatever you need to pass as new input
        // to the loop step.
    })
```
As you can see, the withInput callback receives not only the output, but also the original input that the loop step was called with. This is useful, for example, when you want to keep some kind of state in the input.
Example: let's assume you have a custom step before the loop step that reads a list of categories you then want to loop through. That custom step could yield a custom class with a __toString method returning the URL of the current category. In the withInput callback you set the pointer of that class to the next category item and pass the object on to the next iteration of the loop step. If there is no next category, you return null from the withInput callback, so the loop stops.
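Such a stateful input class could look roughly like this. Note that this is just a sketch: the class name, the next() method, and the URLs are made up for illustration; only the __toString contract mentioned above comes from the example.

```php
<?php

// Hypothetical stateful input class, holding a list of
// category URLs and a pointer to the current one.
class CategoryCursor
{
    private int $position = 0;

    /**
     * @param string[] $urls
     */
    public function __construct(private array $urls)
    {
    }

    // The loaded step gets the URL for the current category
    // by casting the input object to a string.
    public function __toString(): string
    {
        return $this->urls[$this->position];
    }

    // Advances the pointer and returns the cursor, or null
    // when there is no next category.
    public function next(): ?self
    {
        if ($this->position + 1 >= count($this->urls)) {
            return null;
        }

        $this->position++;

        return $this;
    }
}

// In the withInput() callback you would then do something like:
// Crawler::loop(Http::get())
//     ->withInput(fn (mixed $input, mixed $output) => $input->next());
```

Returning $input->next() hands the advanced cursor to the next loop iteration, and once next() returns null, the loop stops.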
keepLoopingWithoutOutput()
Let's add a detail to the example above: some category pages can return 404 responses. Let's assume that's normal, because the categories you're getting from the previous step aren't guaranteed to exist. Normally a 404 would make the loop stop, because when there is no output, the loop assumes it is finished. In this case, just use keepLoopingWithoutOutput():
```php
Crawler::loop(Http::get())
    ->withInput(function (mixed $input, mixed $output) {
        // When an iteration actually has no output, this
        // callback is still called; the $output argument
        // is null in that case.
        // Returning null from here still stops the loop.
    })
    ->keepLoopingWithoutOutput();
```
stopIf()
As you already know, all the outputs from the step that is being looped are passed on as inputs to the next step, just like with any normal step. And you can manually stop a loop by returning null from the withInput hook callback. But the last output that triggers the callback is still passed on to the next step, even though it's not used for another loop iteration. If you want to prevent this, you can add a callback using the stopIf method:
```php
Crawler::loop(Http::get())
    ->stopIf(function (mixed $input, mixed $output) {
        $responseContent = $output->response->getBody()->getContents();

        // An important thing to know: always rewind response
        // body streams after reading the content if you'll
        // need it again somewhere else. When using it again
        // without rewinding, you'll just get an empty string.
        $output->response->getBody()->rewind();

        return $responseContent === '{ "success": false }';
    })
    ->keepLoopingWithoutOutput();
```
Prevent infinite loops with a max iterations limit
It's easy to somehow end up in an infinite loop, and it may not even be your fault. You can define that the loop should stop when there is no link with id nextPage on a loaded page, and it currently works. Then the site owners decide to add a nextPage link on the last page, linking to that same last page again, or something like that.
Therefore, it's good to set a limit defining how many times the loop is allowed to iterate at most. If you don't set anything else, the default limit is 1000. You can set your own limit using maxIterations():
```php
Crawler::loop(Http::get())
    ->withInput(Html::getLink('#nextPage'))
    ->maxIterations(40000);
```
Defer cascading outputs to the next step
As this library works with Generators, an output from one step may well be passed to the next step before the loop has finished. If you need the loop to wait until it's done looping and only then pass all its outputs on to the next step, use cascadeWhenFinished():
```php
Crawler::loop(Http::get())
    ->withInput(Html::getLink('#nextPage'))
    ->cascadeWhenFinished();
```