Building Custom Steps
When you need your crawler to do something that is not covered by an included step, just build your own. A custom step needs to implement the StepInterface, but for convenience you can simply extend the abstract Step class, so you don't have to worry about all the methods that the crawler needs internally. What you need to define yourself is the protected invoke() method.
use Crwlr\Crawler\Steps\Step;

class MyStep extends Step
{
    protected function invoke(mixed $input): Generator
    {
        // Implement what the step should do.
    }
}
What's coming in as $input is either one of the input values you manually defined, if this is the first step in your crawler, or one of the outputs of the step that is executed before this one.
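To illustrate how outputs become inputs, here is a minimal sketch in plain PHP. It does not use the library at all: runPipeline() and the two closures are hypothetical stand-ins, and the real crawler does more than this. It only shows that every value yielded by one step is handed, one at a time, to the next step.

```php
/**
 * Hypothetical helper (not part of the library): runs each input through
 * the given steps in order. Every output of one step becomes a separate
 * input for the next step.
 *
 * @param iterable<mixed> $inputs
 * @param array<int, callable> $steps each callable takes one input and returns a Generator
 */
function runPipeline(iterable $inputs, array $steps): Generator
{
    // No steps left: whatever arrives here is a final output.
    if ($steps === []) {
        foreach ($inputs as $input) {
            yield $input;
        }

        return;
    }

    $step = array_shift($steps);

    foreach ($inputs as $input) {
        // Call the current step with a single input and feed each of its
        // outputs into the remaining steps.
        foreach (runPipeline($step($input), $steps) as $output) {
            yield $output;
        }
    }
}

// First stand-in step: one input string, one output per word.
$splitWords = function (mixed $input): Generator {
    foreach (explode(' ', $input) as $word) {
        yield $word;
    }
};

// Second stand-in step: receives each word separately.
$toUpper = function (mixed $input): Generator {
    yield strtoupper($input);
};

$outputs = iterator_to_array(
    runPipeline(['foo bar', 'baz'], [$splitWords, $toUpper]),
    false
);
// $outputs is ['FOO', 'BAR', 'BAZ']
```

The first step sees the manually defined inputs 'foo bar' and 'baz'; the second step never sees those, only the individual words yielded by the first.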
Validating and Sanitizing Input
So, theoretically, $input could be anything, which is why you can also add your own validateAndSanitizeInput() method. There you can validate whether the step can deal with the input (and otherwise throw an InvalidArgumentException) and also sanitize it, so that in the invoke() method you know what's inside $input.
Let's assume the step does something with an HTML document and therefore wants to get an instance of the Symfony DomCrawler. The HTML source code could be delivered in various ways, e.g. in a PSR-7 Response object or simply as a string.
use Crwlr\Crawler\Loader\Http\Messages\RespondedRequest;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Step;
use Psr\Http\Message\ResponseInterface;
use Symfony\Component\DomCrawler\Crawler;

class MyStep extends Step
{
    protected function validateAndSanitizeInput(mixed $input): mixed
    {
        if (is_string($input)) {
            return new Crawler($input);
        }

        if ($input instanceof ResponseInterface || $input instanceof RespondedRequest) {
            // Avoid using ->getBody()->getContents() directly, because if it
            // is used again at a later point you'd first have to rewind the
            // stream to get the body again.
            // Better always use this Http::getBodyString() helper method to
            // get the body as string from an HTTP message.
            return new Crawler(Http::getBodyString($input));
        }

        throw new InvalidArgumentException('Input must be string, PSR-7 Response or RespondedRequest.');
    }

    /**
     * @param Crawler $input
     * @return Generator
     */
    protected function invoke(mixed $input): Generator
    {
        // Implement what the step should do.
    }
}
The abstract Step class takes care of internally calling both methods and handing over the return value of the validateAndSanitizeInput() method to the invoke() method when the crawler calls the step.
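The following sketch shows that hand-over in isolation. SketchStep is a simplified, hypothetical stand-in written only for this illustration; it is not the library's actual Step class, which does considerably more. The method name run() and the example class TrimmedLengthStep are assumptions as well.

```php
/**
 * Simplified stand-in for an abstract step base class.
 * Hypothetical: only the validate-then-invoke order is modelled here.
 */
abstract class SketchStep
{
    /** Called by the (imaginary) crawler with the raw input. */
    final public function run(mixed $input): Generator
    {
        // 1. Validate and sanitize first; this may throw.
        $sanitized = $this->validateAndSanitizeInput($input);

        // 2. Hand the sanitized value to invoke() and forward its outputs.
        yield from $this->invoke($sanitized);
    }

    /** Default behaviour in this sketch: accept anything unchanged. */
    protected function validateAndSanitizeInput(mixed $input): mixed
    {
        return $input;
    }

    abstract protected function invoke(mixed $input): Generator;
}

class TrimmedLengthStep extends SketchStep
{
    protected function validateAndSanitizeInput(mixed $input): mixed
    {
        if (!is_string($input)) {
            throw new InvalidArgumentException('Input must be a string.');
        }

        return trim($input);
    }

    /**
     * @param string $input already trimmed by validateAndSanitizeInput()
     */
    protected function invoke(mixed $input): Generator
    {
        yield strlen($input);
    }
}

$step = new TrimmedLengthStep();

$outputs = iterator_to_array($step->run('  hello  '), false);
// $outputs is [5], because invoke() received the trimmed string 'hello'.
```

Because run() is itself a generator function in this sketch, nothing happens until the result is iterated. An invalid input such as 42 therefore only triggers the InvalidArgumentException once iteration starts.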
Yielding Output
If you're not familiar with PHP generators, you can read about them in the PHP manual.
Assuming you want to make a step that splits a string into separate lines and passes each line as a separate output to the next step (where it arrives as a separate input), it would look like this:
use Crwlr\Crawler\Steps\Step;

class MyStep extends Step
{
    /**
     * @param string $input
     * @return Generator
     */
    protected function invoke(mixed $input): Generator
    {
        foreach (explode(PHP_EOL, $input) as $line) {
            yield $line;
        }
    }
}
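One thing to keep in mind: PHP_EOL depends on the platform PHP runs on ("\n" on Unix-like systems, "\r\n" on Windows), so splitting on it can leave stray "\r" characters when the text comes from somewhere else, such as an HTTP response. As a possible variation, here is a sketch of the same yielding logic as a stand-alone function that handles all three common line-ending styles. The function name yieldLines() is hypothetical; inside a step, the same loop could live in invoke().

```php
/**
 * Hypothetical stand-alone function (outside any step class) that yields
 * each line of a string, regardless of the line-ending style.
 */
function yieldLines(string $text): Generator
{
    // Split on Windows (\r\n), old Mac (\r) and Unix (\n) line endings.
    // \r\n comes first so it is consumed as a single separator.
    $lines = preg_split('/\r\n|\r|\n/', $text) ?: [];

    foreach ($lines as $line) {
        yield $line;
    }
}

$lines = iterator_to_array(yieldLines("first\r\nsecond\nthird"), false);
// $lines is ['first', 'second', 'third']
```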