Custom Steps
Creating a Custom Step Class
When you need your crawler to perform a task not covered by any included step, you can easily build your own. Your custom step class needs to extend the abstract Crwlr\Crawler\Steps\Step class, and you need to implement the invoke() and outputType() methods.
use Crwlr\Crawler\Steps\Step;
use Crwlr\Crawler\Steps\StepOutputType;

class MyStep extends Step
{
    public function outputType(): StepOutputType
    {
        return StepOutputType::Scalar;
    }

    protected function invoke(mixed $input): Generator
    {
        // Implement what the step should do and yield output values.
        yield 'foo';
    }
}
Both methods are described in more detail below.
Yielding Step Output Data Using Generators
As you can see in the invoke() method, instead of returning the output values, we use the yield keyword to pass them on. If you're not familiar with PHP generators, you can read our quickstart tutorial on PHP generators.
For instance, a step that splits a string into separate lines and yields each line as a separate output (which then becomes a separate input for the next step) could look like this:
use Crwlr\Crawler\Steps\Step;
use Crwlr\Crawler\Steps\StepOutputType;

class MyStep extends Step
{
    public function outputType(): StepOutputType
    {
        return StepOutputType::Scalar;
    }

    /**
     * @param string $input
     * @return Generator
     */
    protected function invoke(mixed $input): Generator
    {
        // Each line is yielded as its own output value.
        foreach (explode(PHP_EOL, $input) as $line) {
            yield $line;
        }
    }
}
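To illustrate why yielding fits this model, here is a minimal, library-independent sketch of how yielded values can be handed on one at a time. The functions splitLines(), toUpper() and chain() are hypothetical names made up for this illustration. This is not how the crawler wires steps together internally, only the plain PHP generator mechanics behind the idea.

```php
/**
 * Hypothetical stand-in for a step's invoke() method:
 * yields one line at a time instead of returning an array.
 */
function splitLines(string $input): Generator
{
    foreach (explode("\n", $input) as $line) {
        yield $line;
    }
}

/**
 * Hypothetical second "step": receives a single output of the previous one.
 */
function toUpper(string $input): Generator
{
    yield strtoupper($input);
}

/**
 * Feeds every value from $outputs into $next, one value at a time,
 * and passes on whatever $next yields.
 */
function chain(iterable $outputs, callable $next): Generator
{
    foreach ($outputs as $output) {
        yield from $next($output);
    }
}

// preserve_keys = false, because "yield from" reuses the inner generators' keys.
$results = iterator_to_array(chain(splitLines("foo\nbar\nbaz"), 'toUpper'), false);
```

Because chain() pulls values with foreach, toUpper() already receives the first line before splitLines() has yielded the remaining ones.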
Step Output Types
Each step must also implement the outputType() method, returning a Crwlr\Crawler\Steps\StepOutputType enum. There are three options: StepOutputType::Scalar, StepOutputType::AssociativeArrayOrObject, or StepOutputType::Mixed.
Knowing which types of output a step can yield allows the crawler to detect misconfigurations (such as using the wrong keep methods on steps) early, before it starts crawling. This prevents errors that would otherwise surface only after the crawler has been running for some time.
Decide the output type this way:
- If you know your step will only yield associative arrays or objects, return StepOutputType::AssociativeArrayOrObject.
- If it will only yield scalar values, like string, int, float, bool, return StepOutputType::Scalar.
- If it could yield either scalar or non-scalar values based on the instance's state, return the corresponding type from the outputType() method based on the current state of the instance (see example below).
- If it's not possible to determine the output type, e.g., because it also depends on the inputs it is called with, return StepOutputType::Mixed.
Here's an example of an outputType() implementation that determines the output type based on the state of the step instance:
use Crwlr\Crawler\Steps\Step;
use Crwlr\Crawler\Steps\StepOutputType;

class MyStep extends Step
{
    public bool $yieldsScalarValues = true;

    public function yieldScalarValues(): self
    {
        $this->yieldsScalarValues = true;

        return $this;
    }

    public function yieldAssociativeArrays(): self
    {
        $this->yieldsScalarValues = false;

        return $this;
    }

    public function outputType(): StepOutputType
    {
        // The reported type follows the current configuration of the instance.
        if ($this->yieldsScalarValues) {
            return StepOutputType::Scalar;
        }

        return StepOutputType::AssociativeArrayOrObject;
    }

    protected function invoke(mixed $input): Generator
    {
        if ($this->yieldsScalarValues) {
            yield 'foo';
        } else {
            yield ['foo' => 'bar'];
        }
    }
}
Validating and Sanitizing Input
The $input argument of the invoke() method is either an initial input value you manually defined, if this is the first step in your crawler, or an output value from the preceding step. So, theoretically, it can be any value. In order to build your step for reusability, you can implement a validateAndSanitizeInput() method. This method allows you to validate whether the step can handle the input (throwing an InvalidArgumentException if it can't) and sanitize it, ensuring the invoke() method receives a predictable input.
Let's assume the step processes an HTML document and requires an instance of the Symfony DomCrawler. The HTML source code string could be delivered in various formats, such as a PSR-7 Response object or a plain string.
Here’s an example:
use Crwlr\Crawler\Loader\Http\Messages\RespondedRequest;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Step;
use Crwlr\Crawler\Steps\StepOutputType;
use Psr\Http\Message\ResponseInterface;
use Symfony\Component\DomCrawler\Crawler;

class MyStep extends Step
{
    public function outputType(): StepOutputType
    {
        // Placeholder: the actual type depends on what invoke() yields.
        return StepOutputType::Mixed;
    }

    protected function validateAndSanitizeInput(mixed $input): mixed
    {
        if (is_string($input)) {
            return new Crawler($input);
        }

        if ($input instanceof ResponseInterface || $input instanceof RespondedRequest) {
            // Avoid using ->getBody()->getContents() directly, as you would
            // need to rewind the stream to retrieve the body again later.
            // Instead, use the Http::getBodyString() helper method to get
            // the body as a string from an HTTP message.
            return new Crawler(Http::getBodyString($input));
        }

        throw new InvalidArgumentException('Input must be string, PSR-7 Response or RespondedRequest.');
    }

    /**
     * @param Crawler $input
     * @return Generator
     */
    protected function invoke(mixed $input): Generator
    {
        // Implement the step's functionality here.
    }
}
The abstract Step class ensures that both methods are called internally. It passes the return value of the validateAndSanitizeInput() method to the invoke() method when the crawler calls the step.
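The hand-off between the two methods can be pictured with a small, self-contained sketch. The SketchStep base class and its run() method are hypothetical and heavily simplified; they only stand in for the library's abstract Step class and do not reproduce its actual implementation. TrimmedLinesStep is likewise a made-up example step.

```php
/**
 * Hypothetical, simplified stand-in for the library's abstract Step class.
 */
abstract class SketchStep
{
    /**
     * Sanitizes the input first, then hands the result to invoke().
     */
    final public function run(mixed $input): Generator
    {
        $sanitized = $this->validateAndSanitizeInput($input);

        yield from $this->invoke($sanitized);
    }

    /**
     * Default behavior in this sketch: pass the input through unchanged.
     */
    protected function validateAndSanitizeInput(mixed $input): mixed
    {
        return $input;
    }

    abstract protected function invoke(mixed $input): Generator;
}

class TrimmedLinesStep extends SketchStep
{
    protected function validateAndSanitizeInput(mixed $input): mixed
    {
        if (!is_string($input)) {
            throw new InvalidArgumentException('Input must be a string.');
        }

        // Normalize line endings so invoke() only has to deal with "\n".
        return str_replace(["\r\n", "\r"], "\n", trim($input));
    }

    protected function invoke(mixed $input): Generator
    {
        foreach (explode("\n", $input) as $line) {
            yield $line;
        }
    }
}

$step = new TrimmedLinesStep();

// Valid input: invoke() only ever sees the sanitized string.
$lines = iterator_to_array($step->run("foo\r\nbar\n"), false);

// Invalid input: the validation rejects it before invoke() is reached.
$rejected = false;

try {
    iterator_to_array($step->run(42), false);
} catch (InvalidArgumentException) {
    $rejected = true;
}
```

Keeping validation separate means invoke() can rely on a single, predictable input type, as the @param annotation in the example above documents.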