Documentation for crwlr / crawler (v2.0)

Upgrade from v1.x to 2.0

Removed Step Methods for Result Composition

Likelihood Of Impact: High

The deprecated methods addToResult(), addLaterToResult(), and keepInputData() have been removed from the BaseStep class. These calls can be replaced with the new keep methods:

// Change:
$step->addToResult();
// to
$step->keep();

// and
$step->addToResult(['foo', 'bar']);
// to
$step->keep(['foo', 'bar']);

// When assigning a key to scalar output values (which have no key of their own):
$step->addToResult('foo');
// change to
$step->keepAs('foo');

// When keeping data from step inputs:
$step->keepInputData();
// change to
$step->keepFromInput();

// This has even become more flexible: you can now pick keys from the input.
$step->keepFromInput(['foo', 'bar']);

// When assigning a key to scalar input values:
$step->keepInputData('foo');
// change to
$step->keepInputAs('foo');
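
To show how these replacements fit together, here is a minimal sketch of a crawler that uses several of the keep methods in one chain. The bot name, URLs, selectors and result keys are hypothetical and chosen only for illustration.

```php
<?php

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// Hypothetical bot name, URL and selectors.
$crawler = HttpCrawler::make()->withBotUserAgent('MyBot');

$crawler
    ->input('https://www.example.com/articles')
    ->addStep(
        Http::get()
            // The input is a scalar URL string, so it needs a key.
            ->keepInputAs('listUrl')
    )
    ->addStep(
        Html::getLinks('.article a')
            // Each output is a scalar URL string, so it needs a key as well.
            ->keepAs('articleUrl')
    )
    ->addStep(Http::get())
    ->addStep(
        Html::root()
            ->extract([
                'title' => 'h1',
                'teaser' => '.teaser',
                'sidebar' => '.sidebar',
            ])
            // Array output: keep only the selected keys.
            ->keep(['title', 'teaser'])
    );
```

The sketch only builds the crawler. Iterating over `$crawler->run()` would start the crawl, and each result should then carry the kept keys (`listUrl`, `articleUrl`, `title`, `teaser`).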

Crawler::addStep() Signature Change

Likelihood Of Impact: Medium

The signature of Crawler::addStep() has changed. Previously, you could optionally pass a result key as the first argument; this option has been removed. When a key was passed, it was handed to Step::addToResult() internally, so you now need to call the matching keep method on the step yourself. For example:

use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// Previously:
$crawler
    ->input('https://www.example.com/foo')
    ->addStep(Http::get())
    ->addStep(
        'some_url',
        Html::getLink('.something'),
    );

// Now:
$crawler
    ->input('https://www.example.com/foo')
    ->addStep(Http::get())
    ->addStep(
        Html::getLink('.something')
            ->keepAs('some_url')
    );

Changed LoadingStep class to trait and Loader Assignment

Likelihood Of Impact: Medium

This update only affects you if you have a custom loading step that extends the (removed) LoadingStep class and possibly even used the (undocumented) functionality for working with multiple different loaders.

The update involves two key changes:

  • The LoadingStep class has been refactored into a trait.
  • Returning multiple loaders as an array from Crawler::loader() is no longer supported.

From Class to Trait

The addLoader() method of the Crwlr\Crawler\Steps\Loading\LoadingStep class has been renamed to setLoader() in the new LoadingStep trait, with the same functionality. This should rarely matter, since the method is mainly intended for internal use. A single primary loader should still be defined via the Crawler::loader() method.

When accessing the loader from within a custom loading step, use the getLoader() method instead of directly accessing the loader property, which is no longer allowed due to its visibility changing from protected to private in the trait. Therefore:

  • If you’ve redefined the loader property in your custom loading step, remove it. To narrow the loader type, use a generic type hint as shown here in the docs.
  • If you really need to set a loader from within the loading step class itself, use the setLoader() method. For external loader assignment, see the instructions further below.
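
The points above could look roughly like this in a custom loading step. This is a sketch, not the library's reference implementation: the class name MyLoadingStep is hypothetical, and the trait's namespace and the placement of the `@use` annotation are assumptions based on the old class location and common PHPStan conventions.

```php
<?php

use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Steps\Loading\LoadingStep;
use Crwlr\Crawler\Steps\Step;

// Hypothetical custom loading step.
class MyLoadingStep extends Step
{
    /**
     * Narrow the loader type with a generic type hint
     * instead of redefining the loader property.
     *
     * @use LoadingStep<HttpLoader>
     */
    use LoadingStep;

    protected function invoke(mixed $input): Generator
    {
        // Access the loader via getLoader(), not via the (now private) property.
        $response = $this->getLoader()->load($input);

        if ($response !== null) {
            yield $response;
        }
    }
}
```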

The methods useLoader() and usesLoader() from the old class have been removed without replacement, leading to the second part of the change.

Assigning Different Loaders to Different Steps

Previously, you could return multiple loaders from Crawler::loader() as an array and specify which loader to use with $step->useLoader('foo'). This is no longer possible. Now, to assign a different loader to specific steps using the LoadingStep trait, you can use the withLoader() method to directly pass the loader instance.

Example:

Old (v1.x):

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends Crawler
{
    // userAgent() method here

    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface|array
    {
        return [
            'http' => new HttpLoader($userAgent, logger: $logger),
            'ftp' => new MyCustomFtpLoader($userAgent, $logger),
        ];
    }
}

$crawler = new MyCrawler();

$crawler
    ->input('https://www.example.com/foo')
    ->addStep(Http::get()->useLoader('http')) // Use loader behind key 'http'.
    ->addStep(Html::getLink('.ftp_link'))
    ->addStep(MyFtpStep::fetch()->useLoader('ftp')); // Use 'ftp' loader.

New (v2.0):

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = HttpCrawler::make()->withBotUserAgent('MyBot');

$ftpLoader = new MyCustomFtpLoader($crawler->getUserAgent(), $crawler->getLogger());

$crawler
    ->input('https://www.example.com/foo')
    ->addStep(Http::get()) // Just use the default HTTP loader.
    ->addStep(Html::getLink('.ftp_link'))
    ->addStep(MyFtpStep::fetch()->withLoader($ftpLoader)); // Usage of new ->withLoader() method.

As a side effect of this change, the Crwlr\Crawler\Exceptions\UnknownLoaderKeyException exception class has also been removed. If you reference it in a catch statement, you can simply remove that reference.

Moved HttpLoader Methods

Likelihood Of Impact: Medium

Some deprecated methods have been moved from the Crwlr\Crawler\Loader\Http\HttpLoader to its browser helper dependency.

// Assuming $loader is an instance of HttpLoader.

// Change

$loader->setHeadlessBrowserOptions([/*...*/]);
// to
$loader->browser()->setOptions([/*...*/]);

$loader->addHeadlessBrowserOptions([/*...*/]);
// to
$loader->browser()->addOptions([/*...*/]);

$loader->setChromeExecutable('foo');
// to
$loader->browser()->setExecutable('foo');

$loader->browserHelper();
// to
$loader->browser();

Changed HttpLoader::retryCachedErrorResponses() Method

Likelihood Of Impact: Medium

The HttpLoader::retryCachedErrorResponses() method now returns an instance of the new Crwlr\Crawler\Loader\Http\Cache\RetryManager class, allowing more granular configuration. Previously, this method returned the HttpLoader itself ($this), so if you’ve been chaining it with other loader methods, you will need to refactor.

Example:

// Assuming $loader is an instance of HttpLoader.

// Change

$loader
    ->retryCachedErrorResponses()
    ->dontUseCookies();

// either to

$loader
    ->dontUseCookies()
    ->retryCachedErrorResponses();

// or

$loader->retryCachedErrorResponses();
$loader->dontUseCookies();
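
As an illustration of the more granular configuration, the returned RetryManager can be used to restrict retries to certain status codes. The sketch below assumes that RetryManager provides `only()` and `except()` methods accepting an array of HTTP status codes; this upgrade guide does not list its methods, so check the class before relying on them. The bot name and status codes are hypothetical.

```php
<?php

use Crwlr\Crawler\HttpCrawler;

$crawler = HttpCrawler::make()->withBotUserAgent('MyBot');

$loader = $crawler->getLoader();

// Configure other loader options first, because
// retryCachedErrorResponses() no longer returns the loader.
$loader->dontUseCookies();

// Assumption: only() limits retries to the given status codes.
$loader
    ->retryCachedErrorResponses()
    ->only([429, 503]);

// Assumption: except() would retry all cached error responses
// other than the given status codes, e.g.:
// $loader->retryCachedErrorResponses()->except([404, 410]);
```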

Removal of the addLaterToResult() Method

Likelihood Of Impact: Medium

This method should no longer be necessary. Here’s why: Previously, addToResult() would create a Result object at the step where it was called, and if the next step produced multiple outputs from a single input, all those outputs shared the same Result object. This meant that data from multiple outputs was combined into a single Result. If you wanted to delay creating a Result and instead keep data for all future outputs/results, you would use addLaterToResult().

However, the new keep methods work differently. They create Result objects only at the end of the crawling procedure, and just copy kept data to all outputs. Therefore, in examples like those in the v1.7 docs, you can replace addLaterToResult() with keep() (or keepAs() for steps producing scalar outputs).

// In this example, we're retrieving multiple books as separate Result objects,
// with the author name extracted from the author detail page, which leads to multiple
// book detail pages.

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/authors/patricia-highsmith')
    ->addStep(Http::get())
    ->addStep(
        Html::root()
            ->extract([
                'author' => 'h1',
                'bookUrls' => Dom::cssSelector('#author-data .books a.book')->link(),
            ])
            // Instead of creating a Result object here, we store the author name
            // to add it later to the individual book details.
            ->addLaterToResult(['author'])
    )
    ->addStep(Http::get()->useInputKey('bookUrls'))
    ->addStep(
        Html::root()
            ->extract([/* book details like title, year, description */])
            // Now create the results and the previously stored author name is included.
            ->addToResult()
    );

If addToResult() had been used instead of addLaterToResult(), we would have ended up with one Result object per author, containing arrays of titles, years, and descriptions.

This example can now be changed to:

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/authors/patricia-highsmith')
    ->addStep(Http::get())
    ->addStep(
        Html::root()
            ->extract([
                'author' => 'h1',
                'bookUrls' => Dom::cssSelector('#author-data .books a.book')->link(),
            ])
            // This keep() call has no effect on the behavior of following steps,
            // besides passing on the author property until the end of the crawling
            // procedure.
            ->keep(['author'])
    )
    ->addStep(Http::get()->useInputKey('bookUrls'))
    ->addStep(
        Html::root()->extract([/* book details like title, year, description */])
    );

Changes for Custom Paginator Implementations

Likelihood Of Impact: Medium

The deprecated PaginatorInterface has been removed. Instead of implementing it, extend Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator. Make sure to extend the class in exactly this namespace, because the older, deprecated Crwlr\Crawler\Steps\Loading\Http\Paginators\AbstractPaginator has also been removed.

Further changes in the Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator class:

  • The first argument UriInterface $url has been removed from the processLoaded() method, as the URL is also part of the request (Psr\Http\Message\RequestInterface), which is the new first argument.
  • The default implementation of getNextRequest() has been removed. Child implementations must define this method themselves.
  • If your custom paginator still includes a getNextUrl() method, note that it is no longer needed by the library and will not be called. The getNextRequest() method now serves the purpose that getNextUrl() originally had.
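
A custom paginator following these changes might look like the sketch below. Several details are assumptions rather than facts from this guide: the second parameter of processLoaded() (`?RespondedRequest $respondedRequest`), the existence of a non-abstract parent implementation of processLoaded(), and the return type of getNextRequest(). The class name, property names and the `page` query parameter are hypothetical.

```php
<?php

use Crwlr\Crawler\Loader\Http\Messages\RespondedRequest;
use Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator;
use Psr\Http\Message\RequestInterface;

// Hypothetical paginator that increments a "page" query parameter.
final class PageParamPaginator extends AbstractPaginator
{
    private ?RequestInterface $previousRequest = null;

    private int $currentPageNumber = 1;

    // The request is now the first argument; the separate URL argument is gone.
    // The second parameter is an assumption about the v2.0 signature.
    public function processLoaded(RequestInterface $request, ?RespondedRequest $respondedRequest): void
    {
        // Assumption: the parent class has a default implementation to call.
        parent::processLoaded($request, $respondedRequest);

        $this->previousRequest = $request;
    }

    // There is no default implementation anymore, so this must be defined.
    public function getNextRequest(): ?RequestInterface
    {
        if ($this->previousRequest === null) {
            return null;
        }

        $this->currentPageNumber++;

        // The URL is taken from the request itself.
        // Note: withQuery() replaces the whole query string.
        $nextUri = $this->previousRequest
            ->getUri()
            ->withQuery('page=' . $this->currentPageNumber);

        return $this->previousRequest->withUri($nextUri);
    }
}
```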

Moved Microseconds Util Class to crwlr/utils package

Likelihood Of Impact: Low

The deprecated Crwlr\Crawler\Loader\Http\Politeness\TimingUnits\Microseconds class has been removed. Use the version that is now part of the crwlr/utils package (Crwlr\Utils\Microseconds) instead.
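
In most cases this should only require changing the import. The following sketch shows the new import used with the loader's throttling settings; the bot name and wait times are hypothetical.

```php
<?php

use Crwlr\Crawler\HttpCrawler;
// Previously: use Crwlr\Crawler\Loader\Http\Politeness\TimingUnits\Microseconds;
use Crwlr\Utils\Microseconds;

$crawler = HttpCrawler::make()->withBotUserAgent('MyBot');

// Wait between one and two seconds between requests to the same host.
$crawler
    ->getLoader()
    ->throttle()
    ->waitBetween(
        Microseconds::fromSeconds(1.0),
        Microseconds::fromSeconds(2.0),
    );
```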

Removal of the result and addLaterToResult Properties of Input and Output objects

Likelihood Of Impact: Low

Due to the removal of the aforementioned step methods (addToResult() and the related methods) and the shift away from creating Result objects mid-crawling, these properties are now obsolete. Data kept by the new keep methods is stored in the keep property of Input and Output objects. However, direct access to this property should generally be unnecessary, as these objects are mostly used internally.

Removal of RespondedRequest::cacheKeyFromRequest()

Likelihood Of Impact: Very Low

If you've been using Crwlr\Crawler\Loader\Http\Messages\RespondedRequest::cacheKeyFromRequest(), you can use Crwlr\Crawler\Utils\RequestKey::from() instead.
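
A minimal sketch of the replacement call, assuming that RequestKey::from() accepts a PSR-7 request and returns the key as a string. The request is built with Guzzle's PSR-7 implementation here, and the URL is hypothetical.

```php
<?php

use Crwlr\Crawler\Utils\RequestKey;
use GuzzleHttp\Psr7\Request;

$request = new Request('GET', 'https://www.example.com/foo');

// Previously: RespondedRequest::cacheKeyFromRequest($request)
$key = RequestKey::from($request);
```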

Removed Internal Step Methods

Likelihood Of Impact: Very Low

The internal methods addsToOrCreatesResult() and createsResult() have been removed. They were never documented and were intended for internal use only. Similarly, the new methods keepsAnything(), keepsAnythingFromInputData(), and keepsAnythingFromOutputData() are designed for library internals only and should not be necessary in your own code.

Changes to StepInterface

Likelihood Of Impact: Very Low

If you’ve built custom steps by directly implementing Crwlr\Crawler\Steps\StepInterface instead of extending the Crwlr\Crawler\Steps\Step class (extending is what the documentation recommends), be aware that this interface has undergone significant changes. We strongly recommend switching to extending the Crwlr\Crawler\Steps\Step class for future compatibility and ease of use.
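
For reference, a custom step that extends the Step class only needs to implement the invoke() method. The step below is a hypothetical example that counts the words in a string input.

```php
<?php

use Crwlr\Crawler\Steps\Step;

// Hypothetical custom step: outputs the input text with its word count.
final class WordCountStep extends Step
{
    protected function invoke(mixed $input): Generator
    {
        $text = (string) $input;

        yield [
            'text' => $text,
            'words' => str_word_count($text),
        ];
    }
}
```

Because the step extends Step, it inherits methods such as keep() and keepAs(), so `(new WordCountStep())->keep(['words'])` can be passed to addStep() like any built-in step.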

Removed Object Serialization Method

Likelihood Of Impact: Very Low

If you've built custom steps that return objects as outputs, the library needs to convert those objects to arrays in some situations. This was never mentioned in the documentation, but when converting, the library looks for conversion methods on the objects. One of the possible conversion method names was toArrayForAddToResult(), which has been removed. It can be replaced with toArrayForResult().
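
As a sketch, an output object with the renamed conversion method could look like this. The Book class and its properties are hypothetical.

```php
<?php

// Hypothetical object returned as output by a custom step.
final class Book
{
    public function __construct(
        private string $title,
        private int $year,
    ) {}

    // Previously named toArrayForAddToResult().
    public function toArrayForResult(): array
    {
        return [
            'title' => $this->title,
            'year' => $this->year,
        ];
    }
}
```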