# Upgrade from v1.x to 2.0

## Removed Step Methods for Result Composition

Likelihood of Impact: High

The deprecated methods `addToResult()`, `addLaterToResult()`, and `keepInputData()` have been removed from the `BaseStep` class. These calls can be replaced with the new keep methods:
```php
// Change:
$step->addToResult();
// to:
$step->keep();

// and
$step->addToResult(['foo', 'bar']);
// to:
$step->keep(['foo', 'bar']);

// When assigning a key to scalar output values without a key, change:
$step->addToResult('foo');
// to:
$step->keepAs('foo');

// When keeping data from step inputs, change:
$step->keepInputData();
// to:
$step->keepFromInput();

// This even became more flexible: you can now pick keys from the input.
$step->keepFromInput(['foo', 'bar']);

// When assigning a key to scalar input values, change:
$step->keepInputData('foo');
// to:
$step->keepInputAs('foo');
```
## Crawler::addStep() Signature Change

Likelihood of Impact: Medium

The signature of the `Crawler::addStep()` method has changed. Previously, you could optionally pass a result key as the first parameter; this option has been removed. When used, the key was passed to `Step::addToResult()` internally. You now need to handle this manually. For example:
```php
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// Previously:
$crawler
    ->input('https://www.example.com/foo')
    ->addStep(Http::get())
    ->addStep(
        'some_url',
        Html::getLink('.something'),
    );

// Now:
$crawler
    ->input('https://www.example.com/foo')
    ->addStep(Http::get())
    ->addStep(
        Html::getLink('.something')->keepAs('some_url')
    );
```
## Changed LoadingStep Class to Trait and Loader Assignment

Likelihood of Impact: Medium

This update only affects you if you have a custom loading step that extends the (removed) `LoadingStep` class, and possibly even used the (undocumented) functionality for working with multiple different loaders.

The update involves two key changes:

- The `LoadingStep` class has been refactored into a trait.
- Returning multiple loaders as an array from `Crawler::loader()` is no longer supported.
### From Class to Trait

The `addLoader()` method of the `Crwlr\Crawler\Steps\Loading\LoadingStep` class has been renamed to `setLoader()` in the new `LoadingStep` trait, retaining the same functionality. This should generally be irrelevant to you, since the method is mainly intended for internal use. A single primary loader should still be defined via the `Crawler::loader()` method.

When accessing the loader from within a custom loading step, use the `getLoader()` method instead of directly accessing the `loader` property, which is no longer possible because its visibility changed from protected to private in the trait. Therefore:

- If you've redefined the `loader` property in your custom loading step, remove it. To narrow the loader type, use a generic type hint as shown in the docs.
- If you really need to set a loader from within the loading step class itself, use the `setLoader()` method. For external loader assignment, see the instructions further below.
The `useLoader()` and `usesLoader()` methods from the old class have been removed without replacement, which leads to the second part of this change.
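To illustrate, a custom loading step accessing its loader could now look roughly like this (a sketch only: the `MyFtpStep` class, its `load()` call, and the exact `invoke()` signature are illustrative assumptions, not part of the upgrade guide):

```php
use Crwlr\Crawler\Steps\Loading\LoadingStep;
use Crwlr\Crawler\Steps\Step;
use Generator;

class MyFtpStep extends Step
{
    use LoadingStep;

    protected function invoke(mixed $input): Generator
    {
        // Access the loader via getLoader() instead of the now
        // private loader property.
        $response = $this->getLoader()->load($input);

        if ($response) {
            yield $response;
        }
    }
}
```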
### Assigning Different Loaders to Different Steps

Previously, you could return multiple loaders from `Crawler::loader()` as an array and specify which one a step should use via `$step->useLoader('foo')`. This is no longer possible. To assign a different loader to specific steps using the `LoadingStep` trait, use the new `withLoader()` method to pass the loader instance directly.

Example:

Old (v1.x):
```php
use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends Crawler
{
    // userAgent() method here

    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface|array
    {
        return [
            'http' => new HttpLoader($userAgent, logger: $logger),
            'ftp' => new MyCustomFtpLoader($userAgent, $logger),
        ];
    }
}

$crawler = new MyCrawler();

$crawler
    ->input('https://www.example.com/foo')
    ->addStep(Http::get()->useLoader('http')) // Use the loader behind key 'http'.
    ->addStep(Html::getLink('.ftp_link'))
    ->addStep(MyFtpStep::fetch()->useLoader('ftp')); // Use the 'ftp' loader.
```
New (v2.0):
```php
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = HttpCrawler::make()->withBotUserAgent('MyBot');

$ftpLoader = new MyCustomFtpLoader($crawler->getUserAgent(), $crawler->getLogger());

$crawler
    ->input('https://www.example.com/foo')
    ->addStep(Http::get()) // Just use the default HTTP loader.
    ->addStep(Html::getLink('.ftp_link'))
    ->addStep(MyFtpStep::fetch()->withLoader($ftpLoader)); // Use the new ->withLoader() method.
```
As a side effect of this change, the `Crwlr\Crawler\Exceptions\UnknownLoaderKeyException` exception class was also removed. If you're referencing it in a catch statement, you can simply remove it.
## Moved HttpLoader Methods

Likelihood of Impact: Medium

Some deprecated methods have been moved from `Crwlr\Crawler\Loader\Http\HttpLoader` to its browser helper dependency:
```php
// Assuming $loader is an instance of HttpLoader.

// Change
$loader->setHeadlessBrowserOptions([/* ... */]);
// to
$loader->browser()->setOptions([/* ... */]);

// Change
$loader->addHeadlessBrowserOptions([/* ... */]);
// to
$loader->browser()->addOptions([/* ... */]);

// Change
$loader->setChromeExecutable('foo');
// to
$loader->browser()->setExecutable('foo');

// Change
$loader->browserHelper();
// to
$loader->browser();
```
## Changed HttpLoader::retryCachedErrorResponses() Method

Likelihood of Impact: Medium

The `HttpLoader::retryCachedErrorResponses()` method now returns an instance of the new `Crwlr\Crawler\Loader\Http\Cache\RetryManager` class, which allows more granular configuration. Previously, this method returned the `HttpLoader` itself (`$this`), so if you've been chaining it with other loader methods, you will need to refactor those calls.

Example:
```php
// Assuming $loader is an instance of HttpLoader.

// Change
$loader
    ->retryCachedErrorResponses()
    ->dontUseCookies();

// either to
$loader
    ->dontUseCookies()
    ->retryCachedErrorResponses();

// or to
$loader->retryCachedErrorResponses();

$loader->dontUseCookies();
```
## Removal of the addLaterToResult() Method

Likelihood of Impact: Medium

This method should no longer be necessary. Here's why: previously, `addToResult()` would create a `Result` object at the step where it was called, and if the next step produced multiple outputs from a single input, all those outputs shared the same `Result` object. This meant that data from multiple outputs was combined into a single `Result`. If you wanted to delay creating a `Result` and instead keep data for all future outputs/results, you would use `addLaterToResult()`.

The new keep methods work differently: they create `Result` objects only at the end of the crawling procedure and just copy the kept data to all outputs. Therefore, in examples like those in the v1.7 docs, you can replace `addLaterToResult()` with `keep()` (or `keepAs()` for steps producing scalar outputs).
```php
// In this example, we're retrieving multiple books as separate Result objects,
// with the author name extracted from the author detail page, which leads to
// multiple book detail pages.

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/authors/patricia-highsmith')
    ->addStep(Http::get())
    ->addStep(
        Html::root()
            ->extract([
                'author' => 'h1',
                'bookUrls' => Dom::cssSelector('#author-data .books a.book')->link(),
            ])
            // Instead of creating a Result object here, we store the author name
            // to add it later to the individual book details.
            ->addLaterToResult(['author'])
    )
    ->addStep(Http::get()->useInputKey('bookUrls'))
    ->addStep(
        Html::root()
            ->extract([/* book details like title, year, description */])
            // Now create the results; the previously stored author name is included.
            ->addToResult()
    );
```
If `addToResult()` had been used instead of `addLaterToResult()`, we would have ended up with one `Result` object per author, containing arrays of titles, years, and descriptions.

This example can now be changed to:
```php
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/authors/patricia-highsmith')
    ->addStep(Http::get())
    ->addStep(
        Html::root()
            ->extract([
                'author' => 'h1',
                'bookUrls' => Dom::cssSelector('#author-data .books a.book')->link(),
            ])
            // This keep() call has no effect on the behavior of the following
            // steps, besides passing on the author property until the end of
            // the crawling procedure.
            ->keep(['author'])
    )
    ->addStep(Http::get()->useInputKey('bookUrls'))
    ->addStep(
        Html::root()->extract([/* book details like title, year, description */])
    );
```
## Changes for Custom Paginator Implementations

Likelihood of Impact: Medium

The deprecated `PaginatorInterface` has been removed. Instead of implementing it, extend `Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator`. Be cautious, as an older, also deprecated version of that class in `Crwlr\Crawler\Steps\Loading\Http\Paginators\AbstractPaginator` has been removed as well.

Further changes in the `Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator` class:

- The first argument `UriInterface $url` has been removed from the `processLoaded()` method, as the URL is also part of the request (`Psr\Http\Message\RequestInterface`), which is the new first argument.
- The default implementation of `getNextRequest()` has been removed. Child implementations must define this method themselves.
- If your custom paginator still includes a `getNextUrl()` method, note that it is no longer needed by the library and will not be called. The `getNextRequest()` method now fulfills its original purpose.
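Put together, a custom paginator could now be shaped roughly like this (a sketch only: the class name, the second `processLoaded()` parameter, and the method bodies are assumptions here, so check the `AbstractPaginator` source for the exact signatures):

```php
use Crwlr\Crawler\Loader\Http\Messages\RespondedRequest;
use Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator;
use Psr\Http\Message\RequestInterface;

class MyPaginator extends AbstractPaginator
{
    // The request is now the first argument; the URL is still
    // available via $request->getUri().
    public function processLoaded(RequestInterface $request, ?RespondedRequest $respondedRequest): void
    {
        // Inspect the loaded page and remember the URL of the next page...
    }

    // There is no default implementation anymore, so child classes
    // must define this method themselves.
    public function getNextRequest(): ?RequestInterface
    {
        // Build and return the request for the next page, or null when
        // there is no further page...
        return null;
    }
}
```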
## Moved Microseconds Util Class to the crwlr/utils Package

Likelihood of Impact: Low

The deprecated `Crwlr\Crawler\Loader\Http\Politeness\TimingUnits\Microseconds` class has been removed. Use the version that is now part of the crwlr/utils package (`Crwlr\Utils\Microseconds`) instead.
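In most cases this should only require swapping the import:

```php
// Change
use Crwlr\Crawler\Loader\Http\Politeness\TimingUnits\Microseconds;

// to
use Crwlr\Utils\Microseconds;
```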
## Removal of the result and addLaterToResult Properties of Input and Output Objects

Likelihood of Impact: Low

Due to the removal of the aforementioned step methods (`addToResult()`, ...) and the shift away from creating `Result` objects mid-crawl, these properties are now obsolete. Data kept by the new keep methods is stored in the `keep` property of `Input` and `Output` objects. However, direct access to this property should generally be unnecessary, as these objects are mostly used internally.
## Removal of RespondedRequest::cacheKeyFromRequest()

Likelihood of Impact: Very Low

If you've been using `Crwlr\Crawler\Loader\Http\Messages\RespondedRequest::cacheKeyFromRequest()`, you can use `Crwlr\Utils\RequestKey::from()` from the crwlr/utils package instead.
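The migration is a simple swap of the static call (assuming the relevant classes are imported and `$request` is a PSR-7 request instance):

```php
// Change
$key = RespondedRequest::cacheKeyFromRequest($request);

// to
$key = RequestKey::from($request);
```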
## Removal of Internal Methods Related to addToResult()

Likelihood of Impact: Very Low

The internal methods `addsToOrCreatesResult()` and `createsResult()` have been removed. They weren't documented and were intended for internal use only. Similarly, the new methods `keepsAnything()`, `keepsAnythingFromInputData()`, and `keepsAnythingFromOutputData()` are designed for library internals only and should not be necessary in your own code.
## Changes to StepInterface

Likelihood of Impact: Very Low

If you've built custom steps by directly implementing `Crwlr\Crawler\Steps\StepInterface` instead of extending the `Crwlr\Crawler\Steps\Step` class (as recommended in the documentation), be aware that this interface has undergone significant changes. We strongly recommend switching to extending the `Crwlr\Crawler\Steps\Step` class for future compatibility and ease of use.
Removed Object Serialization Method
Likelihood Of Impact: Very Low
If you've built custom steps returning object outputs, in some situations the library needs to convert those objects to arrays. This was never mentioned in the documentation, but when converting, the library looks for conversion methods in the objects. One of those possible conversion method names was toArrayForAddToResult()
which was removed. It can be replaced with toArrayForResult()
.
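For a custom output object, this should usually just mean renaming the conversion method (the `Book` class and its properties here are made up for illustration):

```php
class Book
{
    public function __construct(
        public readonly string $title,
        public readonly int $year,
    ) {}

    // Previously named toArrayForAddToResult().
    public function toArrayForResult(): array
    {
        return ['title' => $this->title, 'year' => $this->year];
    }
}
```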