Upgrade from v2.x to 3.0
The primary change in version 3.0.0 is that the library now leverages PHP 8.4’s new DOM API when used in an environment with PHP >= 8.4. To maintain compatibility with PHP < 8.4, an abstraction layer has been implemented. This layer dynamically uses either the Symfony DomCrawler component or the new DOM API, depending on the PHP version.
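The version switch described above can be pictured with a minimal sketch. This is not the library's actual implementation: the function name chooseDomBackend() and the returned labels are hypothetical and only illustrate a check against the running PHP version.

```php
<?php

// Hypothetical helper, not part of the library: returns a label for the
// DOM backend that would be picked for the given PHP version.
function chooseDomBackend(string $phpVersion = PHP_VERSION): string
{
    return version_compare($phpVersion, '8.4.0', '>=')
        ? 'php-dom-api'          // PHP 8.4's new DOM API
        : 'symfony-dom-crawler'; // fallback for older PHP versions
}

echo chooseDomBackend('8.3.12'), PHP_EOL; // symfony-dom-crawler
echo chooseDomBackend('8.4.1'), PHP_EOL;  // php-dom-api
```

Called without an argument, the helper checks the PHP version the script is currently running on.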
Because the steps provided by the library never required you to interact directly with a Symfony DomCrawler instance, it is highly likely that you won’t need to change any code to upgrade to v3. To ensure a smooth transition, please review the following changes.
Removal of DomQuery::innerText()
Likelihood Of Impact: Medium
The Crwlr\Crawler\Steps\Html\DomQuery::innerText() method (used in Crwlr\Crawler\Steps\Dom::cssSelector('...') / Crwlr\Crawler\Steps\Dom::xPath('...')) has been removed. innerText exists only in the Symfony DomCrawler component, and its usefulness is questionable. If you still require this variant of the DOM element text, please let us know or create a pull request yourself. Thank you!
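use Crwlr\Crawler\Steps\Dom;

// The innerText() method is gone, so calls like these no longer work:
Dom::cssSelector('article h1')->innerText();
Dom::xPath('//bookstore/book/title')->innerText();

// Either replace them with ->text() or tell us if you still need innerText():
Dom::cssSelector('article h1')->text();
Dom::xPath('//bookstore/book/title')->text();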
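If you are unsure whether switching to ->text() changes your results: in the Symfony DomCrawler component, innerText() returns only the text that sits directly inside the matched element, without the text of its child elements, while text() also includes the text of descendants. The following sketch illustrates that difference with PHP's built-in DOMDocument (it assumes the dom extension is available). The helper directText() is hypothetical, is not part of the library, and only approximates the removed behavior.

```php
<?php

// Hypothetical helper, not part of the library: concatenates only the text
// nodes that are direct children of $element and trims the result.
function directText(DOMElement $element): string
{
    $text = '';

    foreach ($element->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $text .= $child->nodeValue;
        }
    }

    return trim($text);
}

$document = new DOMDocument();
$document->loadHTML('<h1>Hello <span>World</span></h1>');

$h1 = $document->getElementsByTagName('h1')->item(0);

echo $h1->textContent, PHP_EOL; // Hello World (includes the <span> text)
echo directText($h1), PHP_EOL;  // Hello (direct text only)
```

If your selectors target elements without nested child elements, both variants yield the same text, and replacing innerText() with text() should not change the extracted values.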
Changed Argument Type in Http::crawl()->customFilter() Closure
Likelihood Of Impact: Medium
The second argument of closures passed to the Crwlr\Crawler\Steps\Loading\Http::crawl()->customFilter() method has changed from an instance of the Symfony Crawler class to a Crwlr\Crawler\Steps\Dom\HtmlElement instance from the new DOM abstraction.
use Crwlr\Crawler\Steps\Dom\HtmlElement;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Url\Url;

$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->customFilter(function (Url $url, ?HtmlElement $linkElement) {
                // The $linkElement argument changed from a Symfony Crawler to an HtmlElement.
                return $linkElement && str_contains($linkElement->text(), 'Foo');
            })
    );
Move Filter Class Functionality to AbstractFilter
Likelihood Of Impact: Medium
The actual functionality of the Crwlr\Crawler\Steps\Filters\Filter class was moved to Crwlr\Crawler\Steps\Filters\AbstractFilter. The only methods remaining in the Crwlr\Crawler\Steps\Filters\Filter class are the static functions that provide the filter object instances. Without this split, every filter class would also inherit all of those static methods.
So, if you have implemented your own custom filter classes, make them extend Crwlr\Crawler\Steps\Filters\AbstractFilter instead of Crwlr\Crawler\Steps\Filters\Filter.
use Crwlr\Crawler\Steps\Filters\AbstractFilter;

class MyFilter extends AbstractFilter // previously extended Filter
{
    // ...
}
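The reasoning behind this split can be shown with a small, self-contained sketch. All class names in it (LegacyFilter, SketchAbstractFilter, SketchFilter and so on) are hypothetical stand-ins and not the library's classes; the sketch only shows how static factory methods on a shared base class end up on every subclass, and how moving them to a separate class avoids that.

```php
<?php

// All classes below are hypothetical stand-ins, not the library's classes.

// Old layout: one base class holds both the filter logic and the factories.
abstract class LegacyFilter
{
    abstract public function evaluate(mixed $value): bool;

    public static function equal(mixed $expected): LegacyFilter
    {
        return new LegacyEqualFilter($expected);
    }
}

class LegacyEqualFilter extends LegacyFilter
{
    public function __construct(private mixed $expected) {}

    public function evaluate(mixed $value): bool
    {
        return $value === $this->expected;
    }
}

// New layout: the logic lives in an abstract base class,
// the static factories in a separate class.
abstract class SketchAbstractFilter
{
    abstract public function evaluate(mixed $value): bool;
}

class SketchEqualFilter extends SketchAbstractFilter
{
    public function __construct(private mixed $expected) {}

    public function evaluate(mixed $value): bool
    {
        return $value === $this->expected;
    }
}

final class SketchFilter
{
    public static function equal(mixed $expected): SketchAbstractFilter
    {
        return new SketchEqualFilter($expected);
    }
}

// The factory leaks into the subclass in the old layout only.
var_dump(method_exists(LegacyEqualFilter::class, 'equal')); // bool(true)
var_dump(method_exists(SketchEqualFilter::class, 'equal')); // bool(false)
var_dump(SketchFilter::equal('foo')->evaluate('foo'));      // bool(true)
```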
Visibility Change of DomQuery::filter()
Likelihood Of Impact: Low
The visibility of the Crwlr\Crawler\Steps\Html\DomQuery::filter() method was changed from public to protected. It is still needed inside the Crwlr\Crawler\Steps\Html\DomQuery class, but outside of it, it is probably better and easier to use the new DOM abstraction directly.
use Crwlr\Crawler\Steps\Dom;
use Symfony\Component\DomCrawler\Crawler;

// $html is a string containing an HTML document.

// Change occurrences like this:
$domCrawler = new Crawler($html);
$domCrawlerNodeList = Dom::cssSelector('#list .item')->filter($domCrawler);

// to (Dom\HtmlDocument resolves to Crwlr\Crawler\Steps\Dom\HtmlDocument via the import above):
$document = new Dom\HtmlDocument($html);
$nodeList = $document->querySelectorAll('#list .item');
If you are extending the Crwlr\Crawler\Steps\Html\DomQuery class (which is not recommended), be aware that the filter() method now takes a Crwlr\Crawler\Steps\Dom\Node (from the new DOM abstraction) instead of a Symfony Crawler as its argument.
Method Signature Changes - Replace Symfony Crawler with new DOM Abstraction
Likelihood Of Impact: Low
The signatures of some methods that exist mainly for internal use have changed due to the new DOM abstraction:
- The static Crwlr\Crawler\Steps\Html\GetLink::isSpecialNonHttpLink() method now needs an instance of Crwlr\Crawler\Steps\Dom\HtmlElement instead of a Symfony Crawler.
- Crwlr\Crawler\Steps\Sitemap\GetUrlsFromSitemap::fixUrlSetTag() now takes a Crwlr\Crawler\Steps\Dom\XmlDocument instead of a Symfony Crawler.
- The Crwlr\Crawler\Steps\Html\DomQuery::apply() method now takes a Crwlr\Crawler\Steps\Dom\Node instead of a Symfony Crawler.
DOM Validation Method Replacement
Likelihood Of Impact: Low
The Crwlr\Crawler\Steps\Step::validateAndSanitizeToDomCrawlerInstance() method was removed. Please use the Crwlr\Crawler\Steps\Step::validateAndSanitizeToHtmlDocumentInstance() or Crwlr\Crawler\Steps\Step::validateAndSanitizeToXmlDocumentInstance() method instead, depending on whether your step expects an HTML or an XML document.
use Crwlr\Crawler\Steps\Step;

class MyStep extends Step
{
    // Other methods here.

    protected function validateAndSanitizeInput(mixed $input): mixed
    {
        // Previously: $this->validateAndSanitizeToDomCrawlerInstance($input);
        return $this->validateAndSanitizeToHtmlDocumentInstance($input);

        // or, for XML input:
        // return $this->validateAndSanitizeToXmlDocumentInstance($input);
    }
}
Removal of the DomQueryInterface
Likelihood Of Impact: Low
The Crwlr\Crawler\Steps\Html\DomQueryInterface was removed. As the Crwlr\Crawler\Steps\Html\DomQuery class offers a lot more functionality than the interface defined, the purpose of the interface was questionable. Please use the abstract Crwlr\Crawler\Steps\Html\DomQuery class instead. This also means that the signatures of some methods that type-hinted the interface have changed. Look for occurrences of DomQueryInterface in your code and replace them if there are any.
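One way to look for such occurrences is a small script like the following sketch. The function findOccurrences() is a hypothetical helper, not part of the library; it uses only PHP's standard library and lists every line in the .php files below a directory that mentions a given name.

```php
<?php

/**
 * Hypothetical helper, not part of the library: finds all lines in the
 * .php files below $directory that contain $needle.
 *
 * @return array<int, string> entries formatted as "path:lineNumber"
 */
function findOccurrences(string $directory, string $needle = 'DomQueryInterface'): array
{
    $matches = [];

    // Walk the directory tree; only files (leaves) are yielded.
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($directory, FilesystemIterator::SKIP_DOTS),
    );

    foreach ($files as $file) {
        if (!$file->isFile() || $file->getExtension() !== 'php') {
            continue;
        }

        $lines = file($file->getPathname(), FILE_IGNORE_NEW_LINES);

        foreach ($lines === false ? [] : $lines as $index => $line) {
            if (str_contains($line, $needle)) {
                $matches[] = $file->getPathname() . ':' . ($index + 1);
            }
        }
    }

    sort($matches);

    return $matches;
}

// Example call (the "src" directory is an assumption about your project layout):
// print_r(findOccurrences(__DIR__ . '/src'));
```

The same helper can be pointed at other names removed in v3, for example by passing 'validateAndSanitizeToDomCrawlerInstance' as the second argument.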