Documentation for crwlr / crawler-ext-browser (v2.2)

Infinite Scrolling

The InfiniteScrolling step automates scrolling on a web page loaded in a (headless) browser. It scrolls down the page, waiting 1–2 seconds after each step, until no further scrolling is possible. At the end of the scrolling process, the step extracts the page’s HTML source code and outputs the response (RespondedRequest).
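
The loop described above can be pictured roughly as follows. This is a minimal, self-contained sketch of the idea and not the extension's actual implementation: the function name scrollToEnd and the $scrollOnce and $pause callbacks are hypothetical stand-ins for the real browser interaction, and the page is simulated with a simple counter.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch: keep scrolling until a scroll attempt no longer moves the page.
 *
 * @param callable(): int  $scrollOnce scrolls down once, returns the distance scrolled in pixels
 * @param callable(): void $pause      waits between scroll steps
 * @return int number of scroll steps that actually moved the page
 */
function scrollToEnd(callable $scrollOnce, callable $pause): int
{
    $steps = 0;

    while ($scrollOnce() > 0) {
        $steps++;
        $pause(); // give dynamically loaded content time to render
    }

    return $steps;
}

// Simulated page: 2500px of content left to scroll, at most 1000px per step.
$remaining = 2500;

$scrollOnce = function () use (&$remaining): int {
    $distance = min(1000, $remaining);
    $remaining -= $distance;

    return $distance;
};

$steps = scrollToEnd($scrollOnce, function (): void {
    // Assumption: a real implementation would sleep here for 1 to 2 seconds.
});

echo $steps . PHP_EOL;
```

In the real step, the browser reports how far the page moved, and newly loaded items extend the page so that further scrolling becomes possible.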

Basic Usage

Suppose https://www.example.com/listing provides a list of items that are dynamically loaded via JavaScript as you scroll down the page, until all items are fully loaded and rendered. Here’s the code to handle this scenario:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\CrawlerExtBrowser\Steps\InfiniteScrolling;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$crawler
    ->input('https://www.example.com/listing')
    ->addStep(new InfiniteScrolling())
    ->addStep(
        Html::each('#list .item')->extract([
            'title' => '.title',
            'url' => Dom::cssSelector('a.detail-link')->link(),
        ])
    );

$crawler->runAndDump();

Options

InfiniteScrolling::dontFailOnScrollDistanceZero()

By default, the step throws an exception if it cannot scroll down even once on a page. If it is acceptable that some of the pages you are processing cannot be scrolled, enable this option, and the step will log a warning instead of throwing an exception when it fails to scroll down.

$step = new InfiniteScrolling();

$step->dontFailOnScrollDistanceZero();

InfiniteScrolling::maxRetries(int $retries)

Scrolling down may fail for various reasons. By default, the step retries twice before giving up. Use this method to set the maximum number of retries:

$step = new InfiniteScrolling();

$step->maxRetries(1);
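
To illustrate what a retry limit means in practice, here is a small standalone sketch. It assumes that "retries" are counted in addition to the first attempt, so a limit of 2 allows up to three attempts in total. The helper attemptWithRetries and the simulated $flakyScroll callback are hypothetical and are not part of the extension.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: run $attempt and retry up to $maxRetries times if it throws.
 */
function attemptWithRetries(callable $attempt, int $maxRetries = 2): mixed
{
    $retries = 0;

    while (true) {
        try {
            return $attempt();
        } catch (RuntimeException $exception) {
            if ($retries >= $maxRetries) {
                throw $exception; // retries exhausted, give up
            }

            $retries++;
        }
    }
}

$calls = 0;

// Simulated scroll attempt that fails twice and succeeds on the third call.
$flakyScroll = function () use (&$calls): int {
    $calls++;

    if ($calls < 3) {
        throw new RuntimeException('Scrolling failed');
    }

    return 800; // pixels scrolled
};

$distance = attemptWithRetries($flakyScroll, 2);

echo $distance . PHP_EOL;
```

Under the same assumption, a limit of 1 would let the third call never happen, and the second failure would be rethrown to the caller.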

InfiniteScrolling::useOpenedPage()

By default, the step takes care of loading the URL provided as input. However, if you want to scroll on a page that was already opened in the browser by a previous step, you can enable this option. In this scenario, the complete example from above might look like this:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\CrawlerExtBrowser\Steps\InfiniteScrolling;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$crawler->getLoader()->useHeadlessBrowser();

$crawler
    ->input('https://www.example.com/listing')
    ->addStep(Http::get())
    ->addStep(
        (new InfiniteScrolling())->useOpenedPage()
    )
    ->addStep(
        Html::each('#list .item')->extract([
            'title' => '.title',
            'url' => Dom::cssSelector('a.detail-link')->link(),
        ])
    );

$crawler->runAndDump();