Documentation for crwlr / crawler-ext-browser (v1.4)

Taking a Screenshot

Basic Usage

The Screenshot step, as its name implies, captures an image of the page associated with a given URL. A basic example is demonstrated below:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\CrawlerExtBrowser\Steps\Screenshot;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$myStorePath = __DIR__ . '/storepath';

$crawler
    ->input('https://www.example.com')
    ->addStep(Screenshot::loadAndTake($myStorePath));

$crawler->runAndDump();

Upon executing the crawler, the screenshot image is saved as a file in the specified storage path. The output of this step is a Crwlr\CrawlerExtBrowser\Aggregates\RespondedRequestWithScreenshot object. This class extends the RespondedRequest class from the crawler package, enabling access not only to the screenshot image but also to the response itself in subsequent steps or for adding it to the result.

Properties that can be added to the result are: screenshotPath and all the RespondedRequest properties, namely url, status, headers and body.
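
The output object can also be consumed directly in a subsequent custom step. Below is a minimal sketch; the LogScreenshot class is hypothetical, and it assumes the stored file's path is exposed as a screenshotPath property on the output object:

use Crwlr\Crawler\Steps\Step;
use Crwlr\CrawlerExtBrowser\Aggregates\RespondedRequestWithScreenshot;
use Generator;

class LogScreenshot extends Step
{
    protected function invoke(mixed $input): Generator
    {
        if ($input instanceof RespondedRequestWithScreenshot) {
            // Assumption: the path of the stored screenshot file is
            // available as a screenshotPath property.
            $this->logger?->info('Screenshot stored at ' . $input->screenshotPath);
        }

        // Yield the object unchanged, so later steps can still use the response.
        yield $input;
    }
}

You could then add it to the crawler after the Screenshot step, e.g. via ->addStep(new LogScreenshot()).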

Timeout

The default timeout in the chrome-php library is 30 seconds. If you want a different duration, use the timeout() method in your step definition to set the maximum amount of time the browser should wait for a page to load.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\CrawlerExtBrowser\Steps\Screenshot;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$crawler
    ->input('https://www.example.com')
    ->addStep(
        Screenshot::loadAndTake(__DIR__ . '/storepath')
            ->timeout(120.0) // Seconds.
    );

$crawler->runAndDump();
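
The timeout value is given in seconds, and since the parameter is a float, fractional values (e.g. 0.5 for half a second) should work as well.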

Combining Screenshot Capture and Data Extraction

As previously mentioned, the Screenshot step produces objects that extend the RespondedRequest class. Consequently, subsequent steps can access all the response data as if an Http step had been used. Below is an example crawler that captures a screenshot and then extracts the page title from the loaded page in the next step.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\CrawlerExtBrowser\Steps\Screenshot;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$myStorePath = __DIR__ . '/storepath';

$crawler
    ->input('https://www.example.com')
    ->addStep(
        Screenshot::loadAndTake($myStorePath)
            ->addToResult(['url', 'screenshotPath'])
    )
    ->addStep(
        Html::metaData()
            ->only(['title'])
            ->addToResult()
    );

$crawler->runAndDump();
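
With this setup, each crawling result should roughly take the following shape (the values, and in particular the screenshot filename, are made-up placeholders):

// Illustrative shape of one crawling result:
[
    'url' => 'https://www.example.com',
    'screenshotPath' => '/path/to/storepath/some-generated-name.png',
    'title' => 'Example Domain',
]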

Customizing the Request

The step shares functionality with the HTTP step from the crawler package. Therefore, you can also send custom HTTP headers, decide how to handle error responses (using stopOnErrorResponse() or yieldErrorResponses()), and specify certain keys from the input to be used as the URL or for HTTP headers (using useInputKeyAsUrl() and useInputKeyAsHeader() or useInputKeyAsHeaders()).
Please note: it's not possible to instruct the browser to use an HTTP method other than GET, so sending a request body is also not supported.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\CrawlerExtBrowser\Steps\Screenshot;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$myStorePath = __DIR__ . '/storepath';

$crawler
    ->inputs([
        ['link' => 'https://www.example.com', 'someHeaderValue' => '123abc'],
        ['link' => 'https://example.com/error', 'someHeaderValue' => '123abc'],
    ])
    ->addStep(
        Screenshot::loadAndTake($myStorePath, ['x-some-header' => 'value'])
            ->useInputKeyAsUrl('link')
            ->useInputKeyAsHeader('someHeaderValue', 'x-header-value')
            ->yieldErrorResponses()
            ->addToResult(['url', 'status', 'screenshotPath'])
    );

$crawler->runAndDump();
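
If you'd rather have the crawler stop when it receives an error response, use stopOnErrorResponse() instead of yieldErrorResponses() in the step definition.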

Taking Screenshots When Using the Http::crawl() Step

When using the Http::crawl() step, you might also want to capture screenshots of all the loaded pages. However, the Screenshot::loadAndTake() step expects URLs as input and handles loading those URLs, which is redundant since the crawl step already loads the pages.

To address this, the extension package provides the Screenshot::take() step. This step can be added after any Http step that loads a page using the headless browser. It takes an HTTP response (Crwlr\Crawler\Loader\Http\Messages\RespondedRequest object) as input, captures a screenshot of the page still open in the headless browser, and creates a Crwlr\CrawlerExtBrowser\Aggregates\RespondedRequestWithScreenshot object from the input object.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\CrawlerExtBrowser\Steps\Screenshot;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

// Ensure the loader also uses the headless browser in the Http::crawl() step,
// as the Screenshot step requires an open page to capture a screenshot.
$crawler->getLoader()->useHeadlessBrowser();

$crawler
    ->input('https://www.example.com')
    ->addStep(Http::crawl())
    ->addStep(
        Screenshot::take(__DIR__ . '/storepath')
            ->keep(['url', 'screenshotPath'])
    );

$crawler->runAndDump();
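
Keep in mind that Http::crawl() loads every page it discovers, and the subsequent step screenshots each of them. To keep a test run bounded, you can combine this with the crawl step's options from the crawler package; a minimal sketch using its maxDepth() option:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\CrawlerExtBrowser\Steps\Screenshot;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$crawler->getLoader()->useHeadlessBrowser();

$crawler
    ->input('https://www.example.com')
    ->addStep(Http::crawl()->maxDepth(1)) // Stop crawling one level below the start URL.
    ->addStep(
        Screenshot::take(__DIR__ . '/storepath')
            ->keep(['url', 'screenshotPath'])
    );

$crawler->runAndDump();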

Waiting After Page Load Before Taking a Screenshot

There may be cases where you don't want to capture the screenshot immediately after the page has loaded, but would rather wait a certain amount of time first (e.g. because you know that something you want to capture is rendered some time after the initial page load). In such cases, you can use the waitAfterPageLoaded() method.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\CrawlerExtBrowser\Steps\Screenshot;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$myStorePath = __DIR__ . '/storepath';

$crawler
    ->input('https://www.crwlr.software')
    ->addStep(
        Screenshot::loadAndTake($myStorePath)
            ->waitAfterPageLoaded(1.5)
            ->addToResult(['url', 'screenshotPath'])
    );

$crawler->runAndDump();