Documentation for crwlr / crawler (v3.2)

Loaders

Loaders are an essential part of this library, responsible for retrieving resources. By default, the Crwlr\Crawler\HttpCrawler creates an instance of the Crwlr\Crawler\Loader\Http\HttpLoader with default settings and automatically passes it to all loading steps (those using the Crwlr\Crawler\Steps\Loading\LoadingStep trait).

Accessing the Loader

There are several ways to access the loader instance of your crawler, or even provide your own custom loader.

Crawler::getLoader()

When using the HttpCrawler::make() shortcut method to obtain a crawler instance, you can easily access and customize the created loader via the Crawler::getLoader() method.

$crawler = HttpCrawler::make()->withUserAgent('MyCrawler');

$loader = $crawler->getLoader();

// Customize loader settings here.

Inside Custom Crawler Class

If you extend the HttpCrawler, you can override the loader() method, call the parent::loader() method to get the default HttpLoader instance, and then customize it.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        $loader = parent::loader($userAgent, $logger);

        // Customize loader settings here.

        return $loader;
    }

    // Define user agent
}

Alternatively, you can make your crawler use your own custom loader instance.

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends Crawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        return new MyLoader($userAgent, $logger);
    }

    // Define user agent
}

From a Loading Step

When building a custom loading step using the Crwlr\Crawler\Steps\Loading\LoadingStep trait, you can access the loader within the step via the $this->getLoader() method.

For instance, if you want your step to use the headless browser for loading, even though the crawler’s loader is configured to use the (Guzzle) HTTP client, you can switch to the headless browser just for the step invocation and switch back afterward.

use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Steps\Loading\LoadingStep;
use Crwlr\Crawler\Steps\Step;
use Crwlr\Crawler\Steps\StepOutputType;
use Generator;
use GuzzleHttp\Psr7\Request;

class SomeLoadingStep extends Step
{
    /** @use LoadingStep<HttpLoader>  Generic type hint. */
    use LoadingStep;

    public function outputType(): StepOutputType
    {
        return StepOutputType::AssociativeArrayOrObject;
    }

    protected function invoke(mixed $input): Generator
    {
        // If supported, your IDE and static analysis will know that getLoader()
        // returns an instance of the HttpLoader, because of the @use phpdoc at the top of the class.
        // See: https://phpstan.org/blog/generics-in-php-using-phpdocs#class-level-generics

        $previouslyUsedBrowser = $this->getLoader()->usesHeadlessBrowser();

        if (!$previouslyUsedBrowser) { // Switch to using the headless browser.
            $this->getLoader()->useHeadlessBrowser();
        }

        // Load the input URL and yield the response.
        yield $this->getLoader()->load(new Request('GET', $input));

        if (!$previouslyUsedBrowser) { // Switch back to using the (Guzzle) HTTP client.
            $this->getLoader()->useHttpClient();
        }
    }
}

Note: This example only demonstrates how to access the loader within a class using the LoadingStep trait. If you want a step to switch between the headless browser and the HTTP client at the beginning and end of its invocation, take a look at our browser extension package, which implements exactly this functionality for easy reuse.

The HttpLoader

The package currently includes one loader: the Crwlr\Crawler\Loader\Http\HttpLoader. It offers several methods that can be used to customize its behavior.

The following code examples assume that $loader is an instance of the Crwlr\Crawler\Loader\Http\HttpLoader class. To learn how to obtain your crawler’s loader instance, see Accessing the Loader above.

You can customize the behavior of the loader regarding cookies.

use Crwlr\Crawler\Loader\Http\HttpLoader;

/** @var HttpLoader $loader */

// If you want to flush all previously saved cookies.
// This is probably mainly useful inside a custom step.
$loader->flushCookies();

// or

// If you don't want your crawler to use cookies at all.
$loader->dontUseCookies();

Max Redirects

You can set the maximum number of redirects.

use Crwlr\Crawler\Loader\Http\HttpLoader;

/** @var HttpLoader $loader */

$loader->setMaxRedirects(15);

Using a Headless Browser to Load Pages (Execute JavaScript)

You can make the HttpLoader use a headless browser to load pages by calling the useHeadlessBrowser() method. This method utilizes the chrome-php/chrome library under the hood, so you need to have Chrome or Chromium installed on your system.

use Crwlr\Crawler\Loader\Http\HttpLoader;

/** @var HttpLoader $loader */

$loader->useHeadlessBrowser();

If you need to provide the chrome-php browser factory with the name of your Chrome executable, or some customization options, you can use the methods $loader->browser()->setExecutable(), $loader->browser()->setOptions() and $loader->browser()->addOptions():

use Crwlr\Crawler\Loader\Http\HttpLoader;

/** @var HttpLoader $loader */

$loader->useHeadlessBrowser();

$loader->browser()->setExecutable('chromium');

$loader->browser()->setOptions([
    'windowSize' => [1024, 800],
    'enableImages' => false,
]);

// or
$loader->browser()->addOptions(['noSandbox' => true]);

By default, the browser uses the user-agent that you specified for your crawler. To have the browser send its native user-agent instead, call $loader->browser()->useNativeUserAgent().

use Crwlr\Crawler\Loader\Http\HttpLoader;

/** @var HttpLoader $loader */

$loader->useHeadlessBrowser();

$loader->browser()->useNativeUserAgent();

If you want to run some JavaScript code on each new browser page, use $loader->browser()->setPageInitScript():

use Crwlr\Crawler\Loader\Http\HttpLoader;

/** @var HttpLoader $loader */

$loader->useHeadlessBrowser();

$loader->browser()->setPageInitScript('window.foo = \'bar\'');

You can also configure which navigation event the browser waits for before it considers a request finished (the available options are listed in the chrome-php readme), and how long it waits before a timeout error is triggered (the default is 30 seconds).

use Crwlr\Crawler\Loader\Http\HttpLoader;
use HeadlessChromium\Page;

/** @var HttpLoader $loader */

$loader
    ->browser()
    ->waitForNavigationEvent(Page::DOM_CONTENT_LOADED)
    ->setTimeout(60_000); // 60 seconds.

The chrome-php library offers a lot of further functionality, such as taking screenshots, scrolling, and clicking on elements. The post browser navigate hook feature in Http steps lets you interact with a loaded page after browser navigation and before the HTML source code is read. For even more advanced headless browser features, check out our browser extension package.

Loader Events

The abstract Crwlr\Crawler\Loader\Loader class provides methods to register callback functions for specific events, which are triggered by the HttpLoader whenever they occur. The available events are: beforeLoad, onCacheHit, onSuccess, onError and afterLoad. These events can be very helpful, for instance, if you want to track the number of requests sent during your entire crawling procedure and how many of them received successful responses. Here’s how you can hook into these events:

use Crwlr\Crawler\Loader\Http\HttpLoader;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

/** @var HttpLoader $loader */

$loader->beforeLoad(function (RequestInterface $request) {
    // Called before sending a request.
});

$loader->onCacheHit(function (RequestInterface $request, ResponseInterface $response) {
    // Called when a response for the request is found in the cache.
});

$loader->onSuccess(function (RequestInterface $request, ResponseInterface $response) {
    // Called when a successful response is returned.
});

$loader->onError(function (RequestInterface $request, ResponseInterface $response) {
    // Called when an error response is returned.
    // Won't be called when using the loadOrFail() method.
});

$loader->afterLoad(function (RequestInterface $request) {
    // Called after loading a request, regardless of success or error.
    // Won't be called when using the loadOrFail() method.
});
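
As a minimal sketch of the request-tracking use case mentioned above, the following counts requests and successful responses. The LoadStats class and the trackLoadStats() function are hypothetical helpers and not part of the library; only beforeLoad() and onSuccess() come from the loader. Whether responses served from a cache also trigger these events is not covered here, so treat the numbers as approximate if you use a cache.

```php
use Crwlr\Crawler\Loader\Http\HttpLoader;

// Hypothetical helper (not part of the library): holds simple counters.
final class LoadStats
{
    public int $requests = 0;

    public int $successful = 0;

    // Share of requests that received a successful response (0.0 if none were sent).
    public function successRate(): float
    {
        return $this->requests === 0 ? 0.0 : $this->successful / $this->requests;
    }
}

// Hypothetical helper: registers counting callbacks on the given loader.
function trackLoadStats(HttpLoader $loader, LoadStats $stats): void
{
    // The callbacks ignore the arguments passed by the loader,
    // because they only need to count invocations.
    $loader->beforeLoad(function () use ($stats): void {
        $stats->requests++;
    });

    $loader->onSuccess(function () use ($stats): void {
        $stats->successful++;
    });
}
```

Call trackLoadStats($loader, $stats) with a fresh LoadStats instance before running the crawler, then read $stats->requests, $stats->successful, or $stats->successRate() once it has finished.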

Using Proxy Servers

If you want your loader to use proxy servers, you can utilize the HttpLoader::useProxy() and HttpLoader::useRotatingProxies() methods. With rotating proxies, the loader will automatically switch to the next proxy in the provided array for each subsequent request.

use Crwlr\Crawler\Loader\Http\HttpLoader;

/** @var HttpLoader $loader */

$loader->useProxy('http://1.2.3.4:8084');

// or

$loader->useRotatingProxies([
    'http://2.3.4.5:8085',
    'http://3.4.5.6:8086',
    'http://4.5.6.7:8087',
]);
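
To illustrate what rotating means here, the following standalone sketch cycles through a list of proxies, one per request. The RoundRobinProxies class is hypothetical and is not the library's implementation; in particular, wrapping around to the first proxy after the last one is an assumption of this sketch.

```php
// Hypothetical illustration, not part of the library.
final class RoundRobinProxies
{
    private int $position = -1;

    /** @param string[] $proxies */
    public function __construct(private array $proxies)
    {
        if ($proxies === []) {
            throw new InvalidArgumentException('At least one proxy is required.');
        }

        // Re-index so that positions always run from 0 to n - 1.
        $this->proxies = array_values($proxies);
    }

    // Returns the proxy to use for the next request.
    public function next(): string
    {
        // Assumption: after the last proxy, start again with the first one.
        $this->position = ($this->position + 1) % count($this->proxies);

        return $this->proxies[$this->position];
    }
}

$proxies = new RoundRobinProxies([
    'http://2.3.4.5:8085',
    'http://3.4.5.6:8086',
    'http://4.5.6.7:8087',
]);

$first = $proxies->next();  // 'http://2.3.4.5:8085'
$second = $proxies->next(); // 'http://3.4.5.6:8086'
```

When using the HttpLoader, you do not need anything like this yourself; passing the array to useRotatingProxies() is enough.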

Building a Custom Loader

If you want to build a custom loader for your crawler, such as an FTP loader, you can do so by implementing the Crwlr\Crawler\Loader\LoaderInterface or by extending the Crwlr\Crawler\Loader\Loader class, which provides some base functionality.

The following example is untested and may not work as-is; it serves only to illustrate how you can build a custom loader. If you are genuinely interested in an FTP loader, let us know on Twitter, GitHub, or via the contact form.

use Crwlr\Crawler\Loader\Loader;
use Crwlr\Crawler\Logger\CliLogger;
use Crwlr\Crawler\UserAgents\UserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class FtpLoader extends Loader
{
    public function __construct(
        private string $server,
        private string $user,
        private string $password,
        private string $localBasePath,
        UserAgentInterface $userAgent,
        ?LoggerInterface $logger = null,
    ) {
        parent::__construct($userAgent, $logger ?? new CliLogger());
    }

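    public function load(mixed $subject): mixed
    {
        $ftp = ftp_connect($this->server);

        // Bail out early if connecting or logging in fails.
        if ($ftp === false || !ftp_login($ftp, $this->user, $this->password)) {
            $this->logger->error('Failed to connect to ' . $this->server);

            return null;
        }

        // Store the file under its original name inside the local base path.
        $localPath = $this->localBasePath . '/' . basename($subject);

        $loaded = ftp_get($ftp, $localPath, $subject);

        // Close the connection before returning (a yield here would turn load() into a generator).
        ftp_close($ftp);

        if ($loaded) {
            $this->logger->info('Loaded file ' . $subject);

            return $localPath;
        }

        $this->logger->error('Failed to load file ' . $subject);

        return null;
    }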

    public function loadOrFail(mixed $subject): mixed
    {
        // Same as load(), but throw an exception if loading fails.
    }
}

To use it in your crawler, you need to create a custom crawler class:

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyFtpCrawler extends Crawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        // ftp_connect() expects a plain host name, without the ftp:// prefix.
        return new FtpLoader('some.example.com', 'foo', 'bar', '/my/local/path', $userAgent, $logger);
    }

    protected function userAgent(): UserAgentInterface
    {
        return new UserAgent('FtpCrawler');
    }
}

The protected loader() method of a crawler is called only once in the constructor, and the loader it returns is automatically passed on to every step that has an addLoader() method.

Assigning Different Loaders to Specific Steps

If not all steps in the crawling procedure should use the same loader, you can assign a different loader instance to a loading step (a step using the LoadingStep trait) with the withLoader() method. An example:

$crawler = HttpCrawler::make()->withBotUserAgent('MyBot');

$crawler
    ->input('https://www.example.com/foo')
    ->addStep(Http::get())
    ->addStep(Html::each('#list .item')->extract([
        'title' => 'h3',
        'ftp_uri' => 'a.ftp-link',
    ]))
    ->addStep(
        MyCustomFtpLoadingStep::fetch()
            ->useInputKey('ftp_uri')
            ->withLoader(new MyCustomFtpLoader())
    );