Documentation for crwlr / crawler (v1.0)

Attention: You're currently viewing the documentation for v1.0 of the crawler package.
This is not the latest version of the package.

Loaders

Loaders are an essential part of this library. As the name implies, they are in charge of loading resources. The package currently ships with one loader: the HttpLoader. You can also write your own loaders by implementing the LoaderInterface:

use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyLoader implements LoaderInterface
{
    public function __construct(private UserAgentInterface $userAgent, private LoggerInterface $logger)
    {
    }

    public function load(mixed $subject): mixed
    {
        // Load something, in case it fails return null.
    }

    public function loadOrFail(mixed $subject): mixed
    {
        // Load something, in case it fails throw an exception.
    }
}
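
The comments in the two methods describe a contract: load() returns null on failure, loadOrFail() throws. A minimal sketch of that contract, assuming a loader that reads local files, could look like the block below. LocalFileLoader is a hypothetical name and not part of the package. It deliberately does not declare `implements LoaderInterface` and takes no user agent or logger, so it runs without the package installed. A real loader would add both.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical loader that reads local files instead of sending HTTP requests.
 * Assumption: to plug it into a crawler it would additionally have to declare
 * `implements Crwlr\Crawler\Loader\LoaderInterface`.
 */
final class LocalFileLoader
{
    public function load(mixed $subject): mixed
    {
        try {
            return $this->loadOrFail($subject);
        } catch (RuntimeException) {
            // Contract of load(): swallow the failure and return null.
            return null;
        }
    }

    public function loadOrFail(mixed $subject): mixed
    {
        // Contract of loadOrFail(): every failure surfaces as an exception.
        if (!is_string($subject) || !is_file($subject) || !is_readable($subject)) {
            throw new RuntimeException('Cannot read file: ' . var_export($subject, true));
        }

        $content = file_get_contents($subject);

        if ($content === false) {
            throw new RuntimeException('Reading failed: ' . $subject);
        }

        return $content;
    }
}
```

Implementing load() on top of loadOrFail() keeps the actual loading logic in one place, so the two methods cannot drift apart.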

To use MyLoader in your crawler, return it from the loader() method:

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends Crawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface|array
    {
        return new MyLoader($userAgent, $logger);
    }

    // define user agent
}

A loader is added to the crawler via the protected loader() method. The method is called only once, in the constructor of the Crawler class, and the loader it returns is then automatically passed on to every step that has an addLoader method.

HttpLoader

The HttpLoader needs an implementation of the PSR-18 ClientInterface. By default, it uses the Guzzle client, but you can extend the class and use a different implementation if you want.

Sometimes crawling a page requires cookies that the site sends via HTTP response headers. As PSR-18 clients don't persist cookies themselves, the HttpLoader has its own cookie jar. If your crawler should not use cookies, you can deactivate it:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface|array
    {
        $loader = new HttpLoader($userAgent, logger: $logger);

        $loader->dontUseCookies();

        return $loader;
    }

    // define user agent
}

When you build your own loading step and the loader should, at some point, forget all the cookies it has persisted until now, you can access the loader via $this->loader and flush the cookie jar:

$this->loader->flushCookies();

Another thing you can customize is the maximum number of redirects the loader will follow. The default is 10.

$loader->setMaxRedirects(15);

Using a Headless Browser to Load Pages (Execute JavaScript)

It's also possible to make the HttpLoader use a headless browser to load pages by calling the useHeadlessBrowser() method. Under the hood, it then uses the chrome-php/chrome library, so you need to have Chrome or Chromium installed on your system.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface|array
    {
        $loader = new HttpLoader($userAgent, logger: $logger);

        $loader->useHeadlessBrowser();

        return $loader;
    }

    // define user agent
}

If you need to provide the chrome-php browser factory with the name of your Chrome executable or with some customization options, you can use the methods setChromeExecutable(), setHeadlessBrowserOptions() and addHeadlessBrowserOptions():

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface|array
    {
        $loader = new HttpLoader($userAgent, logger: $logger);

        $loader->useHeadlessBrowser();

        $loader->setChromeExecutable('chromium');

        $loader->setHeadlessBrowserOptions([
            'windowSize' => [1024, 800],
            'enableImages' => false,
        ]);

        // or
        $loader->addHeadlessBrowserOptions([
            'noSandbox' => true,
        ]);

        return $loader;
    }

    // define user agent
}

You can also call useHeadlessBrowser() from within a LoadingStep, so only that step uses the browser. In that case, don't forget to call the useHttpClient() method afterwards to revert that setting in the loader.

use Crwlr\Crawler\Steps\Loading\LoadingStep;
use GuzzleHttp\Psr7\Request;

class SomeLoadingStep extends LoadingStep
{
    protected function invoke(mixed $input): Generator
    {
        $this->loader->useHeadlessBrowser();

        yield $this->loader->load(new Request('GET', $input));

        $this->loader->useHttpClient();
    }
}
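
In the step above, the revert to the HTTP client only happens if execution continues after the yield. If loading throws (for example with loadOrFail()), or the generator is not consumed to the end, the loader could stay in headless browser mode. One way to guard against that, sketched here under assumptions, is a small helper built around try/finally. loadWithHeadlessBrowser() is a hypothetical function and not part of the package. It only relies on the useHeadlessBrowser() and useHttpClient() methods described above.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the crawler package).
 *
 * Switches the given loader to the headless browser, runs $load and always
 * switches back to the HTTP client, even when $load throws.
 *
 * @param object   $loader any object with useHeadlessBrowser() and useHttpClient()
 * @param callable $load   receives the loader and returns the loaded result
 */
function loadWithHeadlessBrowser(object $loader, callable $load): mixed
{
    $loader->useHeadlessBrowser();

    try {
        return $load($loader);
    } finally {
        // Runs on normal return and on exceptions alike.
        $loader->useHttpClient();
    }
}
```

Inside invoke() this could be used as `yield loadWithHeadlessBrowser($this->loader, fn ($loader) => $loader->load(new Request('GET', $input)));`, so the setting is already reverted by the time the result is yielded.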

The chrome-php library ships with a lot of further functionality, like scrolling and clicking on elements. The headless browser feature of the HttpLoader is only intended to get the source code after JavaScript was executed in the browser. You can, however, use the chrome-php library yourself in custom steps to use those features.

Loader Events

The abstract Crwlr\Crawler\Loader\Loader class provides methods to register callback functions for specific events, which are called by the HttpLoader whenever they occur. The available events are: beforeLoad, onSuccess, onError and afterLoad. These events can be very helpful, for instance, if you want to track the number of requests sent during your entire crawling procedure and how many of them received successful responses. Here's how you can hook into these events:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface|array
    {
        $loader = new HttpLoader($userAgent, logger: $logger);

        $loader->beforeLoad(function (RequestInterface $request) {
            // Called before sending a request.
        });

        $loader->onSuccess(function (RequestInterface $request, ResponseInterface $response) {
            // Called when a successful response was returned.
        });

        $loader->onError(function (RequestInterface $request, ResponseInterface $response) {
            // Called when an error response was returned.
            // Won't be called when using loadOrFail() method.
        });

        $loader->afterLoad(function (RequestInterface $request) {
            // Called after loading a request, regardless of whether the response was a success or an error.
            // Won't be called when using loadOrFail() method.
        });

        return $loader;
    }

    // define user agent
}
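
The request tracking use case mentioned above can be sketched with a small counter object. LoadStats is a hypothetical class and not part of the package. It only assumes that the loader offers the beforeLoad(), onSuccess() and onError() methods shown above, and it ignores the arguments passed to the callbacks.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the crawler package): counts loader events.
 */
final class LoadStats
{
    public int $requests = 0;
    public int $successes = 0;
    public int $errors = 0;

    /**
     * Registers counting callbacks on any object that offers the
     * beforeLoad(), onSuccess() and onError() hook methods.
     */
    public function attachTo(object $loader): void
    {
        // The callbacks ignore the request/response arguments they receive.
        $loader->beforeLoad(function (): void {
            $this->requests++;
        });

        $loader->onSuccess(function (): void {
            $this->successes++;
        });

        $loader->onError(function (): void {
            $this->errors++;
        });
    }
}
```

In the loader() method this would amount to creating one LoadStats instance, calling `$stats->attachTo($loader);` before returning the loader, and reading the counters once the crawler has finished. As noted in the comments above, onError is not called when loadOrFail() is used, so the error count only covers requests loaded via load().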