Documentation for crwlr / crawler (v0.6)

Loaders

Loaders are an essential part of this library. As the name implies, they are in charge of loading resources. The package currently ships with one loader, the HttpLoader, but you can also write your own: you just have to implement the LoaderInterface.

use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyLoader implements LoaderInterface
{
    public function __construct(private UserAgentInterface $userAgent, private LoggerInterface $logger)
    {
    }

    public function load(mixed $subject): mixed
    {
        // Load something, in case it fails return null.
    }

    public function loadOrFail(mixed $subject): mixed
    {
        // Load something, in case it fails throw an exception.
    }
}

To use it in your crawler, add:

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends Crawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        return new MyLoader($userAgent, $logger);
    }

    // define user agent
}

You add a loader to the crawler via the protected loader() method. That method is called only once, in the constructor of the Crawler class, and the loader it returns is then automatically passed on to every step that has an addLoader method.
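
As a rough illustration of what a custom loader could look like, the following sketch reads local files instead of making HTTP requests. LocalFileLoader is a hypothetical name. So that the snippet runs without the package installed, it declares a local stand-in for LoaderInterface with only the two methods shown above; in a real project you would import Crwlr\Crawler\Loader\LoaderInterface instead and accept the user agent and logger in the constructor, as MyLoader does.

```php
// Stand-in for Crwlr\Crawler\Loader\LoaderInterface (assumption: it mirrors
// only the two methods shown above) so the sketch runs on its own.
interface LoaderInterface
{
    public function load(mixed $subject): mixed;

    public function loadOrFail(mixed $subject): mixed;
}

// Hypothetical loader that reads local files instead of loading URLs.
final class LocalFileLoader implements LoaderInterface
{
    public function load(mixed $subject): mixed
    {
        try {
            return $this->loadOrFail($subject);
        } catch (RuntimeException) {
            // load() reports failure by returning null instead of throwing.
            return null;
        }
    }

    public function loadOrFail(mixed $subject): mixed
    {
        if (!is_string($subject) || !is_file($subject) || !is_readable($subject)) {
            throw new RuntimeException('Cannot read file: ' . var_export($subject, true));
        }

        $content = file_get_contents($subject);

        if ($content === false) {
            throw new RuntimeException('Failed to read file: ' . $subject);
        }

        return $content;
    }
}

// Usage: write a temporary file, load it, then try a path that does not exist.
$loader = new LocalFileLoader();
$path = tempnam(sys_get_temp_dir(), 'crwlr');
file_put_contents($path, '<html>Hello</html>');

$content = $loader->load($path);              // the file's content
$missing = $loader->load($path . '.missing'); // null, because the file is missing

unlink($path);
```

Letting load() delegate to loadOrFail() keeps the actual loading logic in one place, so the two methods differ only in how they report failure.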

HttpLoader

The HttpLoader needs an implementation of the PSR-18 ClientInterface. By default, it uses the Guzzle client, but you can extend the class and use a different implementation if you want.

Sometimes crawling a site requires sending back cookies that it set via HTTP response headers. As PSR-18 clients don't persist cookies themselves, the HttpLoader has its own cookie jar. If your crawler should not use cookies, you can deactivate it:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        $loader = new HttpLoader($userAgent, logger: $logger);

        $loader->dontUseCookies();

        return $loader;
    }

    // define user agent
}

If you build your own loading step and want the loader to forget, at some point, all the cookies it has persisted so far, you can access the loader via $this->loader and flush the cookie jar:

$this->loader->flushCookies();
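
To make the idea of a cookie jar more concrete, here is a heavily simplified sketch of what persisting and flushing cookies means. SimpleCookieJar and its methods are hypothetical names. This is not the HttpLoader's actual implementation, and it ignores cookie attributes such as Path, Expires and Secure.

```php
// Hypothetical, heavily simplified cookie jar for illustration only.
final class SimpleCookieJar
{
    /** @var array<string, array<string, string>> cookies indexed by host, then by name */
    private array $cookies = [];

    /**
     * Stores the name=value pair of each Set-Cookie header line for a host.
     *
     * @param string[] $setCookieHeaders
     */
    public function addFrom(string $host, array $setCookieHeaders): void
    {
        foreach ($setCookieHeaders as $header) {
            // Only the part before the first ";" holds the name and value.
            $pair = explode(';', $header, 2)[0];

            if (!str_contains($pair, '=')) {
                continue;
            }

            [$name, $value] = explode('=', $pair, 2);
            $this->cookies[$host][trim($name)] = trim($value);
        }
    }

    /** Builds the value of the Cookie request header for a host. */
    public function cookieHeaderFor(string $host): string
    {
        $pairs = [];

        foreach ($this->cookies[$host] ?? [] as $name => $value) {
            $pairs[] = $name . '=' . $value;
        }

        return implode('; ', $pairs);
    }

    /** Forgets all cookies persisted so far. */
    public function flush(): void
    {
        $this->cookies = [];
    }
}

$jar = new SimpleCookieJar();
$jar->addFrom('www.example.com', ['session=abc123; Path=/; HttpOnly', 'lang=en']);

$beforeFlush = $jar->cookieHeaderFor('www.example.com'); // "session=abc123; lang=en"

$jar->flush();

$afterFlush = $jar->cookieHeaderFor('www.example.com'); // ""
```

After a flush, the next request to the same host is sent without any cookies, as if the site were being visited for the first time.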

Using a Headless Browser to load pages (Execute JavaScript)

It's also possible to make the HttpLoader use a headless browser to load pages by calling the useHeadlessBrowser() method. Under the hood, it then uses the chrome-php/chrome library, so you need to have Chrome or Chromium installed on your system.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        $loader = new HttpLoader($userAgent, logger: $logger);

        $loader->useHeadlessBrowser();

        return $loader;
    }

    // define user agent
}

If you need to provide the chrome-php browser factory with some customization options, you can use the methods setHeadlessBrowserOptions() and addHeadlessBrowserOptions():

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        $loader = new HttpLoader($userAgent, logger: $logger);

        $loader->useHeadlessBrowser();

        $loader->setHeadlessBrowserOptions([
            'windowSize' => [1024, 800],
            'enableImages' => false,
        ]);

        // or 
        $loader->addHeadlessBrowserOptions([
            'noSandbox' => true,
        ]);

        return $loader;
    }

    // define user agent
}

You can also call useHeadlessBrowser() from within a LoadingStep, so that only that step uses the browser. In that case, don't forget to call the useHttpClient() method afterwards to revert that setting in the loader.

use Crwlr\Crawler\Steps\Loading\LoadingStep;
use GuzzleHttp\Psr7\Request;

class SomeLoadingStep extends LoadingStep
{
    protected function invoke(mixed $input): Generator
    {
        $this->loader->useHeadlessBrowser();

        yield $this->loader->load(new Request('GET', $input));

        $this->loader->useHttpClient();
    }
}
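
One thing to keep in mind with a generator like the one above: code after a yield only runs once the generator is resumed. Whether the crawler always consumes a step's generator completely is not stated here, so a defensive variant is to revert the setting in a finally block, which PHP also runs when a suspended generator is destroyed. The sketch below demonstrates that behavior with a hypothetical stand-in class (FakeLoader) and a plain function (loadWithBrowser) instead of the real HttpLoader and LoadingStep, so it runs without the package or a browser installed.

```php
// Hypothetical stand-in for the loader; it only tracks which mode is active.
final class FakeLoader
{
    public string $mode = 'http';

    public function useHeadlessBrowser(): void
    {
        $this->mode = 'browser';
    }

    public function useHttpClient(): void
    {
        $this->mode = 'http';
    }

    public function load(string $url): string
    {
        return '[' . $this->mode . '] ' . $url;
    }
}

// Same shape as the invoke() method above, with the revert moved to finally.
function loadWithBrowser(FakeLoader $loader, string $url): Generator
{
    $loader->useHeadlessBrowser();

    try {
        yield $loader->load($url);
    } finally {
        // Runs when the generator finishes and also when it is destroyed early.
        $loader->useHttpClient();
    }
}

$loader = new FakeLoader();
$generator = loadWithBrowser($loader, 'https://www.example.com');

$first = $generator->current(); // runs the generator up to the first yield

unset($generator); // destroying the suspended generator triggers the finally block

$modeAfterwards = $loader->mode; // "http"
```

In a real step, the same try/finally structure would wrap the yield inside invoke().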

The chrome-php library ships with a lot of further functionality, like scrolling and clicking on elements. The headless browser feature of the HttpLoader is only intended to get the page source after JavaScript has been executed in the browser, but you can use the chrome-php library yourself in custom steps to access those features.