Loaders
Loaders are an essential part of this library. As the name implies, they are in charge of loading resources. The package ships with two loaders: the HttpLoader and the PoliteHttpLoader. But you can also write your own loaders; you just have to implement the LoaderInterface.
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyLoader implements LoaderInterface
{
    public function __construct(private UserAgentInterface $userAgent, private LoggerInterface $logger)
    {
    }

    public function load(mixed $subject): mixed
    {
        // Load something; in case it fails, return null.
    }

    public function loadOrFail(mixed $subject): mixed
    {
        // Load something; in case it fails, throw an exception.
    }
}
To use it in your crawler, add:
class MyCrawler extends Crawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        return new MyLoader($userAgent, $logger);
    }

    // define user agent
}
The way to add a loader to the crawler is via the protected loader() method. It's called only once in the constructor of the Crawler class, and then it's automatically passed on to every step that has an addLoader() method.
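For illustration, such a step could look roughly like this. This is only a sketch, not the library's actual implementation: the class and property names are made up, and it assumes a custom step extending the library's Step class that the crawler recognizes by its addLoader() method:

use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\Steps\Step;

class MyStepWithLoader extends Step
{
    private LoaderInterface $loader;

    // Because this method exists, the crawler hands its loader
    // to the step automatically when the step is added.
    public function addLoader(LoaderInterface $loader): static
    {
        $this->loader = $loader;

        return $this;
    }

    protected function invoke(mixed $input): \Generator
    {
        yield $this->loader->load($input);
    }
}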
HttpLoader
The HttpLoader needs an implementation of the PSR-18 ClientInterface. By default it uses the Guzzle client, but you can extend the class and use a different implementation if you want.
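As an alternative to extending the class, swapping in another PSR-18 client (here Symfony's Psr18Client) could look something like the following. A sketch under the assumption that the HttpLoader constructor accepts a PSR-18 client as its second parameter; check the class for the exact signature:

use Crwlr\Crawler\Loader\Http\HttpLoader;
use Symfony\Component\HttpClient\Psr18Client;

class MyCrawler extends Crawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        // Assumption: the second constructor argument takes any PSR-18 ClientInterface.
        return new HttpLoader($userAgent, new Psr18Client(), $logger);
    }

    // ...
}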
Sometimes crawling a page requires cookies that a page sends you via HTTP response headers. As PSR-18 clients don't persist cookies themselves, the HttpLoader has its own cookie jar. If your crawler shouldn't use cookies, you can deactivate it:
$loader = new HttpLoader();
$loader->dontUseCookies();
When you build your own loading step and the loader should at some point forget all the cookies it has persisted until then, you can access the loader via $this->loader and flush the cookie jar:
$this->loader->flushCookies();
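In context, a custom loading step doing this could look roughly like the following. A minimal sketch; the class name is made up, and it assumes a step extending the library's LoadingStep class (as in the example further below):

use Crwlr\Crawler\Steps\Loading\LoadingStep;
use GuzzleHttp\Psr7\Request;

class FreshSessionLoadingStep extends LoadingStep
{
    protected function invoke(mixed $input): \Generator
    {
        // Forget all cookies persisted so far, then load with a clean jar.
        $this->loader->flushCookies();

        yield $this->loader->load(new Request('GET', $input));
    }
}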
Using a Headless Browser to Load Pages (Execute JavaScript)
It's also possible to make the HTTP loader classes use a headless browser to load pages by calling the useHeadlessBrowser() method. Under the hood it then uses the chrome-php/chrome library, so you need to have Chrome or Chromium installed on your system.
class MyCrawler extends HttpCrawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        $loader = new PoliteHttpLoader($userAgent, logger: $logger);

        $loader->useHeadlessBrowser();

        return $loader;
    }

    // ...
}
If you need to provide the chrome-php browser factory with some customization options, you can use the methods setHeadlessBrowserOptions() (replaces any previously set options) and addHeadlessBrowserOptions() (adds options on top of the existing ones):
class MyCrawler extends HttpCrawler
{
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        $loader = new PoliteHttpLoader($userAgent, logger: $logger);

        $loader->useHeadlessBrowser();

        $loader->setHeadlessBrowserOptions([
            'windowSize' => [1024, 800],
            'enableImages' => false,
        ]);

        // or
        $loader->addHeadlessBrowserOptions([
            'noSandbox' => true,
        ]);

        return $loader;
    }

    // ...
}
You could also call it from within a LoadingStep, so only that step will use the browser. In that case, don't forget to call the useHttpClient() method to revert that setting in the loader.
use Crwlr\Crawler\Steps\Loading\LoadingStep;
use GuzzleHttp\Psr7\Request;

class SomeLoadingStep extends LoadingStep
{
    protected function invoke(mixed $input): \Generator
    {
        $this->loader->useHeadlessBrowser();

        yield $this->loader->load(new Request('GET', $input));

        $this->loader->useHttpClient();
    }
}
The chrome-php library ships with a lot of further functionality, like scrolling and clicking on elements. This feature of the HTTP loader classes is just intended to get the source code after JavaScript was executed in the browser. But you can use the chrome-php library yourself in custom steps to use those features.
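A rough sketch of what using chrome-php directly could look like; the URL and the #load-more selector are placeholders:

use HeadlessChromium\BrowserFactory;

$browser = (new BrowserFactory())->createBrowser();

try {
    $page = $browser->createPage();

    $page->navigate('https://www.example.com')->waitForNavigation();

    // Run arbitrary JavaScript in the page, e.g. click a (hypothetical) button.
    $page->evaluate('document.querySelector("#load-more")?.click()');

    $html = $page->getHtml();
} finally {
    $browser->close();
}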
PoliteHttpLoader
This loader just extends the HttpLoader and uses two traits:

CheckRobotsTxt
Gets the robots.txt file and sticks to its rules (this also means that this loader only works when you're using a BotUserAgent in your crawler).

WaitPolitely
Waits a little between two requests. The wait time depends on how long the latest request took to be answered. This means that if the server starts to respond more slowly, the crawler also waits longer between requests.
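So a crawler using the PoliteHttpLoader could look something like this. A sketch under the assumption that BotUserAgent takes the bot's name as its first constructor argument:

use Crwlr\Crawler\Loader\Http\PoliteHttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        // Assumption: the first constructor argument is the bot name (product token).
        return new BotUserAgent('MyBot');
    }

    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        return new PoliteHttpLoader($userAgent, logger: $logger);
    }
}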
If you don't want to use a BotUserAgent in your crawler but you would still like to use the polite waiting feature, just make a loader like this:
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\Http\Traits\WaitPolitely;

class MyLoader extends HttpLoader
{
    use WaitPolitely;
}