Documentation for crwlr / crawler (v3.2)

Response Cache

You can add an implementation of the PSR-16 CacheInterface to your crawler's loader to cache loaded responses.

A response cache may be useful in different situations:

  • During development of a crawler, so you don't send (and wait for) the same HTTP requests again and again.
  • With long-running crawlers that load a lot of pages, where it would be very frustrating to start from zero after the crawler fails at some point.

The package ships with one simple implementation that caches responses as files in a directory on your filesystem. You can add a cache to your crawler's loader using the setCache() method.

use Crwlr\Crawler\Cache\FileCache;
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return new BotUserAgent('MyCrawler');
    }

    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        $loader = new HttpLoader($userAgent, logger: $logger);

        $loader->setCache(new FileCache(__DIR__ . '/cachedir'));

        return $loader;
    }
}

The loader then caches every loaded response. If the same resource is requested again, it uses the cached version, as long as it has not expired.

Time to live

By default, responses are cached for one hour, but you can set your own time to live (TTL).

use Crwlr\Crawler\Cache\FileCache;

$cache = new FileCache(__DIR__ . '/cachedir');

// You can provide the time to live in seconds as an integer (86400 = one day)
$cache->ttl(86400);

// or as a DateInterval object
$cache->ttl(new DateInterval('P2D'));
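To compare the two ways of passing a TTL, a DateInterval can be converted to seconds. This is only a sketch using PHP's standard date classes; sketchTtlInSeconds() is a hypothetical helper and not part of the library.

```php
/**
 * Hypothetical helper (not part of crwlr/crawler): converts a TTL given as
 * integer seconds or as a DateInterval to a number of seconds.
 */
function sketchTtlInSeconds(int|DateInterval $ttl): int
{
    if (is_int($ttl)) {
        return $ttl;
    }

    // Add the interval to the Unix epoch (UTC); the resulting timestamp
    // is the length of the interval in seconds.
    return (new DateTimeImmutable('@0'))->add($ttl)->getTimestamp();
}

$oneDay = sketchTtlInSeconds(86400);                    // integer, passed through
$twoDays = sketchTtlInSeconds(new DateInterval('P2D')); // two days, in seconds
```

So the integer example above corresponds to one day and the DateInterval example to two days.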

As the cache key, the library creates a hash from the request method, URL, headers (except for cookie headers) and body. So if you change a detail of the request that could lead to a different response, the response is not taken from the cache.
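The idea behind such a key can be sketched with plain PHP. This is not the library's actual implementation: sketchCacheKey() is a hypothetical function, and the choice of md5 and the way the parts are joined are assumptions made for illustration.

```php
/**
 * Hypothetical helper (not part of crwlr/crawler): builds a key from the
 * request parts listed above, leaving out cookie headers.
 *
 * @param array<string, string> $headers
 */
function sketchCacheKey(string $method, string $url, array $headers, string $body = ''): string
{
    $normalized = [];

    foreach ($headers as $name => $value) {
        $name = strtolower($name);

        // Skip cookie headers, so a changed cookie alone does not
        // produce a different key.
        if ($name === 'cookie') {
            continue;
        }

        $normalized[$name] = $value;
    }

    // Sort by header name, so header order does not influence the key.
    ksort($normalized);

    // Hash algorithm and separator are assumptions of this sketch.
    return md5(strtoupper($method) . ' ' . $url . "\n" . json_encode($normalized) . "\n" . $body);
}

$url = 'https://www.example.com/foo';

$a = sketchCacheKey('GET', $url, ['Accept' => 'text/html', 'Cookie' => 'sid=1']);
$b = sketchCacheKey('GET', $url, ['Accept' => 'text/html', 'Cookie' => 'sid=2']);
$c = sketchCacheKey('GET', $url, ['Accept' => 'application/json']);

var_dump($a === $b); // same key, only the cookie differs
var_dump($a === $c); // different key, the Accept header differs
```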

Compression

If you enable it, the FileCache compresses the responses it caches (using gzencode()), so they need less disk space.

$cache = new FileCache(__DIR__ . '/cachedir');

$cache->useCompression();

Attention: the zlib functions used for compression require the ext-zlib PHP extension. If it's not installed, the library throws a MissingZlibExtensionException.
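To get a feeling for the effect, the following sketch compresses and restores a string with PHP's zlib functions. It runs independently of the library; the sample body and the generic RuntimeException are assumptions of this example.

```php
if (!extension_loaded('zlib')) {
    // The library has its own exception for this case; this sketch
    // uses a generic one.
    throw new RuntimeException('ext-zlib is required for this sketch.');
}

// Made-up, highly repetitive response body.
$body = str_repeat('<p>Hello World</p>', 1000);

$compressed = gzencode($body);     // compress before writing to disk
$restored = gzdecode($compressed); // decompress when reading the entry back

var_dump(strlen($compressed) < strlen($body)); // compressed data is smaller here
var_dump($restored === $body);                 // round trip restores the original
```

How much space is saved depends on the content; repetitive markup like HTML usually compresses well.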

Retrying cached Error Responses

Sometimes a website is down for a minute. With a cache, the loader also caches error responses and uses those cached error responses in subsequent runs. If you want it to retry the requests behind those cached error responses instead, call the retryCachedErrorResponses() method of the HttpLoader class.

use Crwlr\Crawler\Cache\FileCache;
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    // define user agent

    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        $loader = new HttpLoader($userAgent, logger: $logger);

        $loader
            ->setCache(new FileCache(__DIR__ . '/cachedir'))
            ->retryCachedErrorResponses();

        return $loader;
    }
}

You can fine-tune this behavior even further. The retryCachedErrorResponses() method returns an object with the methods only() and except(), which let you limit retries to specific HTTP error status codes or exclude specific codes from being retried.

use Crwlr\Crawler\Cache\FileCache;

$loader
    ->setCache(new FileCache(__DIR__ . '/cachedir'))
    ->retryCachedErrorResponses()
    ->only([400, 404, 500]);

// or

$loader
    ->setCache(new FileCache(__DIR__ . '/cachedir'))
    ->retryCachedErrorResponses()
    ->except([403, 404]);
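The decision these two methods express can be sketched as a plain function. sketchShouldRetry() is hypothetical and not the library's code; it assumes that an "error response" is any response with a status code of 400 or higher.

```php
/**
 * Hypothetical helper (not part of crwlr/crawler): decides whether a cached
 * response should be ignored and the request sent again.
 *
 * @param int[]|null $only   retry only these status codes
 * @param int[]|null $except retry all error status codes but these
 */
function sketchShouldRetry(int $statusCode, ?array $only = null, ?array $except = null): bool
{
    // Assumption: only responses with status >= 400 count as errors.
    if ($statusCode < 400) {
        return false;
    }

    if ($only !== null) {
        return in_array($statusCode, $only, true);
    }

    if ($except !== null) {
        return !in_array($statusCode, $except, true);
    }

    // Neither list given: retry every cached error response.
    return true;
}

var_dump(sketchShouldRetry(500, only: [400, 404, 500])); // listed in only
var_dump(sketchShouldRetry(403, only: [400, 404, 500])); // not listed in only
var_dump(sketchShouldRetry(404, except: [403, 404]));    // listed in except
```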

Cache only certain URLs

When using the response cache, you can also restrict it to cache only the responses for certain URLs. This feature uses the same filter classes as the Step Output Filters:

use Crwlr\Crawler\Cache\FileCache;
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\Steps\Filters\Filter;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return new BotUserAgent('MyCrawler');
    }

    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        $loader = new HttpLoader($userAgent, logger: $logger);

        $loader
            ->setCache(new FileCache(__DIR__ . '/cachedir'))
            ->cacheOnlyWhereUrl(Filter::urlPathStartsWith('/foo'))
            ->cacheOnlyWhereUrl(Filter::urlHost('www.example.com'));

        return $loader;
    }
}
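As a rough illustration of what the two filters in this example check, here is a sketch using only parse_url(). It assumes that a response is cached only when all added filters match, which is an assumption of this sketch; sketchShouldCache() is a hypothetical function, not part of the library.

```php
/**
 * Hypothetical helper (not part of crwlr/crawler): mirrors the two filters
 * above, assuming both have to match for a response to be cached.
 */
function sketchShouldCache(string $url): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    $path = parse_url($url, PHP_URL_PATH);

    return $host === 'www.example.com'
        && is_string($path)
        && str_starts_with($path, '/foo');
}

var_dump(sketchShouldCache('https://www.example.com/foo/bar')); // host and path match
var_dump(sketchShouldCache('https://www.example.com/baz'));     // path does not match
var_dump(sketchShouldCache('https://other.example.com/foo'));   // host does not match
```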