Crwlr Recipes: Using a Crawler for Website Error Detection and Cache Warming
Have you ever deployed your website or web app, only to discover hours later that you’ve introduced bugs or broken links? Or do you clear the cache with every deploy, leaving the first users to experience slow performance? In this guide, you’ll learn how to use a crawler to automatically detect errors and warm the cache, ensuring your site runs smoothly after every deployment.
The Use-Case
This type of crawler is particularly useful for the following two purposes:
Detecting Errors and Broken Links
After deploying a website or web app, issues such as error responses, broken links, or inaccessible pages can sometimes go unnoticed. By crawling your site, you can automatically find and report errors before they impact your users.
Cache Warming
If your deployment process involves clearing the cache to avoid potential side effects or bugs, the first users post-deployment might face slower page loads. Starting a crawler immediately after a successful deployment re-fills the cleared cache by visiting all relevant pages, ensuring users enjoy fast loading pages.
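For pure cache warming, you don't even need to collect any data; simply loading every page is enough to re-fill the cache. A minimal sketch (the bot name and URL are placeholders) could look like this; the full error-detection setup follows below:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = HttpCrawler::make()->withBotUserAgent('MyCacheWarmingCrawler');

$crawler
    ->input('https://www.example.com/') // Replace with your site's start URL.
    ->addStep(Http::crawl()) // Follows the internal links, loading (and thereby re-caching) every page.
    ->runAndTraverse();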
The Code
First, set up a store class that will be called with the result data from all the loaded pages:
use Crwlr\Crawler\Result;
use Crwlr\Crawler\Stores\Store;
class HealthCheckStore extends Store
{
    public function store(Result $result): void
    {
        if ($result->get('status') >= 400) { // Error response
            // Send a notification about the page being down.
        }
    }
}
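How you report an error is entirely up to you. As a simple, hypothetical alternative, the following store appends every broken URL to a log file instead of sending a notification:

use Crwlr\Crawler\Result;
use Crwlr\Crawler\Stores\Store;

class FileHealthCheckStore extends Store
{
    public function store(Result $result): void
    {
        if ($result->get('status') >= 400) {
            // Append the status code and URL of the broken page to a log file.
            file_put_contents(
                __DIR__ . '/health-check-errors.log',
                $result->get('status') . ' ' . $result->get('url') . PHP_EOL,
                FILE_APPEND,
            );
        }
    }
}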
And then define the crawling procedure:
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
$crawler = HttpCrawler::make()->withBotUserAgent('MyHealthCheckCrawler');
$crawler
    // Replace the input URL with the URL of your website that you want to crawl.
    ->input('https://www.crawler-test.com/links/broken_links_internal')
    ->addStep(
        Http::crawl()
            ->yieldErrorResponses()
            ->keep(['url', 'status'])
    )
    ->addStep(
        Html::metaData()
            ->only(['title'])
            ->keep()
    )
    ->setStore(new HealthCheckStore())
    ->runAndTraverse();
A key part of this setup is the call to the yieldErrorResponses() method. Without it, error responses would only be logged and not forwarded to the next step, meaning they would never reach the store.
In this example, the crawler keeps the url, the HTTP response status code, and the title of the loaded HTML pages. Of course, you can customize it to extract any other data needed for your specific health checks. Check out the documentation for the Http and Html steps to learn more about extracting various types of data.
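For instance, assuming you also want to keep each page's meta description for your checks, the second step could be extended like this:

Html::metaData()
    ->only(['title', 'description'])
    ->keep()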
You can also customize the Http::crawl() step further. For example, you can configure it to follow links only to a certain depth, load only a limited number of pages, and more. The previous Crwlr Recipe explored these customizations in detail - check it out here.
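As a quick illustration (a sketch assuming the current method names, so double-check them against the documentation), limiting the crawl depth and the number of loaded pages could look like this:

Http::crawl()
    ->depth(2) // Only follow links up to two levels away from the start URL.
    ->maxOutputs(500) // Stop after 500 loaded pages.
    ->yieldErrorResponses()
    ->keep(['url', 'status'])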
Partial Crawling
Another way to customize the Http::crawl() step is by configuring it to focus only on specific sections of your website. For instance, you can restrict it to crawl only URLs where the path starts with /foo/. To do this, replace the Http::crawl() step in the example above with the following:
Http::crawl()
    ->pathStartsWith('/foo/')
    ->yieldErrorResponses()
    ->keep(['url', 'status'])
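If your site spans multiple subdomains, there is also (assuming the method is available in your version - please verify against the documentation) an option to widen the crawl from the default same-host scope to the whole domain:

Http::crawl()
    ->sameDomain() // Also follow links to other subdomains of the same domain.
    ->yieldErrorResponses()
    ->keep(['url', 'status'])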
For more details on additional ways to restrict the Http::crawl() step, refer to the documentation.
Integration into Your Deployment Process
Finally, to integrate this script into your deployment process, make sure it is triggered immediately after every successful deployment. For added reliability, wrap the script in a try/catch block to handle exceptions and report any errors. Also make sure the script can't run into a timeout, since crawling a large site can take a while.
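As a sketch of what such a wrapper script could look like (file paths, the bot name, and the URL are just example values; HealthCheckStore is the store class defined above and needs to be autoloadable):

<?php

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Loading\Http;

require __DIR__ . '/vendor/autoload.php';

set_time_limit(0); // Prevent a PHP execution time limit from aborting long crawl runs.

try {
    HttpCrawler::make()
        ->withBotUserAgent('MyHealthCheckCrawler')
        ->input('https://www.example.com/') // Replace with your site's start URL.
        ->addStep(
            Http::crawl()
                ->yieldErrorResponses()
                ->keep(['url', 'status'])
        )
        ->setStore(new HealthCheckStore())
        ->runAndTraverse();
} catch (\Throwable $exception) {
    // Report the failed run via your preferred channel (e-mail, Slack, monitoring tool, ...).
    error_log('Health check crawler failed: ' . $exception->getMessage());
}

When the script is run via the PHP CLI there is usually no execution time limit anyway, but the set_time_limit() call doesn't hurt.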
I hope this guide was helpful and makes your deployment workflow smoother and more reliable.