Documentation for crwlr / crawler (v3.2)

Composing Results

There are several methods available to help merge data from different stages of the crawling procedure into the final crawling results. This guide illustrates these methods using the example use case of scraping job ads from websites. Different scenarios are covered to demonstrate how these methods can be applied based on the structure and content of the target website. In all examples, we aim to get at least the properties url, title, location and content for all the jobs.

The Problem

Consider a website with a page that lists multiple jobs (on /jobs). Each entry in the list contains the job title, the location, and a link to a job detail page (like /jobs/accountant-in-vienna-123). We need to follow that link and get the content from the detail page.

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// $crawler is an existing crawler instance, e.g. a Crwlr\Crawler\HttpCrawler.
$crawler
    ->input('https://www.example.com/jobs')
    ->addStep(Http::get())
    ->addStep(
        Html::each('#searchresults .row')
            ->extract([
                'title' => '.jobTitle',
                'url' => Dom::cssSelector('.jobTitle a')->link(),
                'location' => '.jobLocation',
            ]),
    )
    ->addStep(Http::get()->useInputKey('url'))
    ->addStep(
        Html::root()->extract([
            'content' => Dom::cssSelector('#jobAd')->formattedText(),
        ]),
    );

$crawler->runAndDump();

With this example code, the final crawling result objects will only contain the content because, by default, the crawling results are only the outputs of the final step. Therefore, we need a way to add data from the outputs of the second step to the final crawling result objects.

Keeping Output Data with Step::keep()

The Step::keep() method retains specified keys from a step's output data. The retained keys will be present in the final crawling results and can also be used with the useInputKey() method on any step that comes after the one where they were kept. If called without arguments on a step that produces array (or object) outputs with keys, it keeps all output data.

Example

We can change the example code above to:

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/jobs')
    ->addStep(Http::get())
    ->addStep(
        Html::each('#searchresults .row')
            ->extract([
                'title' => '.jobTitle',
                'url' => Dom::cssSelector('.jobTitle a')->link(),
                'location' => '.jobLocation',
            ])
            ->keep(),   // <- The keep() method call, meaning: Keep the output data
                        // from this step and pass it on until the end, to the
                        // final crawling result objects.
    )
    ->addStep(Http::get()->useInputKey('url'))
    ->addStep(
        Html::root()->extract([
            'content' => Dom::cssSelector('#jobAd')->formattedText(),
        ]),
    );

In this case, the final crawling result objects will have the keys title, url, location, and content.

If you want to keep only the title, you can call the keep() method with that key. Replace the second step with:

Html::each('#searchresults .row')
    ->extract([
        'title' => '.jobTitle',
        'url' => Dom::cssSelector('.jobTitle a')->link(),
        'location' => '.jobLocation',
    ])
    ->keep('title') // Calling keep() with a single output property key.

If you want to keep multiple keys, but not the full output data, the keep() method can also be called with an array of keys:

Html::each('#searchresults .row')
    ->extract([
        'title' => '.jobTitle',
        'url' => Dom::cssSelector('.jobTitle a')->link(),
        'location' => '.jobLocation',
    ])
    ->keep(['title', 'location']) // Calling keep() with multiple output property keys.
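
To make the three call styles concrete, here is a simplified model in plain PHP of which part of a single step output each keep() call retains. It is only a sketch of the behavior described above, not the library's implementation; the function name selectKept() and the sample values are made up for illustration.

```php
<?php

/**
 * Hypothetical helper (not part of crwlr/crawler): returns the part of a
 * step output that a keep() call with the given argument would retain.
 *
 * @param array<string, mixed> $output
 * @param string|string[]|null $keys
 * @return array<string, mixed>
 */
function selectKept(array $output, string|array|null $keys = null): array
{
    if ($keys === null) {
        return $output; // keep(): retain all output data
    }

    // keep('title') and keep(['title', 'location']) both end up as a key list.
    return array_intersect_key($output, array_flip((array) $keys));
}

// Sample output of the list page step (invented values).
$output = [
    'title' => 'Accountant',
    'url' => 'https://www.example.com/jobs/accountant-in-vienna-123',
    'location' => 'Vienna',
];

$all = selectKept($output);                                     // like keep()
$titleOnly = selectKept($output, 'title');                      // like keep('title')
$titleAndLocation = selectKept($output, ['title', 'location']); // like keep([...])
```

Note that the url is still passed on to the next step as part of the step's output in all three cases; keep() only decides what is carried through to the final results.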

Assigning Keys to Kept Values with Step::keepAs()

The Step::keepAs() method is used when a step yields scalar value outputs. Because a scalar value has no key of its own, we need to define the key under which the value is added to the result.

Example

Let's change the example above to also take the title and location from the detail page, and only the url from the list page. As the URLs are now the only thing we need from the list pages, we can use the Html::getLinks() step, which is simpler for this use case. However, it yields only scalar values (URL strings).

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/jobs')
    ->addStep(Http::get())
    ->addStep(
        Html::getLinks('#searchresults .row .jobTitle a')
            ->keepAs('url'), // Html::getLinks() produces (multiple) scalar value
                             // outputs (URL strings), so assign the key 'url' to
                             // the kept URL values.
    )
    ->addStep(Http::get())
    ->addStep(
        Html::root()->extract([
            'title' => 'h1 [itemprop=title]',
            'location' => '.location .city',
            'content' => Dom::cssSelector('#jobAd')->formattedText(),
        ]),
    );
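
Why the key is needed becomes clearer when you picture how a kept value ends up in the result. The following plain PHP sketch models that for one job; keepScalarAs() is a hypothetical helper, the values are invented, and the key order of real result objects may differ.

```php
<?php

// Hypothetical model (not library code): a kept scalar value is wrapped
// under the key passed to keepAs(), so it can be combined with the keyed
// outputs of later steps.
function keepScalarAs(string $key, string|int|float|bool $value): array
{
    return [$key => $value];
}

// One URL string, as Html::getLinks() would yield it (sample value).
$kept = keepScalarAs('url', 'https://www.example.com/jobs/accountant-in-vienna-123');

// Output of the last step for the same job (sample values).
$detailOutput = [
    'title' => 'Accountant',
    'location' => 'Vienna',
    'content' => 'We are looking for ...',
];

// Kept data and the final step's output together form the result.
$result = array_merge($kept, $detailOutput);
```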

Keeping Input Data with Step::keepFromInput()

The Step::keepFromInput() method retains specified keys from the input data the step receives, instead of from the outputs it yields. This is helpful if you want to add data from your initial inputs right away, and also for sub crawlers (more on that further below).

Example

This time we can't scrape the job location from anywhere on the pages, but we have several list pages, one per location, so we can add the location from our own initial inputs.

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->inputs([
        ['url' => 'https://www.example.com/jobs/vienna', 'location' => 'Vienna'],
        ['url' => 'https://www.example.com/jobs/london', 'location' => 'London'],
    ])
    ->addStep(
        Http::get()->keepFromInput(['location']), // Keep the key 'location' from this step's inputs.
    )
    ->addStep(
        Html::each('#jobList .jobItem')->extract([
            'title' => 'h4',
            'url' => Dom::cssSelector('a.detail')->link(),
        ])->keep(),
    )
    ->addStep(Http::get()->useInputKey('url'))
    ->addStep(
        Html::root()->extract([
            'content' => Dom::cssSelector('.jobContent')->formattedText(),
        ]),
    );
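
As a rough mental model of the example above, the result for one job is composed from three sources: the key kept from the initial input, the keys kept from the list step's output, and the output of the last step. The plain PHP sketch below uses invented sample values and is not how the library is implemented internally.

```php
<?php

// One of the initial inputs from the example above.
$input = ['url' => 'https://www.example.com/jobs/vienna', 'location' => 'Vienna'];

// keepFromInput(['location']) retains only this key from the step's input;
// the list page URL under 'url' is not kept.
$keptFromInput = array_intersect_key($input, array_flip(['location']));

// Data kept via keep() for one job found on that list page (sample values).
$keptFromListStep = [
    'title' => 'Accountant',
    'url' => 'https://www.example.com/jobs/accountant-in-vienna-123',
];

// Output of the last step for that job (sample value).
$finalStepOutput = ['content' => 'We are looking for ...'];

$result = array_merge($keptFromInput, $keptFromListStep, $finalStepOutput);
```

The url in the result is therefore the job detail URL from the list step, not the list page URL from the initial input.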

Assigning Keys to Kept Input Values with Step::keepInputAs()

The Step::keepInputAs() method is used with scalar value inputs to assign a key to those values.

Example

Now we have a list of job detail links as initial inputs for the crawler. We need to keep those as the job URLs; everything else is scraped from the detail pages.

use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->inputs([
        'https://www.example.com/jobs/1234',
        'https://www.example.com/jobs/1235',
    ])
    ->addStep(
        Http::get()->keepInputAs('url'), // Keep this step's inputs as key 'url'.
    )
    ->addStep(
        Html::schemaOrg()->onlyType('JobPosting')->extract([
            'title',
            'content' => 'description',
            'location' => 'jobLocation.0.address.addressLocality',
        ]),
    );

Nested Result Data with Sub Crawlers

Sub crawlers are a powerful feature that allows you to extract nested data from multiple levels within a website.

The Step::subCrawlerFor() method allows you to start a child crawling procedure for each output the step yields. This sub crawler uses an output property as its initial input and, after running, replaces that property's value with the results of its crawl.

Example

In this example, we want to include the company name and website for each job posting. The company information is located on a separate page linked from the job detail page. Each result should have the following structure:

[
    'url' => '...',
    'title' => '...',
    'location' => '...',
    'company' => [
        'name' => '...',
        'website' => 'https://...',
    ],
]

The company is linked on the job detail page, so the crawler needs to follow that link to retrieve the company data. This is where a sub crawler comes in. You can start a sub crawler from any step by calling the subCrawlerFor() method. In this example, we start it from the Html step that extracts data from the job detail page. The step selects the job content and the link URL to the company detail page. Here’s the step definition:

Html::root()
    ->extract([
        'content' => Dom::cssSelector('.jobContent')->formattedText(),
        'company' => Dom::cssSelector('.jobCompany a.companyDetailLink')->link(),
    ])

This step's output data will look like:

[
    'content' => '...',
    'company' => 'https://www.example.com/companies/example-company-123',
]

You might think a more precise name for the company property would be companyUrl. However, it is intentionally named company because the sub crawler will replace that property's value with its crawling result.

The first argument for the sub crawler is the property name from the output that it uses as its initial input. In this case, it’s the company property, which contains the URL of the company detail page. The second argument is a callback function that receives the sub crawler (a clone of the Crwlr\Crawler\Crawler instance) as an argument. In this callback, we define what the sub crawler does. In our example, it loads the company detail page and retrieves the company data:

Html::root()
    ->extract([
        'content' => Dom::cssSelector('.jobContent')->formattedText(),
        'company' => Dom::cssSelector('.jobCompany a.companyDetailLink')->link(),
    ])
    ->subCrawlerFor('company', function (Crawler $crawler) {
        return $crawler
            ->addStep(Http::get())
            ->addStep(
                Html::root()->extract([
                    'name' => 'h1.company',
                    'website' => Dom::cssSelector('.company-address a')->link(),
                ]),
            );
    })

The outputs of the sub crawler will look like:

[
    'name' => 'Example Company Ltd.',
    'website' => 'https://www.example.com',
]

After the sub crawler completes its task, the parent step's output will look like:

[
    'content' => '...',
    'company' => [
        'name' => 'Example Company Ltd.',
        'website' => 'https://www.example.com',
    ],
]
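
The replacement of the property value can be modelled in a few lines of plain PHP. This is only a sketch of the behavior described above: replaceWithChildResult() is a hypothetical helper, and the child crawl is a stand-in closure that returns fixed sample data instead of loading a page.

```php
<?php

/**
 * Hypothetical model (not library code) of what subCrawlerFor() does to one
 * step output: the value under $key is used as the child crawl's input and
 * is then replaced by the child crawl's result.
 *
 * @param array<string, mixed> $output
 * @param callable(mixed): mixed $childCrawl
 * @return array<string, mixed>
 */
function replaceWithChildResult(array $output, string $key, callable $childCrawl): array
{
    $output[$key] = $childCrawl($output[$key]);

    return $output;
}

// Stand-in for the sub crawler: ignores the URL and returns sample data.
$fakeCompanyCrawl = fn (string $companyUrl): array => [
    'name' => 'Example Company Ltd.',
    'website' => 'https://www.example.com',
];

$stepOutput = [
    'content' => '...',
    'company' => 'https://www.example.com/companies/example-company-123',
];

$finalOutput = replaceWithChildResult($stepOutput, 'company', $fakeCompanyCrawl);
```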

Finally, here is how the entire crawling procedure looks:

use Crwlr\Crawler\Crawler;
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.example.com/jobs')
    ->addStep(Http::get())
    ->addStep(
        Html::each('#jobList .jobItem')->extract([
            'title' => 'h4',
            'url' => Dom::cssSelector('a.detail')->link(),
            'location' => '.location .city',
        ])->keep(),
    )
    ->addStep(Http::get()->useInputKey('url'))
    ->addStep(
        Html::root()
            ->extract([
                'content' => Dom::cssSelector('.jobContent')->formattedText(),
                'company' => Dom::cssSelector('.jobCompany a.companyDetailLink')->link(),
            ])
            ->subCrawlerFor('company', function (Crawler $crawler) {
                return $crawler
                    ->addStep(Http::get())
                    ->addStep(
                        Html::root()->extract([
                            'name' => 'h1.company',
                            'website' => Dom::cssSelector('.company-address a')->link(),
                        ]),
                    );
            }),
    );

Merging Initial Property Value and Sub Crawler Result Data

Sub crawlers are also typical use cases for the Step::keepFromInput() or Step::keepInputAs() methods. For instance, in our example, if you want to include the company detail page URL in the company result data, you can retain it from the initial input that the sub crawler receives.

Here's how you can achieve that:

Html::root()
    ->extract([
        'content' => Dom::cssSelector('.jobContent')->formattedText(),
        'company' => Dom::cssSelector('.jobCompany a.companyDetailLink')->link(),
    ])
    ->subCrawlerFor('company', function (Crawler $crawler) {
        return $crawler
            ->addStep(
                Http::get()
                    ->keepInputAs('companyDetailUrl') // Keep the company detail URL from the initial
                                                      // 'company' property value, which the sub crawler
                                                      // receives as its initial input.
            )
            ->addStep(
                Html::root()->extract([
                    'name' => 'h1.company',
                    'website' => Dom::cssSelector('.company-address a')->link(),
                ]),
            );
    })
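
With this change, the company data in the parent step's output should additionally contain the kept URL. The plain PHP sketch below shows the expected shape using sample values; it is an illustration based on the description above, and the key order in real results may differ.

```php
<?php

// Sample value of the 'company' property before the sub crawler runs.
$companyDetailUrl = 'https://www.example.com/companies/example-company-123';

// Hypothetical model of the sub crawler's result: the input kept via
// keepInputAs('companyDetailUrl') plus the data extracted from the page.
$companyResult = array_merge(
    ['companyDetailUrl' => $companyDetailUrl],
    ['name' => 'Example Company Ltd.', 'website' => 'https://www.example.com'],
);

// The parent step's output after the sub crawler has finished.
$stepOutput = ['content' => '...', 'company' => $companyResult];
```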