Documentation for crwlr / crawler (v2.1)

Refining Outputs

When scraping data from the web, you will often want to clean up the data you've extracted from a website. The Step::refineOutput() method, available on every step, lets you do that.

Refine data using a Closure

use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.some-example-weather.site/vienna/today')
    ->addStep(Http::get())
    ->addStep(
        Html::each('#forecast .weatherForHour')
            ->extract([
                'hour' => '.time',
                'temperature' => '.temp',
                'humidity' => '.humidity',
            ])
            ->refineOutput(function (mixed $outputData, mixed $originalInputData) {
                $outputData['temperature'] = trim(str_replace('°', '', $outputData['temperature']));

                $outputData['humidity'] = trim(str_replace('%', '', $outputData['humidity']));

                return $outputData;
            })
    );

When you pass a closure to the refineOutput() method, it receives the step's output value as its first argument and the original input as its second argument, in case you want to refine the output based on the input value that produced it. In most cases you won't need the input value, so you can simply omit that parameter. The closure must return the updated output value.
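To make the two-argument form concrete, here is a minimal plain-PHP sketch that calls such a closure directly, outside of any crawler. The sample values and the added 'source' key are hypothetical. The sketch also assumes, purely for illustration, that the original input is a plain string; what a step actually receives as input depends on the step before it.

```php
<?php

// The same shape of closure you would pass to refineOutput().
$refine = function (mixed $outputData, mixed $originalInputData): mixed {
    $outputData['temperature'] = trim(str_replace('°', '', $outputData['temperature']));

    // Hypothetical: keep a note of the input that produced this output.
    // Assumes the input is a plain string, which depends on the preceding step.
    if (is_string($originalInputData)) {
        $outputData['source'] = $originalInputData;
    }

    return $outputData;
};

// Hypothetical sample values, standing in for what a step would provide.
$refined = $refine(
    ['hour' => '14:00', 'temperature' => ' 21° ', 'humidity' => '55%'],
    'https://www.some-example-weather.site/vienna/today',
);

var_dump($refined['temperature']); // string(2) "21"
```

Inside a crawler you never call the closure yourself; the step calls it once per output.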

If you only want to change one element of an array output, pass the key as the first argument to the refineOutput() method and the closure as the second. The value passed to the closure is then the value of that array key.

use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler
    ->input('https://www.some-example-weather.site/vienna/today')
    ->addStep(Http::get())
    ->addStep(
        Html::each('#forecast .weatherForHour')
            ->extract([
                'hour' => '.time',
                'temperature' => '.temp',
                'humidity' => '.humidity',
            ])
            ->refineOutput('temperature', fn ($output) => str_replace('°', '', $output))
    );
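Conceptually, the keyed form applies the closure to a single array element and leaves the rest of the output untouched. The following plain-PHP sketch mimics that idea with a hypothetical helper, refineKey(). It is not the library's implementation, and leaving the output unchanged when the key is missing is an assumption of this sketch.

```php
<?php

// Hypothetical helper, only to illustrate the idea of refining one key.
function refineKey(array $output, string $key, callable $refiner): array
{
    // Assumption of this sketch: leave the output unchanged if the key is absent.
    if (array_key_exists($key, $output)) {
        $output[$key] = $refiner($output[$key]);
    }

    return $output;
}

// Hypothetical sample output of one extract() iteration.
$output = ['hour' => '14:00', 'temperature' => '21°', 'humidity' => '55%'];

$refined = refineKey($output, 'temperature', fn ($value) => str_replace('°', '', $value));

var_dump($refined['temperature']); // string(2) "21"
var_dump($refined['humidity']);    // string(3) "55%"
```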

Refiners

The method accepts not only closures, but also instances of Crwlr\Crawler\Steps\Refiners\RefinerInterface. The package ships with a few so-called refiners that you can pass to refineOutput() instead of a closure, which can improve readability considerably.

use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Refiners\StringRefiner;

$crawler
    ->input('https://www.some-example-weather.site/vienna/today')
    ->addStep(Http::get())
    ->addStep(
        Html::each('#forecast .weatherForHour')
            ->extract([
                'hour' => '.time',
                'temperature' => '.temp',
                'humidity' => '.humidity',
            ])
            ->refineOutput('temperature', StringRefiner::replace('°', ''))
    );

Available Refiners

String Refiners

use Crwlr\Crawler\Steps\Refiners\StringRefiner;

StringRefiner::afterFirst('foo');   // Rest of the string after the first occurrence of "foo".

StringRefiner::afterLast('foo');    // Rest of the string after the last occurrence of "foo".

StringRefiner::beforeFirst('foo');  // String before the first occurrence of "foo".

StringRefiner::beforeLast('foo');   // String before the last occurrence of "foo".

// Everything between the first occurrence of "foo" and the next occurrence of "bar" after that "foo".
StringRefiner::betweenFirst('foo', 'bar'); 

// Everything between the last occurrence of "foo" and the next occurrence of "bar" after that "foo".
StringRefiner::betweenLast('foo', 'bar'); 

// Find and replace.
StringRefiner::replace('°', '');
// Can also take arrays of strings, like:
StringRefiner::replace(['foo', 'bar'], ['FOO', 'BAR']);

By the way: all of these string refiners automatically trim the refined string.
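To get a feel for what these refiners return, the following plain-PHP sketch imitates three of them, including the trimming. The functions afterFirst(), beforeLast() and betweenFirst() are hypothetical stand-ins written for this example, not the library's code. Returning the trimmed input when the search string is not found is an assumption of the sketch; the library may behave differently in that case.

```php
<?php

// Hypothetical plain-PHP stand-ins, named after the refiners they imitate.
// Assumption: if the search string is missing, return the trimmed input.

function afterFirst(string $value, string $search): string
{
    $pos = strpos($value, $search);

    return $pos === false ? trim($value) : trim(substr($value, $pos + strlen($search)));
}

function beforeLast(string $value, string $search): string
{
    $pos = strrpos($value, $search);

    return $pos === false ? trim($value) : trim(substr($value, 0, $pos));
}

function betweenFirst(string $value, string $start, string $end): string
{
    $from = strpos($value, $start);

    if ($from === false) {
        return trim($value);
    }

    // Skip past $start, then cut at the next occurrence of $end.
    $from += strlen($start);
    $to = strpos($value, $end, $from);

    return $to === false
        ? trim(substr($value, $from))
        : trim(substr($value, $from, $to - $from));
}

$text = 'Vienna | Today | 21° | 55%';

echo afterFirst($text, '|'), PHP_EOL;              // Today | 21° | 55%
echo beforeLast($text, '|'), PHP_EOL;              // Vienna | Today | 21°
echo betweenFirst($text, 'Today |', '|'), PHP_EOL; // 21°
```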

URL Refiners

use Crwlr\Crawler\Steps\Refiners\UrlRefiner;

UrlRefiner::withScheme('https');          // Sets the scheme to "https".
// E.g. http://example.com  =>  https://example.com

UrlRefiner::withHost('www.example.com');  // Sets the host to "www.example.com".
// E.g. https://example.com  =>  https://www.example.com

UrlRefiner::withPort(1234);               // Sets the port to 1234.
// E.g. https://example.com/foo  =>  https://example.com:1234/foo

UrlRefiner::withoutPort();                // Removes the port.
// E.g. https://example.com:1234/foo  =>  https://example.com/foo

UrlRefiner::withPath('/contact');         // Sets the path to "/contact".
// E.g. https://example.com/foo  =>  https://example.com/contact

UrlRefiner::withQuery('a=b&c=d');         // Sets the query to "a=b&c=d".
// E.g. https://example.com/foo?foo=bar  =>  https://example.com/foo?a=b&c=d

UrlRefiner::withoutQuery();               // Removes the query.
// E.g. https://example.com/foo?foo=bar  =>  https://example.com/foo

UrlRefiner::withFragment('foo');          // Sets the fragment to "foo".
// E.g. https://example.com/home  =>  https://example.com/home#foo

UrlRefiner::withoutFragment();            // Removes the fragment.
// E.g. https://example.com/home#foo  =>  https://example.com/home
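
Each of these refiners swaps or removes exactly one component of the URL. As a rough illustration of that idea, the plain-PHP sketch below does the same with PHP's built-in parse_url(). The helper rebuildUrl() is hypothetical and is not how the library works internally. It assumes an absolute URL with scheme and host, and it ignores user info.

```php
<?php

// Hypothetical helper: override URL components; a null value removes one.
function rebuildUrl(string $url, array $changes): string
{
    $parsed = parse_url($url);

    if (!is_array($parsed)) {
        throw new InvalidArgumentException('Could not parse URL: ' . $url);
    }

    $parts = array_merge($parsed, $changes);

    // Assumption of this sketch: only absolute URLs are supported.
    if (!isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException('URL needs a scheme and a host: ' . $url);
    }

    $result = $parts['scheme'] . '://' . $parts['host'];

    // isset() is false for null, so components set to null are left out.
    if (isset($parts['port'])) {
        $result .= ':' . $parts['port'];
    }

    $result .= $parts['path'] ?? '';

    if (isset($parts['query'])) {
        $result .= '?' . $parts['query'];
    }

    if (isset($parts['fragment'])) {
        $result .= '#' . $parts['fragment'];
    }

    return $result;
}

// Comparable to combining withScheme('https'), withoutPort() and withoutQuery().
echo rebuildUrl(
    'http://example.com:1234/foo?foo=bar#top',
    ['scheme' => 'https', 'port' => null, 'query' => null],
), PHP_EOL; // https://example.com/foo#top

// Comparable to withFragment('foo').
echo rebuildUrl('https://example.com/home', ['fragment' => 'foo']), PHP_EOL; // https://example.com/home#foo
```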