Documentation for crwlr / crawler (v2.0)

Attention: You're currently viewing the documentation for v2.0 of the crawler package.
This is not the latest version of the package.

Json

The Json step has three static methods:

  • Json::all() to extract the whole JSON object
  • Json::get() to cherry-pick properties from the JSON object
  • and Json::each() to extract multiple items from the JSON object

Json::all()

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Json;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = HttpCrawler::make()->withUserAgent('MyCrawler');

$crawler
    ->input('https://www.example.com/json')
    ->addStep(Http::get())
    ->addStep(Json::all());
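
As a rough mental model of what Json::all() passes on to the next step, the sketch below decodes a JSON response body into a nested PHP array with plain json_decode(). The decodeAll() helper and the $body string are assumptions made up for illustration; they are not part of the crawler package, and the real step may behave differently in details such as error handling.

```php
<?php

// Hypothetical helper, not part of crwlr/crawler: decodes a JSON body
// into a nested associative array and throws a JsonException on invalid
// JSON. It expects a JSON object or array at the top level.
function decodeAll(string $body): array
{
    return json_decode($body, true, 512, JSON_THROW_ON_ERROR);
}

// Assumed example response body.
$body = '{"data": {"something": "yolo", "target": {"foo": "Lorem ipsum"}}}';

$all = decodeAll($body);

// The whole document is available as one nested array.
var_dump($all['data']['target']['foo']); // string(11) "Lorem ipsum"
```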

Json::get()

The Json::get() method works much like the extract method of the Html and Xml steps. Thanks to adbario/php-dot-notation, you can address nested properties in a JSON document with dot-separated paths. Assume the URL https://www.example.com/json responds with the following JSON:

{
    "data": {
        "something": "yolo",
        "target": {
            "foo": "Lorem ipsum",
            "bar": "dolor sit",
            "array": [
                { "baz": "zero" },
                { "baz": "one" },
                { "baz": "two" }
            ]
        }
    }
}

Cherry-pick your desired properties like this:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Json;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = HttpCrawler::make()->withUserAgent('MyCrawler');

$crawler
    ->input('https://www.example.com/json')
    ->addStep(Http::get())
    ->addStep(
        Json::get([
            'foo' => 'data.target.foo',
            'bar' => 'data.target.array.1.baz',
        ])
    );

The output of the Json step is then:

array(2) {
  ["foo"]=>
  string(11) "Lorem ipsum"
  ["bar"]=>
  string(3) "one"
}
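
To make the dot-notation paths less magical, here is a minimal sketch in plain PHP (8.0 or newer) of how a path such as data.target.array.1.baz can be resolved against the decoded JSON. The dotGet() and pick() helpers are hypothetical and written only for this illustration; the package itself relies on adbario/php-dot-notation, whose implementation may differ.

```php
<?php

// Hypothetical helper for illustration only: walks a nested array along
// a dot-separated path and returns null when a segment is missing.
function dotGet(array $data, string $path): mixed
{
    $current = $data;

    foreach (explode('.', $path) as $segment) {
        if (!is_array($current) || !array_key_exists($segment, $current)) {
            return null;
        }

        // Numeric segments such as "1" address list indexes.
        $current = $current[$segment];
    }

    return $current;
}

// Hypothetical helper: applies a mapping of output key => dot path.
function pick(array $data, array $mapping): array
{
    return array_map(fn (string $path) => dotGet($data, $path), $mapping);
}

$body = <<<'JSON'
{
    "data": {
        "something": "yolo",
        "target": {
            "foo": "Lorem ipsum",
            "bar": "dolor sit",
            "array": [
                { "baz": "zero" },
                { "baz": "one" },
                { "baz": "two" }
            ]
        }
    }
}
JSON;

$output = pick(json_decode($body, true), [
    'foo' => 'data.target.foo',
    'bar' => 'data.target.array.1.baz',
]);

var_dump($output);
```

Each path segment selects either a property name or a zero-based array index, which is why array.1.baz resolves to the second element's baz value.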

Json::each()

You can also extract multiple items from an array in the JSON object by using the each() method. Let's say the JSON looks like this:

{
    "list": {
        "people": [
            { "name": "Hans Zimmer", "age": { "years": 66 }, "home": "US" },
            { "name": "John Williams", "age": { "years": 92 }, "home": "US" },
            { "name": "Alan Silvestri", "age": { "years": 73 }, "home": "US" }
        ]
    }
}

You can get the names and ages like this:

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Json;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = HttpCrawler::make()->withUserAgent('MyCrawler');

$crawler
    ->input('https://www.example.com/json')
    ->addStep(Http::get())
    ->addStep(
        Json::each(
            'list.people',
            // Provide the data mapping as the second argument to the each() method.
            [
                'name' => 'name',
                'age' => 'age.years',
            ]
        )
    );

This yields 3 separate outputs:

array(2) {
  ["name"]=>
  string(11) "Hans Zimmer"
  ["age"]=>
  int(66)
}
array(2) {
  ["name"]=>
  string(13) "John Williams"
  ["age"]=>
  int(92)
}
array(2) {
  ["name"]=>
  string(14) "Alan Silvestri"
  ["age"]=>
  int(73)
}
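
The idea behind each() can be sketched in plain PHP (8.0 or newer): resolve the first argument to a list, then apply the mapping to every item, with the mapping paths relative to that item. The valueAt() and eachItem() functions below are hypothetical stand-ins written for this illustration, not the package's implementation.

```php
<?php

// Hypothetical helper (not part of crwlr/crawler): resolves a
// dot-separated path in a nested array, or returns null if it is missing.
function valueAt(array $data, string $path): mixed
{
    $current = $data;

    foreach (explode('.', $path) as $segment) {
        if (!is_array($current) || !array_key_exists($segment, $current)) {
            return null;
        }

        $current = $current[$segment];
    }

    return $current;
}

// Hypothetical stand-in for the idea behind Json::each(): find the list
// at $basePath, then yield one mapped array per list item.
function eachItem(array $data, string $basePath, array $mapping): Generator
{
    $items = valueAt($data, $basePath);

    if (!is_array($items)) {
        return; // nothing to iterate over
    }

    foreach ($items as $item) {
        if (!is_array($item)) {
            continue; // this sketch skips scalar list entries
        }

        // Paths in $mapping are relative to the current item.
        yield array_map(fn (string $path) => valueAt($item, $path), $mapping);
    }
}

$body = <<<'JSON'
{
    "list": {
        "people": [
            { "name": "Hans Zimmer", "age": { "years": 66 }, "home": "US" },
            { "name": "John Williams", "age": { "years": 92 }, "home": "US" },
            { "name": "Alan Silvestri", "age": { "years": 73 }, "home": "US" }
        ]
    }
}
JSON;

$mapping = ['name' => 'name', 'age' => 'age.years'];

$people = iterator_to_array(
    eachItem(json_decode($body, true), 'list.people', $mapping),
    false
);

var_dump($people);
```

A generator is used here so that every item is produced as its own value, mirroring the separate outputs shown above.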