HTML Steps

There are two kinds of steps available via static methods of the Html class: steps that get links (URLs) from HTML documents, and steps that select data/text, either from HTML elements via CSS selectors (or XPath queries), or from meta tags or schema.org objects in script blocks.

Getting (absolute) Links

These steps can only be used with an instance of RespondedRequest as input, so immediately after an HTTP loading step. The reason is that the step needs to know the base URL of the document in order to resolve relative links in the document to absolute ones.
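To illustrate why the base URL is required, here is a minimal plain-PHP sketch of resolving a link against the URL of the document it was found in. The function name toAbsoluteUrlSketch() is hypothetical and this is not the library's implementation: it deliberately ignores "../" segments, query-only references and <base> tags.

```php
<?php
// Simplified illustration only, not how the library resolves URLs.
function toAbsoluteUrlSketch(string $link, string $baseUrl): string
{
    // The link already has a scheme (https:, mailto:, ...): nothing to resolve.
    if (parse_url($link, PHP_URL_SCHEME) !== null) {
        return $link;
    }

    $base = parse_url($baseUrl);
    $origin = $base['scheme'] . '://' . $base['host']
        . (isset($base['port']) ? ':' . $base['port'] : '');

    // Protocol-relative link: reuse only the scheme of the base URL.
    if (str_starts_with($link, '//')) {
        return $base['scheme'] . ':' . $link;
    }

    // Root-relative link: reuse scheme, host and port.
    if (str_starts_with($link, '/')) {
        return $origin . $link;
    }

    // Relative path: replace everything after the last "/" of the base path.
    $path = $base['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir . $link;
}
```

Without the base URL, a link like "baz" or "/bar" cannot be turned into something a loading step could request.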

There are two methods: you can get either one link or all links (optionally matching a CSS selector).

`Html::getLink()`

It takes the first link in the document or, if you pass the optional CSS selector, the first link matching that selector.

use Crwlr\Crawler\Steps\Html;

Html::getLink();

Html::getLink('#listing #nextPage');

`Html::getLinks()`

Works exactly the same way, but gets you all matching links as separate outputs.

use Crwlr\Crawler\Steps\Html;

Html::getLinks();

Html::getLinks('.matchingLink');

In both methods, if your CSS selector matches an element that is not a link (<a>) element, it is ignored.

Both steps provide the following chainable methods to filter:

use Crwlr\Crawler\Steps\Html;

// Only links to URLs on the same domain.
Html::getLinks()->onSameDomain();

// Only links to URLs not on the same domain.
Html::getLinks()->notOnSameDomain();

// Only links to URLs on (a) certain domain(s).
Html::getLinks()->onDomain('example.com');

Html::getLinks()->onDomain(['example.com', 'crwl.io']);

// Only links to URLs on the same host (includes subdomain).
Html::getLinks()->onSameHost();

// Only links to URLs not on the same host.
Html::getLinks()->notOnSameHost();

// Only links to URLs on (a) certain host(s).
Html::getLinks()->onHost('blog.example.com');

Html::getLinks()->onHost(['blog.example.com', 'www.crwl.io']);
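The difference between the domain and host filters can be sketched in plain PHP. The helpers hostOf() and naiveDomainOf() are hypothetical names used only for illustration. The "last two labels" rule is a simplifying assumption that fails for suffixes such as co.uk, and says nothing about how the library itself determines the domain.

```php
<?php
// Host: the full hostname, including any subdomain.
function hostOf(string $url): string
{
    return strtolower((string) parse_url($url, PHP_URL_HOST));
}

// Naive "domain": the last two labels of the host (assumption for illustration).
function naiveDomainOf(string $url): string
{
    return implode('.', array_slice(explode('.', hostOf($url)), -2));
}

$page = 'https://www.example.com/listing';
$link = 'https://blog.example.com/post-1';

// Same domain (example.com), but a different host (www vs. blog).
$sameDomain = naiveDomainOf($page) === naiveDomainOf($link); // true
$sameHost = hostOf($page) === hostOf($link);                 // false
```

So a link from www.example.com to blog.example.com would pass a same-domain filter, but not a same-host filter.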

Get Links without Fragment Part

If a website uses links containing a fragment part (like example.com/path#fragment), you might want to drop that part of the links, because in most cases servers respond with the same content for example.com/path#fragment and example.com/path#otherfragment. So, to avoid loading the same resource multiple times, you can get all links without the fragment part by calling the withoutFragment() method of the steps.

use Crwlr\Crawler\Steps\Html;

Html::getLink()->withoutFragment();

// or

Html::getLinks()->withoutFragment();
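The effect can be sketched in plain PHP. stripFragment() is a hypothetical helper and not part of the library: once the fragment is removed, links that differ only in their fragment become identical strings, which makes it possible to recognize them as duplicates.

```php
<?php
// Drop everything from the first "#" onwards.
function stripFragment(string $url): string
{
    return explode('#', $url, 2)[0];
}

$links = [
    'https://example.com/path#fragment',
    'https://example.com/path#otherfragment',
    'https://example.com/other',
];

// After stripping fragments, the first two links collapse into one URL.
$unique = array_values(array_unique(array_map('stripFragment', $links)));
// ['https://example.com/path', 'https://example.com/other']
```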

Extracting Data

The main method to select data is extract(), but you always have to use it in combination with one of root(), each(), first() or last().

use Crwlr\Crawler\Steps\Html;

Html::root()->extract('h1');

Html::root()->extract(['title' => 'h1', 'date' => '#main .date']);

Html::each('#listing .item')->extract(['title' => 'h1', 'date' => '#main .date']);

Html::first('#listing .item')->extract(['title' => 'h1', 'date' => '#main .date']);

Html::last('#listing .item')->extract(['title' => 'h1', 'date' => '#main .date']);

root() extracts data from the root of the document. each(), first() and last() extract data from a list of similar items, with the selectors passed to extract() being applied within the matched item. each() is the only one that yields multiple outputs, one per matched item.

The extract method can be used with a single selector or an array of selectors with keys to name the data properties being extracted.
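The difference between each, first and last can be illustrated with PHP's built-in DOM extension. This is only a conceptual sketch with made-up markup, using XPath approximations of the CSS selectors. It is not how the library works internally.

```php
<?php
$html = <<<'HTML'
<div id="listing">
  <div class="item"><h1>First</h1><span class="date">2024-01-01</span></div>
  <div class="item"><h1>Second</h1><span class="date">2024-01-02</span></div>
  <div class="item"><h1>Third</h1><span class="date">2024-01-03</span></div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Rough XPath equivalent of the CSS selector "#listing .item".
$items = $xpath->query('//*[@id="listing"]//*[@class="item"]');

// Extract the mapped properties relative to a single item node.
$extract = fn (DOMNode $item): array => [
    'title' => $xpath->evaluate('string(.//h1)', $item),
    'date' => $xpath->evaluate('string(.//*[@class="date"])', $item),
];

$each = array_map($extract, iterator_to_array($items)); // one output per item
$first = $extract($items->item(0));                      // a single output
$last = $extract($items->item($items->length - 1));      // a single output
```

Here $each holds three arrays, while $first and $last each hold one array with the keys title and date.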

Nesting Extracted Data

If you use the extract() method with a mapping array, you can also use another Html step as value to achieve nesting.

use Crwlr\Crawler\Steps\Html;

Html::root()
    ->extract([
        'title' => '#event h2',
        'date' => '#event .date',
        'talks' => Html::each('#event #talks .item')->extract([
            'title' => 'h3',
            'speaker' => '.speaker .name',
        ])
    ]);

Accessing other Node Values besides Text

By default, the CSS selectors return the text of the selected node. But of course you can also get other values:

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;

Html::last('#listing .item')->extract([
    'default' => Dom::cssSelector('.default')->text(),
    'formatted' => Dom::cssSelector('.formatted')->formattedText(),
    'foo' => Dom::cssSelector('.foo')->innerText(),
    'bar' => Dom::cssSelector('.bar')->html(),
    'baz' => Dom::cssSelector('.baz')->outerHtml(),
    'test' => Dom::cssSelector('.test')->attribute('data-test'),
]);

text
You don't have to use this explicitly, it's the default when you only provide the selector as string. It gets the text inside the node including children.

formattedText
If you want to scrape longer text, something like an article, you can utilize this to receive formatted plain text. This uses the crwlr/html-2-text package under the hood. If you want to use your customized converter, pass it to the method like this: Dom::cssSelector('.formatted')->formattedText($myCustomHtml2TextConverter). Read more about this in the crwlr/html-2-text docs.

innerText
Gets only the text directly inside the node. Excludes text from child nodes.

html
Gets the html source inside the selected element.

outerHtml
Gets the html of the selected element including the element itself.

attribute(x)
Gets the value inside attribute x of the selected element.
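To make the differences between these values concrete, here is a sketch using PHP's built-in DOM extension on a made-up snippet. It mirrors the descriptions above, but it is an approximation: the library's exact output (for example regarding whitespace trimming) may differ.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<div id="foo" data-test="123">Hello <b>big</b> world</div>');
$node = $dom->getElementsByTagName('div')->item(0);

// text: all text inside the node, including text of child elements.
$text = $node->textContent; // "Hello big world"

// innerText: only text nodes that are direct children of the node.
$innerText = '';
foreach ($node->childNodes as $child) {
    if ($child instanceof DOMText) {
        $innerText .= $child->nodeValue;
    }
}
// "Hello  world" (two spaces, because the <b> element is skipped)

// html: the markup inside the element.
$html = '';
foreach ($node->childNodes as $child) {
    $html .= $dom->saveHTML($child);
}

// outerHtml: the markup including the element itself.
$outerHtml = $dom->saveHTML($node);

// attribute('data-test'): the value of a single attribute.
$attribute = $node->getAttribute('data-test'); // "123"
```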

Getting first, last, nth, even or odd element(s)

CSS has selectors like :first-child, :nth-child(n), and so on. But they are easily misunderstood. For example: #content a:first-child is not just the first link inside the element with id="content", but: the first element inside the id="content" element only if it's a link. So, when the first child element inside that element is for example a <span>, the selector won't match anything.

To just get the first link inside the id="content" element, you can use the first() method of the object returned from Dom::cssSelector():

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;

Html::root()->extract([
    'firstLink' => Dom::cssSelector('#content a')->first(),
]);

Just like first() there are also: last(), nth(), even() and odd().
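The :first-child pitfall described above can be reproduced with PHP's built-in DOM extension. The markup is made up and the XPath expressions are rough equivalents of the CSS selectors, shown only to illustrate the difference.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML(
    '<div id="content"><span>Intro</span><a href="/one">One</a><a href="/two">Two</a></div>'
);
$xpath = new DOMXPath($dom);

// Rough equivalent of "#content a:first-child":
// an <a> element that has no preceding element sibling.
$firstChildLinks = $xpath->query('//*[@id="content"]//a[not(preceding-sibling::*)]');

// The first of all links inside #content, which is what first() is meant for.
$firstLink = $xpath->query('(//*[@id="content"]//a)[1]');

$matches = $firstChildLinks->length;               // 0, the first child is a <span>
$href = $firstLink->item(0)->getAttribute('href'); // "/one"
```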

Getting Absolute Links (URLs) in Extracted Data

The Html::getLink() and Html::getLinks() steps are the easiest way to get only the (absolute) URLs to follow, in order to then extract data from the pages behind those links. But what if you want to get absolute links within the data you're extracting from a page? The DomQuery class (the abstract base class behind Dom::cssSelector() and Dom::xPath()) has a link() method that gets the absolute URL behind a selected link element:

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/foo')
    ->addStep(Http::get())
    ->addStep(
        Html::each('#listing .row')
            ->extract([
                'title' => 'a.title',
                'url' => Dom::cssSelector('a.title')->link(),
            ])
    );

For this to work, the step immediately before the step that extracts the data needs to be an HTTP loading step.

Just like the Html::getLink() and Html::getLinks() steps, the DomQuery class also has a withoutFragment() method, so you can do:

Dom::cssSelector('a.title')->link()->withoutFragment()

You may also want to get an absolute URL from something other than an HTML link element, for example the src URL of an image. For this case there is the DomQuery::toAbsoluteUrl() method.

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/foo')
    ->addStep(Http::get())
    ->addStep(
        Html::each('#listing .row')
            ->extract([
                'title' => 'a.title',
                'image' => Dom::cssSelector('img.thumbnail')->attribute('src')->toAbsoluteUrl(),
            ])
    );

Using XPath instead of CSS Selectors

The Xml and Html steps both have the same base class (Dom), which uses the Symfony DomCrawler component behind the scenes to extract data. By default, Html steps use CSS selectors and Xml steps use XPath queries. But if you want to, you can also use XPath for Html:

use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;

Html::each(Dom::xPath('//div[@id=\'bookstore\']/div[@class=\'book\']'))
    ->extract([
        'title' => Dom::xPath('//h3[@class=\'title\']'),
        'author' => Dom::xPath('//*[@class=\'author\']'),
        'year' => Dom::xPath('//span[@class=\'year\']'),
    ]);

Extracting Metadata

There is another step that provides a convenient way to extract metadata from HTML documents:

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Html;

$crawler->input('https://www.crwlr.software/')
    ->addStep(Http::get())
    ->addStep(
        Html::metaData()
            ->only(['title', 'description', 'og:image'])
    );

By default, the Html::metaData() step gives you all the metadata from the <head> of the HTML document (all <meta> tags having a name or property attribute, and also the content of the <title> tag). Using only() you can choose which ones you want to get.
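As a rough picture of what that description means, the following plain-PHP sketch collects the same kind of data with the built-in DOM extension. The markup is made up, and the code is an approximation of the behaviour described above, not the step's actual implementation.

```php
<?php
$html = '<html><head><title>Example</title>'
    . '<meta name="description" content="An example page">'
    . '<meta property="og:image" content="https://www.example.com/img.png">'
    . '<meta charset="utf-8">'
    . '</head><body></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Start with the content of the <title> tag.
$meta = ['title' => $dom->getElementsByTagName('title')->item(0)?->textContent];

// Add every <meta> tag that has a name or property attribute.
foreach ($dom->getElementsByTagName('meta') as $tag) {
    $key = $tag->getAttribute('name') ?: $tag->getAttribute('property');

    if ($key !== '') {
        $meta[$key] = $tag->getAttribute('content');
    }
}

// Comparable to only(['title', 'og:image']): keep just the requested keys.
$only = array_intersect_key($meta, array_flip(['title', 'og:image']));
```

The <meta charset> tag is skipped because it has neither a name nor a property attribute.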

Extracting schema.org (JSON-LD) structured data

schema.org is a great standard for website-owners to provide machine-readable structured data hidden in the source of HTML documents. The Html::schemaOrg() step allows you to easily extract such data.

use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Html;

$crawler->input('https://www.example.com/foo')
    ->addStep(Http::get())
    ->addStep(
        Html::schemaOrg()
            ->onlyType('JobPosting')
            ->extract([
                'title',
                'description',
                'company' => 'hiringOrganization.name',
            ])
    );

This configuration will find only schema.org objects of type JobPosting and extract only title, description and hiringOrganization.name. As you can see, you can use dot notation (just like with the JSON step) to extract properties, and also map them to more reasonable property names in the step's output (hiringOrganization.name becomes company).

Under the hood, this step is using spatie's schema-org package. So by default, the step will return instances of the objects from this library. You can call the toArray() method on the step to get array output instead. Also, when using extract() you'll automatically get array output.
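For readers unfamiliar with JSON-LD, the following plain-PHP sketch shows the kind of work this step takes off your hands: finding the script blocks, keeping only one type, and resolving dot-notation paths. The markup and the getByDotPath() helper are hypothetical, and the sketch does not use the spatie package or reflect the step's internals.

```php
<?php
$html = <<<'HTML'
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "JobPosting", "title": "PHP Developer",
 "description": "Build crawlers.", "hiringOrganization": {"@type": "Organization", "name": "Example Inc."}}
</script>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Organization", "name": "Example Inc."}
</script>
</head><body></body></html>
HTML;

// Hypothetical helper: resolve a dot-notation path in a nested array.
function getByDotPath(array $data, string $path): mixed
{
    $current = $data;

    foreach (explode('.', $path) as $key) {
        if (!is_array($current) || !array_key_exists($key, $current)) {
            return null;
        }
        $current = $current[$key];
    }

    return $current;
}

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$results = [];

foreach ($xpath->query('//script[@type="application/ld+json"]') as $script) {
    $object = json_decode($script->textContent, true);

    // Comparable to onlyType('JobPosting'): skip objects of any other type.
    if (!is_array($object) || ($object['@type'] ?? null) !== 'JobPosting') {
        continue;
    }

    $results[] = [
        'title' => getByDotPath($object, 'title'),
        'description' => getByDotPath($object, 'description'),
        'company' => getByDotPath($object, 'hiringOrganization.name'),
    ];
}
```

Only the JobPosting object ends up in $results, with hiringOrganization.name mapped to the company key.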

Documentation for crwlr / crawler (v2.1)