HTML Steps
The Html class provides two kinds of steps via static methods: steps that get links (URLs) from HTML documents, and steps that select data/text via CSS selectors (or XPath queries).
Getting (absolute) Links
This can only be used with an instance of RespondedRequest as input, so immediately after an HTTP loading step. The reason is that it needs to know the base URL of the document to resolve relative links in the document to absolute ones.
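To illustrate why the base URL is needed, here is a minimal sketch in plain PHP (8.0 or later). The resolveLink() function is a hypothetical helper written for this illustration, not part of the library, and it is deliberately simplified: it does not handle "../" segments, query-only links or fragments.

```php
<?php

// Hypothetical helper for illustration only; not the library's implementation.
// Resolves a link found in a document against the URL the document was loaded from.
function resolveLink(string $link, string $baseUrl): string
{
    // Links that already have a scheme (https:, mailto:, ...) are kept as they are.
    if (parse_url($link, PHP_URL_SCHEME) !== null) {
        return $link;
    }

    $base = parse_url($baseUrl);
    $port = isset($base['port']) ? ':' . $base['port'] : '';
    $origin = $base['scheme'] . '://' . $base['host'] . $port;

    // Protocol-relative link, e.g. "//cdn.example.com/x": reuse the base scheme.
    if (str_starts_with($link, '//')) {
        return $base['scheme'] . ':' . $link;
    }

    // Root-relative link, e.g. "/bar": append to scheme and host of the base URL.
    if (str_starts_with($link, '/')) {
        return $origin . $link;
    }

    // Any other relative link is resolved against the directory of the base path.
    $path = $base['path'] ?? '/';
    $directory = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $directory . $link;
}

echo resolveLink('/bar', 'https://www.example.com/foo/page.html'), PHP_EOL;
// https://www.example.com/bar
echo resolveLink('next.html', 'https://www.example.com/foo/page.html'), PHP_EOL;
// https://www.example.com/foo/next.html
```

Without the second argument, neither "/bar" nor "next.html" could be turned into a URL that can be loaded, which is why the link steps need the response of a loading step as input.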
There are two methods: you can get either the first link or all links (optionally only those matching a CSS selector).
Html::getLink()
It gets the first link in the document, or the first link matching the CSS selector if you provide one (the selector is optional).
Html::getLink();
Html::getLink('#listing #nextPage');
Html::getLinks()
Works exactly the same, but yields all matching links as separate outputs.
Html::getLinks();
Html::getLinks('.matchingLink');
In both methods, if your CSS selector matches an element that is not a link (<a>) element, it is ignored.
Both steps provide the following chainable methods to filter:
// Only links to URLs on the same domain.
Html::getLinks()->onSameDomain();
// Only links to URLs not on the same domain.
Html::getLinks()->notOnSameDomain();
// Only links to URLs on (a) certain domain(s).
Html::getLinks()->onDomain('example.com');
Html::getLinks()->onDomain(['example.com', 'crwl.io']);
// Only links to URLs on the same host (includes subdomain).
Html::getLinks()->onSameHost();
// Only links to URLs not on the same host.
Html::getLinks()->notOnSameHost();
// Only links to URLs on (a) certain host(s).
Html::getLinks()->onHost('blog.example.com');
Html::getLinks()->onHost(['blog.example.com', 'www.crwl.io']);
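The difference between the domain filters and the host filters can be sketched in plain PHP. The two functions below are hypothetical helpers for illustration, not the library's implementation. In particular, naiveDomainOf() simply takes the last two labels of the host, which is wrong for suffixes such as "co.uk"; a real implementation would need the Public Suffix List.

```php
<?php

// The host is the full host name of a URL, including any subdomain.
function hostOf(string $url): string
{
    return strtolower((string) parse_url($url, PHP_URL_HOST));
}

// Naive assumption: the registrable domain is the last two labels of the host.
function naiveDomainOf(string $url): string
{
    $labels = explode('.', hostOf($url));

    return implode('.', array_slice($labels, -2));
}

$base = 'https://www.example.com/listing';
$link = 'https://blog.example.com/post/1';

// Different subdomains, so the hosts differ.
var_dump(hostOf($base) === hostOf($link));               // bool(false)

// Both hosts belong to example.com, so the domains are equal.
var_dump(naiveDomainOf($base) === naiveDomainOf($link)); // bool(true)
```

Under these assumptions, a link from www.example.com to blog.example.com would pass a same-domain filter but not a same-host filter.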
Selecting Data
The main method to select data is extract(), but you always have to use it in combination with one of: root, each, first or last.
Html::root()->extract('h1');
Html::root()->extract(['title' => 'h1', 'date' => '#main .date']);
Html::each('#listing .item')->extract(['title' => 'h1', 'date' => '#main .date']);
Html::first('#listing .item')->extract(['title' => 'h1', 'date' => '#main .date']);
Html::last('#listing .item')->extract(['title' => 'h1', 'date' => '#main .date']);
The example should make this pretty clear: root is used to extract data from the root of the document, while each, first and last are all used to extract data from a list of similar items. each is the only one that yields multiple outputs.
The extract method can be used with a single selector or an array of selectors with keys to name the data properties being extracted.
Accessing other Node Values
By default, the CSS selectors return the text of the selected node. But of course you can also get other values:
Html::last('#listing .item')->extract([
    'default' => Dom::cssSelector('.default')->text(),
    'foo' => Dom::cssSelector('.foo')->innerText(),
    'bar' => Dom::cssSelector('.bar')->html(),
    'baz' => Dom::cssSelector('.baz')->outerHtml(),
    'test' => Dom::cssSelector('.test')->attribute('data-test'),
]);
text
You don't have to use this explicitly; it's the default when you only provide the selector as a string. It gets the text inside the node, including the text of its children.
innerText
Gets only the text directly inside the node. Excludes text from child nodes.
html
Gets the HTML source inside the selected element.
outerHtml
Gets the HTML of the selected element, including the element itself.
attribute(x)
Gets the value of attribute x of the selected element.
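To make the differences between these values concrete, here is a sketch using PHP's built-in DOM extension instead of the library. It is not how the library implements these methods (details such as whitespace trimming may differ); the example markup and variable names are made up for illustration.

```php
<?php

$doc = new DOMDocument();
$doc->loadHTML('<div class="item" data-test="42">Hello <b>World</b></div>');

$node = (new DOMXPath($doc))->query('//div[@class="item"]')->item(0);

// text: the text inside the node, including text of child nodes.
$text = $node->textContent;                     // "Hello World"

// innerText: only text nodes that are direct children of the node.
$innerText = '';
foreach ($node->childNodes as $child) {
    if ($child->nodeType === XML_TEXT_NODE) {
        $innerText .= $child->nodeValue;
    }
}                                               // "Hello " (text of <b> is excluded)

// html: the markup inside the element, without the element itself.
$html = '';
foreach ($node->childNodes as $child) {
    $html .= $doc->saveHTML($child);
}

// outerHtml: the markup of the element including the element itself.
$outerHtml = $doc->saveHTML($node);

// attribute(x): the value of one attribute of the element.
$attribute = $node->getAttribute('data-test');  // "42"

print_r(compact('text', 'innerText', 'html', 'outerHtml', 'attribute'));
```

The nested <b> element is what separates the variants: its text shows up in text, its markup shows up in html and outerHtml, and innerText leaves it out entirely.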
Converting Relative Paths to Absolute Links in Extracted Data
Html::getLink() and Html::getLinks() are mainly there to get (absolute) URLs to follow, to then extract data from the pages behind those links. What if you want to get absolute links within the data you're extracting from a page? The DomQuery class (the abstract base class behind Dom::cssSelector() and Dom::xPath()) has a method toAbsoluteUrl() that will convert the selected value to an absolute URL:
$crawler = new MyCrawler();
$crawler->input('https://www.example.com/foo')
    ->addStep(Http::get())
    ->addStep(
        Html::each('#listing .row')
            ->extract([
                'title' => 'a.title',
                'url' => Dom::cssSelector('a.title')->attribute('href')->toAbsoluteUrl(),
            ])
    );
// Run crawler and process results
For this to work, the step immediately before the step that extracts the data needs to be an HTTP loading step.
Using XPath instead of CSS Selectors
The Xml and Html steps both have the same base class (Dom) that behind the scenes uses the Symfony DomCrawler to extract data. By default, Html steps use CSS selectors and Xml steps use XPath queries. But if you want to, you can also use XPath for Html:
Html::each(Dom::xPath('//div[@id=\'bookstore\']/div[@class=\'book\']'))
    ->extract([
        'title' => Dom::xPath('//h3[@class=\'title\']'),
        'author' => Dom::xPath('//*[@class=\'author\']'),
        'year' => Dom::xPath('//span[@class=\'year\']'),
    ]);
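To see what such queries select, the following sketch runs similar XPath expressions with PHP's built-in DOMXPath on a made-up document. It does not use the library, and the sample markup is an assumption for illustration. Note that with plain DOMXPath, a query starting with // searches the whole document even when a context node is passed, so the per-item queries in this sketch start with .// instead.

```php
<?php

$html = <<<HTML
<div id="bookstore">
  <div class="book"><h3 class="title">Book A</h3><span class="year">1999</span></div>
  <div class="book"><h3 class="title">Book B</h3><span class="year">2005</span></div>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$books = [];

// One iteration per matched item, comparable to what each() does with a list.
foreach ($xpath->query("//div[@id='bookstore']/div[@class='book']") as $book) {
    $books[] = [
        // string(...) returns the text of the first node matched inside $book.
        'title' => $xpath->evaluate("string(.//h3[@class='title'])", $book),
        'year' => $xpath->evaluate("string(.//span[@class='year'])", $book),
    ];
}

print_r($books);
```

Each entry in $books corresponds to one output that an each() step would yield for this document, with the array keys naming the extracted properties.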