XML Steps
The Xml step extends the same base class (Dom) as the Html step but uses XPath queries by default instead of CSS selectors. So selecting data from an XML document looks pretty much the same as selecting from HTML:
use Crwlr\Crawler\Steps\Xml;
Xml::root()->extract('//title');
Xml::root()->extract(['title' => '//title', 'author' => '//author']);
Xml::each('bookstore/book')->extract(['title' => '//title', 'author' => '//author']);
Xml::first('bookstore/book')->extract(['title' => '//title', 'author' => '//author']);
Xml::last('bookstore/book')->extract(['title' => '//title', 'author' => '//author']);
root is used to extract data from the root of the document. each, first and last are all used to extract data from a list of similar items. each is the only one that yields multiple outputs.
The extract method takes either a single XPath query or an array of queries with keys to name the data properties being extracted.
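To make the difference between each, first and last concrete, here is a minimal sketch in plain PHP using the built-in DOMDocument and DOMXPath classes. It is only an illustration of the idea, not how the Xml step is implemented. The sample XML and the helper extractFrom() are assumptions made up for this example. Note that plain DOMXPath needs the `.//` prefix to run a query relative to a context node; that is a property of DOMXPath and says nothing about how the Xml step resolves its queries.

```php
<?php
// Sample data invented for this sketch.
$xml = <<<XML
<bookstore>
  <book><title>Book A</title><author>Author A</author></book>
  <book><title>Book B</title><author>Author B</author></book>
</bookstore>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Hypothetical helper: runs every named query relative to $context and
// returns the trimmed text of the first match (or null) under the same key.
function extractFrom(DOMXPath $xpath, DOMNode $context, array $queries): array
{
    $row = [];
    foreach ($queries as $key => $query) {
        $node = $xpath->query($query, $context)->item(0);
        $row[$key] = $node === null ? null : trim($node->textContent);
    }
    return $row;
}

$queries = ['title' => './/title', 'author' => './/author'];
$books = iterator_to_array($xpath->query('/bookstore/book'));

// "each": one result per matching item.
$each = array_map(
    fn (DOMNode $book) => extractFrom($xpath, $book, $queries),
    $books
);

// "first" and "last": a single result, taken from one end of the list.
$first = extractFrom($xpath, $books[0], $queries);
$last = extractFrom($xpath, $books[count($books) - 1], $queries);

print_r($each);
print_r($first);
print_r($last);
```

With this sample document, $each holds two rows, while $first and $last hold one row each.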
Nesting Extracted Data
If you use the extract() method with a mapping array, you can also use another Xml step as a value to achieve nesting.
use Crwlr\Crawler\Steps\Xml;
Xml::each('//events/event')
    ->extract([
        'title' => '//name',
        'location' => '//location',
        'date' => '//date',
        'talks' => Xml::each('//talks/talk')->extract([
            'title' => '//title',
            'speaker' => '//speaker',
        ]),
    ]);
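The shape of data that such a nested mapping is meant to produce can be sketched with plain PHP DOM classes. This is an illustration under assumptions, not the library's implementation: the sample XML and the $text helper are invented for this example, and the queries are written in the relative form that plain DOMXPath requires.

```php
<?php
// Sample data invented for this sketch.
$xml = <<<XML
<events>
  <event>
    <name>Conf One</name>
    <location>Vienna</location>
    <date>2023-05-01</date>
    <talks>
      <talk><title>Talk A</title><speaker>Speaker A</speaker></talk>
      <talk><title>Talk B</title><speaker>Speaker B</speaker></talk>
    </talks>
  </event>
</events>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Hypothetical helper: trimmed text of the first match relative to $context.
$text = function (string $query, DOMNode $context) use ($xpath): ?string {
    $node = $xpath->query($query, $context)->item(0);
    return $node === null ? null : trim($node->textContent);
};

$events = [];
foreach ($xpath->query('//events/event') as $event) {
    // The inner loop plays the role of the nested step: it collects a list
    // of talks that becomes the value of the 'talks' key.
    $talks = [];
    foreach ($xpath->query('./talks/talk', $event) as $talk) {
        $talks[] = [
            'title' => $text('./title', $talk),
            'speaker' => $text('./speaker', $talk),
        ];
    }

    $events[] = [
        'title' => $text('./name', $event),
        'location' => $text('./location', $event),
        'date' => $text('./date', $event),
        'talks' => $talks,
    ];
}

print_r($events);
```

Each event becomes one array, and its 'talks' key holds a list of arrays, one per talk.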
Accessing other Node Values
By default, the XPath queries return the text of the selected node. But of course you can also get other values:
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Xml;
Xml::first('listing/item')->extract([
    'default' => Dom::xPath('//default')->text(),
    'foo' => Dom::xPath('//foo')->innerText(),
    'bar' => Dom::xPath('//bar')->html(),
    'baz' => Dom::xPath('//baz')->outerHtml(),
    'test' => Dom::xPath('//test')->attribute('test'),
]);
text
You don't have to use this explicitly; it's the default
when you only provide the selector as a string. It gets the
text inside the node, including the text of child nodes.
innerText
Gets only the text directly inside the node. Excludes text
from child nodes.
html
Gets the XML source inside the selected element.
outerHtml
Gets the XML source of the selected element including the
element itself.
attribute(x)
Gets the value inside attribute x of the selected element.
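The difference between these five values is easiest to see on a small element with mixed content. The following sketch reproduces each of them with PHP's built-in DOM classes. It is an approximation for illustration only: the sample element is invented, and the library may treat details such as surrounding whitespace differently.

```php
<?php
// Sample element invented for this sketch.
$doc = new DOMDocument();
$doc->loadXML('<item test="abc">Hello <b>world</b>!</item>');
$item = $doc->documentElement;

// Like text(): all text inside the node, including text of child elements.
$text = $item->textContent;

// Like innerText(): only the text nodes that are direct children.
$innerText = '';
foreach ($item->childNodes as $child) {
    if ($child->nodeType === XML_TEXT_NODE) {
        $innerText .= $child->nodeValue;
    }
}

// Like html(): the serialized source of everything inside the element.
$html = '';
foreach ($item->childNodes as $child) {
    $html .= $doc->saveXML($child);
}

// Like outerHtml(): the serialized source including the element itself.
$outerHtml = $doc->saveXML($item);

// Like attribute('test'): the value of the attribute named "test".
$attribute = $item->getAttribute('test');

var_dump($text);      // "Hello world!"
var_dump($innerText); // "Hello !"
var_dump($html);      // "Hello <b>world</b>!"
var_dump($outerHtml); // "<item test="abc">Hello <b>world</b>!</item>"
var_dump($attribute); // "abc"
```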
Using CSS selectors instead of XPath queries
By default, Xml steps use XPath queries, but if you want to, you can also use CSS selectors with Xml steps:
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Xml;
Xml::each(Dom::cssSelector('bookstore book'))
    ->extract([
        'title' => Dom::cssSelector('title'),
        'author' => Dom::cssSelector('author'),
        'year' => Dom::cssSelector('year'),
    ]);