What's new in crwlr / crawler v0.6?
Version 0.6 is probably the biggest update so far, with a lot of new features and steps: from crawling whole websites and working with sitemaps, to extracting metadata and schema.org structured data from HTML. Here is an overview of all the new stuff.
ℹ️ First, an important note if you're already using the library:
0.x versions are still "development versions" and can potentially contain changes that break backwards compatibility. I try to avoid them and there won't be many, but this version contains one breaking change:
The `PoliteHttpLoader` (and the traits `WaitPolitely` and `CheckRobotsTxt`) have been removed. The politeness features are now baked into (dependencies of) the `HttpLoader`. Throttling (`WaitPolitely`) is done by default, but you can configure it, and your crawler loads and respects `robots.txt` files depending on whether you're using a `BotUserAgent`. More on this on the new Documentation Page about Politeness.
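As a minimal sketch (the class and bot name here are made up for illustration): if your crawler uses a `BotUserAgent`, like below, it loads and respects `robots.txt` files; throttling works out of the box either way, and how to configure it is covered in the politeness docs.

```php
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        // With a BotUserAgent the loader also fetches and respects
        // robots.txt; with a normal UserAgent it doesn't.
        return new BotUserAgent('MyBot');
    }
}
```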
Crawling whole Websites
The new `Http::crawl()` step allows you to easily crawl whole websites, and it also has a lot of options, like:
- Only crawl to a certain depth (see the second example below).
- Start from a sitemap.
- Stay on the same domain or on the same host.
- Only load URLs with certain paths.
- ...
In this example you start with a sitemap and load only URLs with a path starting with `/foo/`:
```php
use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/sitemap.xml')
    ->addStep(
        Http::crawl()
            ->inputIsSitemap()
            ->pathStartsWith('/foo/')
    );
```
You can read more about this feature here.
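And here is a sketch of the depth option mentioned in the list above. Take the method names `maxDepth()` and `sameDomain()` as assumptions on my part and check the linked docs for the exact API:

```php
use Crwlr\Crawler\Steps\Loading\Http;

$crawler->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->maxDepth(2)   // assumed name: follow links only two levels deep
            ->sameDomain()  // assumed name: don't leave the start domain
    );
```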
New Sitemap Steps
There are two new steps to work with sitemaps: one gets all the sitemap URLs listed in the `robots.txt` file of a website, and the other gets all the URLs (optionally also with additional data like priority) listed in a sitemap.
```php
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Sitemap;

$crawler->input('https://www.example.com/')
    ->addStep(Sitemap::getSitemapsFromRobotsTxt())
    ->addStep(Http::get())
    ->addStep(Sitemap::getUrlsFromSitemap());
```
Read more about these steps here.
Extracting Metadata and schema.org structured data from HTML documents
There are two new HTML steps to easily extract Metadata and schema.org structured data (in JSON-LD format) from HTML documents:
Extracting Metadata
```php
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Html;

$crawler->input('https://www.crwlr.software/')
    ->addStep(Http::get())
    ->addStep(
        Html::metaData()
            ->only(['title', 'description', 'og:image'])
    );
```
This step gets you all the data from `<meta>` tags which have a `name` or `property` attribute, and also the title from the `<title>` tag. Read more here.
Extracting schema.org structured data
```php
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Html;

$crawler->input('https://www.example.com/foo')
    ->addStep(Http::get())
    ->addStep(
        Html::schemaOrg()
            ->onlyType('JobPosting')
            ->extract([
                'title',
                'description',
                'company' => 'hiringOrganization.name',
            ])
    );
```
CSS selector first(), last(), nth(), even() and odd() methods
CSS has selectors like `:first-child`, `:nth-child(n)`, `:last-child` and so on. But they are easily misunderstood. For example: `#main a:first-child` does not simply select the first link inside the element with `id="main"`. It selects the first child element inside the `id="main"` element, and only if that first child is a link. So, when the first child element inside that element is, for example, a `<div>`, the selector won't match anything.
Now you can solve this like:
```php
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;

Html::root()->extract([
    'firstLink' => Dom::cssSelector('#main a')->first(),
]);
```
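The other methods from the headline work the same way. A quick sketch, assuming `nth()` takes the one-based position and `even()` and `odd()` match multiple elements:

```php
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;

Html::root()->extract([
    'lastLink' => Dom::cssSelector('#main a')->last(),   // last link inside #main
    'thirdItem' => Dom::cssSelector('#main li')->nth(3), // assumed: third matched list item
    'evenRows' => Dom::cssSelector('#main tr')->even(),  // assumed: all even matched rows
]);
```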