What's new in crwlr / crawler v0.4
Last Friday, version 0.4 of the crawler package was released with some pretty useful improvements. Read what's shipped with this new minor update.
Step Output Filters
Any step that extends the Step class, shipped with the package, now has the where() and orWhere() methods that you can use to filter the step's outputs. Here's a quick example from the docs:
$crawler->addStep(
    Json::each('queenAlbums', ['title', 'year', 'chartsUK' => 'charts.uk', 'chartsUS' => 'charts.us'])
        ->where('year', Filter::greaterThan(1979))
        ->where('chartsUS', Filter::equal(1))
        ->orWhere('chartsUK', Filter::equal(1))
);
There are not only simple filter methods like equal, greaterThan, lessThan, and so on, but also string filters like stringContains and stringStartsWith, and even filters made specifically to filter urls by their components, like urlHost, urlDomain, urlPath, and so on.
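Just to sketch how those look in use (the JSON structure and keys in this example are made up, the filters are used the same way as in the example above):

$crawler->addStep(
    Json::each('queenAlbums', ['title', 'wikiUrl' => 'wikipedia.url'])
        // Only albums whose title contains the string 'Opera'.
        ->where('title', Filter::stringContains('Opera'))
        // And whose wiki url points to the English Wikipedia host.
        ->where('wikiUrl', Filter::urlHost('en.wikipedia.org'))
);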
Would you maybe like to contribute?
The list of available filters is actually not very big yet, and I can think of a lot of useful filters that would be nice to have here. If you have an idea and you'd consider contributing, adding some filter methods should be a rather easy task. Just reach out to me on Twitter if you have any questions about it.
New Constraints for Html::getLink() and Html::getLinks() Steps
These steps now have a few new methods to restrict the links they will find:
// Only links to urls on the same domain.
Html::getLinks()->onSameDomain();
// Only links to urls not on the same domain.
Html::getLinks()->notOnSameDomain();
// Only links to urls on (a) certain domain(s).
Html::getLinks()->onDomain('example.com');
Html::getLinks()->onDomain(['example.com', 'crwl.io']);
// Only links to urls on the same host (includes subdomain).
Html::getLinks()->onSameHost();
// Only links to urls not on the same host.
Html::getLinks()->notOnSameHost();
// Only links to urls on (a) certain host(s).
Html::getLinks()->onHost('blog.example.com');
Html::getLinks()->onHost(['blog.example.com', 'www.crwl.io']);
The steps know the url of the HTML document because they can only be used immediately after an Http step. This way you can get all the internal (same host/domain) or external (not same host/domain) links, or even all the links to any list of hosts/domains.
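For example, a minimal crawling pipeline using this could look like the following (assuming a crawler class MyCrawler like in the Csv example further below):

$crawler = new MyCrawler();

$crawler->input('https://www.example.com/');

// Load the page, then get only links staying on the same domain.
$crawler->addStep(Http::get());

$crawler->addStep(
    Html::getLinks()->onSameDomain()
);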
If you're not sure whether you should filter by host (includes subdomains like www) or by domain (only the registrable domain like example.com), consider the following: sometimes sites have parts of a website (that you'd consider one website) on separate subdomains, like jobs.example.com or blog.example.com. On the other hand, bigger organizations sometimes have actually different websites (e.g. for several companies of a group) on the same domain, which you maybe don't want to crawl. So there is no general answer for this; just have a look at the pages you'd like to crawl.
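In code, the difference looks like this (the domains are of course just placeholders):

// Follows links anywhere on the registrable domain, which also
// includes subdomains like jobs.example.com or blog.example.com.
Html::getLinks()->onDomain('example.com');

// Follows links on exactly these hosts and nothing else.
Html::getLinks()->onHost(['www.example.com', 'blog.example.com']);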
Stores now also get the Logger
The crawler automatically passes on the logger to all the steps you add, and from this version on it also does so for stores. This can be breaking (if you're wondering: 0.x versions can also contain breaking changes, as defined in semver) because the StoreInterface now also requires the addLogger() method. The new abstract Store class already implements it, so you can just extend it.
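A custom store could then look something like this. It's just a minimal sketch and assumes the abstract Store class makes the logger available as $this->logger:

use Crwlr\Crawler\Result;
use Crwlr\Crawler\Stores\Store;

class MyStore extends Store
{
    public function store(Result $result): void
    {
        // Available because the abstract Store class
        // already implements the addLogger() method.
        $this->logger->info('Storing a new result');

        // ...persist the $result data wherever you like...
    }
}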
Use Csv Step without Column Mapping
The Csv step can now also be used without defining a column mapping. In that case it uses the values from the first line as output array keys (so this makes sense when that line contains column headlines).
$csv = <<<CSV
id,firstname,surname
1,john,doe
2,jane,doe
CSV;

$crawler = new MyCrawler();

$crawler->input($csv);

$crawler->addStep(
    Csv::parseString()
        ->skipFirstLine()
        ->addKeysToResult()
);
This gets the following results:
array(3) {
  ["id"]=>
  string(1) "1"
  ["firstname"]=>
  string(4) "john"
  ["surname"]=>
  string(3) "doe"
}
array(3) {
  ["id"]=>
  string(1) "2"
  ["firstname"]=>
  string(4) "jane"
  ["surname"]=>
  string(3) "doe"
}
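And in case your CSV data comes without column headlines, you can of course still define the column mapping yourself, like before. A quick sketch (assuming the mapping is passed to parseString() as shown in the package docs):

$crawler->addStep(
    Csv::parseString(['id', 'firstname', 'surname'])
        ->addKeysToResult()
);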