Getting Started
This package provides a kind of framework and a lot of ready-to-use, so-called steps that you can combine to build your own crawlers and scrapers. But first, let's clarify the meaning of those two terms.
What's the Difference between Crawling and Scraping?
For most use cases, those two things go hand in hand, which is why this library helps with and combines both.
What is a Crawler?
A (web) crawler is a program that (down)loads documents and follows the links in them to load those as well. A crawler could load every link it finds (and is allowed to load according to the robots.txt file); it would then crawl the whole internet, provided the URL(s) it starts with aren't dead ends. Or it can be restricted to load only links matching certain criteria (on the same domain/host, URL path starts with "/foo", ...) or only to a certain depth. A depth of 3 means 3 levels deep: links found on the initial URLs provided to the crawler are level 1, and so on.
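To illustrate the depth levels, here is a minimal, library-agnostic sketch of a depth-limited crawl in plain PHP. The fetchDocument() and extractLinks() helpers are hypothetical placeholders, not part of this package:

function crawl(array $startUrls, int $maxDepth): void
{
    $queue = [];   // pairs of [url, depth]
    $visited = [];

    foreach ($startUrls as $url) {
        $queue[] = [$url, 0]; // the initial urls themselves
    }

    while ($queue !== []) {
        [$url, $depth] = array_shift($queue);

        if (isset($visited[$url])) {
            continue;
        }

        $visited[$url] = true;

        $document = fetchDocument($url); // hypothetical helper: load the document

        if ($depth >= $maxDepth) {
            continue; // don't follow links beyond the maximum depth
        }

        foreach (extractLinks($document) as $link) {
            $queue[] = [$link, $depth + 1]; // links found here are one level deeper
        }
    }
}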
What is a Scraper?
A scraper extracts data from a document. Crawling only gets you the documents you're looking for, but in most use cases you also want to extract certain data from those documents, which is called scraping.
That being said: in this project the term crawling is preferred, but most of the time it also includes scraping. The class that you need to extend is called Crawler, but it's there for both crawling and scraping.
Requirements
Requires PHP version 8.1 or above.
Installation
composer require crwlr/crawler
Usage
To build a crawler, you always need to make your own class extending the Crawler or HttpCrawler class. In a class extending the HttpCrawler, you need to at least define a user agent for your crawler, which can be a name for your crawler/bot or any browser user-agent string.
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return BotUserAgent::make('MyBot');
    }
}
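If you'd rather send a browser-style user-agent string instead of a bot name, the following sketch shows the same kind of crawler class using the package's UserAgent class. The assumption here is that your installed version ships Crwlr\Crawler\UserAgents\UserAgent accepting an arbitrary string; check the user agent docs for your version.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\UserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyBrowserCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        // Assumption: UserAgent takes any browser user-agent string via its constructor.
        return new UserAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36');
    }
}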
A very simple example that extracts data from some articles, reached via a listing page, would look like this:
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler->input('https://www.example.com/articles');

$crawler->addStep(Http::get())                           // Load the listing page
    ->addStep(Html::getLinks('#artList .article a'))     // Get the links to the articles
    ->addStep(Http::get())                               // Load the article pages
    ->addStep(
        Html::first('article')                           // Extract the data
            ->extract([
                'title' => 'h1',
                'date' => '.date',
                'author' => '.articleAuthor',
            ])
            ->addKeysToResult()
    );

foreach ($crawler->run() as $result) {
    // Do something with the Result
}
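Each result yielded by run() carries the extracted properties. A short sketch of working with them, assuming the Result class exposes get() and toArray() methods (verify against the version you installed):

foreach ($crawler->run() as $result) {
    // Assumption: Result::get() returns a single extracted property, Result::toArray() all of them.
    echo $result->get('title') . ' (' . $result->get('date') . ') by ' . $result->get('author') . PHP_EOL;

    $data = $result->toArray(); // e.g. hand this array to your own persistence layer
}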
You can see that a very central concept is the so-called "steps". A key thing to understand when using this library is how data flows through those steps.
Assuming the listing contains 3 articles, running this crawler via command line will give you an output like this:
08:57:40:123456 [INFO] Loaded https://www.example.com/robots.txt
08:57:40:234567 [INFO] Wait 0.0xs for politeness.
08:57:41:123456 [INFO] Loaded https://www.example.com/articles
08:57:41:234567 [INFO] Select links with CSS selector: #artList .article a
08:57:41:345678 [INFO] Wait 0.0xs for politeness.
08:57:42:123456 [INFO] Loaded https://www.example.com/articles/1
08:57:42:234567 [INFO] Extracted properties title, date, author from document.
08:57:42:345678 [INFO] Wait 0.0xs for politeness.
08:57:43:123456 [INFO] Loaded https://www.example.com/articles/2
08:57:43:234567 [INFO] Wait 0.0xs for politeness.
08:57:44:123456 [INFO] Loaded https://www.example.com/articles/3
08:57:44:234567 [INFO] Extracted properties title, date, author from document.
08:57:44:345678 [INFO] Extracted properties title, date, author from document.
You can see there's a lot already built in. By default, the HttpCrawler uses the PoliteHttpLoader, which sticks to the rules defined in a robots.txt file if the requested host has one. Further, it automatically assures that the crawler won't put too much load on the server being crawled by waiting a little between requests. The wait time depends on how long the latest request took to be answered, so if the server starts to respond more slowly, the crawler also waits longer between requests.
If you don't want to use those features, you can use a different Loader.
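As a sketch of how that could look, the class below overrides the crawler's loader to return a plain HttpLoader without the politeness features. The loader() method as the override point, the HttpLoader namespace and its constructor arguments are assumptions that may differ between versions, so check the Loader documentation for your installed version.

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Loader\Http\HttpLoader;
use Crwlr\Crawler\Loader\LoaderInterface;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
use Psr\Log\LoggerInterface;

class MyImpatientCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return BotUserAgent::make('MyBot');
    }

    // Assumed override point; namespace and signature may differ in your version.
    protected function loader(UserAgentInterface $userAgent, LoggerInterface $logger): LoaderInterface
    {
        return new HttpLoader($userAgent, logger: $logger);
    }
}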