Getting Started
What is this Library for?
This library provides a framework and a lot of ready-to-use building blocks, so-called steps, that you can use to build your own crawlers and scrapers.
To give you an overview, here's a list of things that it helps you with:
- Crawler Politeness 😇 (respecting robots.txt, throttling,...)
- Load pages/resources (from URLs) using
  - a (PSR-18) HTTP client (the default is Guzzle)
  - or a headless browser (Chrome) to get the source after JavaScript execution
- Get absolute links from HTML documents 🔗
- Get sitemaps from robots.txt and get all URLs from those sitemaps
- Crawl (load) all pages of a website 🕷
- Use cookies (or don't) 🍪
- Use any HTTP method (GET, POST,...) and send any headers or body
- Iterate over paginated list pages 🔁
- Extract data from:
- Extract schema.org structured data in JSON-LD format from HTML documents
- Keep memory usage low by using PHP Generators 💪
- Cache HTTP responses during development, so you don't have to load pages again and again after every code change
- Get logs about what your crawler is doing (accepts any PSR-3 LoggerInterface)
- And a lot more...
What is the Difference between Crawling and Scraping?
Before diving into the library, let's have a look at the terms crawling and scraping. For most real-world use cases, those two things go hand in hand, which is why this library helps with and combines both.
What is a Crawler?
A (web) crawler is a program that loads (downloads) documents and follows the links in them to load the linked documents as well. A crawler could simply load every link it finds (and is allowed to load according to the robots.txt file); it would then load the whole internet, provided the URL(s) it starts with are not a dead end. Or it can be restricted to load only links matching certain criteria (on the same domain/host, URL path starts with "/foo",...) or only to a certain depth. A depth of 3 means 3 levels deep: links found on the initial URLs provided to the crawler are level 1, links found on those level 1 documents are level 2, and so on.
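The restriction by criteria and depth can be pictured with a small, library-independent sketch. It is not how this library implements crawling: the `crawl()` function, the in-memory `$linkGraph` (standing in for real HTTP loading and link extraction) and the `$sameHost` filter are all hypothetical names made up for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Breadth-first crawl over a hypothetical in-memory link graph.
 *
 * @param array<string, string[]> $linkGraph    URL => links found in that document
 * @param string[]                $startUrls    initial URLs given to the crawler
 * @param callable                $shouldFollow fn (string $url): bool
 * @return array<string, int>     loaded URL => level at which it was found
 */
function crawl(array $linkGraph, array $startUrls, int $maxDepth, callable $shouldFollow): array
{
    $levels = [];
    $queue = [];

    foreach ($startUrls as $url) {
        $levels[$url] = 0; // the initial URLs themselves are level 0
        $queue[] = $url;
    }

    while ($queue !== []) {
        $url = array_shift($queue);
        $level = $levels[$url];

        if ($level >= $maxDepth) {
            continue; // don't follow links beyond the configured depth
        }

        foreach ($linkGraph[$url] ?? [] as $link) {
            if (isset($levels[$link]) || !$shouldFollow($link)) {
                continue; // already loaded, or filtered out by the criteria
            }

            $levels[$link] = $level + 1;
            $queue[] = $link;
        }
    }

    return $levels;
}

// Made-up link graph: which links each document contains.
$linkGraph = [
    'https://www.example.com/' => ['https://www.example.com/foo', 'https://other.example.org/'],
    'https://www.example.com/foo' => ['https://www.example.com/foo/bar'],
    'https://www.example.com/foo/bar' => ['https://www.example.com/foo/bar/baz'],
];

// Criterion: only follow links on the same host.
$sameHost = fn (string $url): bool => parse_url($url, PHP_URL_HOST) === 'www.example.com';

$loaded = crawl($linkGraph, ['https://www.example.com/'], 2, $sameHost);

print_r($loaded);
```

With a depth of 2, the link to `/foo/bar/baz` is never followed because it would be level 3, and the link to the other host is skipped by the filter.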
What is a Scraper?
A scraper extracts data from a document. Crawling only gets you the documents that you're looking for, but in most use cases you also want to extract certain data from those documents, which is called scraping.
That being said: in this project the term crawling is preferred, but most of the time it also includes scraping. The class that you need to extend is called Crawler, but it's there for both crawling and scraping.
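To make the scraping half concrete, here is a minimal sketch that uses only PHP's built-in DOM extension rather than this library. The `scrape()` helper, the sample HTML and the XPath queries are assumptions made up for illustration; the library's own extraction steps are shown further below.

```php
<?php

declare(strict_types=1);

/**
 * Extract the text of the first node matching each XPath query.
 *
 * @param array<string, string> $queries property name => XPath query
 * @return array<string, string|null>
 */
function scrape(string $html, array $queries): array
{
    $document = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings instead of emitting them.
    libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($document);
    $data = [];

    foreach ($queries as $property => $query) {
        $node = $xpath->query($query)->item(0);
        $data[$property] = $node === null ? null : trim($node->textContent);
    }

    return $data;
}

// Made-up sample document, standing in for a page a crawler has loaded.
$html = <<<'HTML'
<html><body>
<article>
  <h1>Some Title</h1>
  <span class="date">2022-09-30</span>
  <span class="articleAuthor">Christian Olear</span>
</article>
</body></html>
HTML;

$data = scrape($html, [
    'title' => '//article//h1',
    'date' => '//article//*[@class="date"]',
    'author' => '//article//*[@class="articleAuthor"]',
]);

print_r($data);
```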
Requirements
Requires PHP version 8.1 or above.
Installation
composer require crwlr/crawler
Usage
To build your first crawler you need two things:
- First you need a crawler class where you define at least its user agent. For everything else the library has good defaults.
- And then you need to instantiate that class and define the crawling procedure by adding steps that it should perform.
Your crawler class must extend the Crawler or, better, the HttpCrawler class. As mentioned, with the HttpCrawler class you only need to define the user agent it should identify as. You can either use a BotUserAgent to identify as a bot, or just a normal UserAgent, which can be any browser user-agent string, for example. If you use a BotUserAgent, the HttpCrawler will automatically load and respect the robots.txt file for any host you're loading URLs on. The built-in politeness features are described in more detail in their own part of the documentation.
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return BotUserAgent::make('MyBot');
    }
}
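For comparison, a crawler that identifies with a normal UserAgent instead of a BotUserAgent might look roughly like the following sketch. It assumes that the UserAgent class lives in the same namespace as BotUserAgent and takes the full user-agent string as its only constructor argument; the class name MyBrowserLikeCrawler and the user-agent string are made up for illustration.

```php
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\UserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyBrowserLikeCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        // Assumption: UserAgent accepts the complete user-agent string in its constructor.
        return new UserAgent('Mozilla/5.0 (Macintosh) Browser');
    }
}
```

As described above, the automatic loading of robots.txt is tied to the BotUserAgent, so a crawler like this one should not be expected to apply it.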
A very simple example of a crawling procedure that extracts data from articles linked on a listing page looks like this:
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler->input('https://www.example.com/articles')
    ->addStep(Http::get())                                  // Load the listing page
    ->addStep(Html::getLinks('#artList .article a'))        // Get the links to the articles
    ->addStep(Http::get())                                  // Load the article pages
    ->addStep(
        Html::first('article')                              // Extract the data
            ->extract([
                'title' => 'h1',
                'date' => '.date',
                'author' => '.articleAuthor',
            ])
    );

foreach ($crawler->run() as $result) {
    // Do something with the Result
}
As you can see, a very central concept is the so-called "steps". The key to using this library is understanding how data flows through those steps.
Assuming the listing contains 3 articles, running this crawler via the command line will give you output like this:
08:57:40:123456 [INFO] Loaded https://www.example.com/robots.txt
08:57:41:123456 [INFO] Loaded https://www.example.com/articles
08:57:42:123456 [INFO] Loaded https://www.example.com/articles/1
08:57:43:123456 [INFO] Loaded https://www.example.com/articles/2
08:57:44:123456 [INFO] Loaded https://www.example.com/articles/3
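The way data flows through a chain of steps can be pictured with a simplified, library-independent model based on PHP generators. This is only a conceptual sketch, not the library's actual implementation: `runSteps()`, the step closures and the in-memory `$pages` array (standing in for HTTP loading) are hypothetical, and so are the article titles.

```php
<?php

declare(strict_types=1);

/**
 * Pass every input through a chain of steps.
 * Each step receives one input and yields zero or more outputs;
 * every output becomes an input of the next step.
 *
 * @param iterable<mixed> $inputs
 * @param callable[]      $steps
 */
function runSteps(iterable $inputs, array $steps): Generator
{
    $step = array_shift($steps); // null when no steps are left

    foreach ($inputs as $input) {
        if ($step === null) {
            yield $input; // no steps left: this is a final result
            continue;
        }

        // Feed each output of this step into the remaining steps, one at a time.
        foreach ($step($input) as $output) {
            yield from runSteps([$output], $steps);
        }
    }
}

// Hypothetical stand-in for the web: URL => already parsed document.
$pages = [
    'https://www.example.com/articles' => ['links' => [
        'https://www.example.com/articles/1',
        'https://www.example.com/articles/2',
        'https://www.example.com/articles/3',
    ]],
    'https://www.example.com/articles/1' => ['title' => 'Article 1'],
    'https://www.example.com/articles/2' => ['title' => 'Article 2'],
    'https://www.example.com/articles/3' => ['title' => 'Article 3'],
];

// One input (URL) produces one output (document).
$load = function (string $url) use ($pages): Generator {
    yield $pages[$url];
};

// One input (listing document) produces several outputs (links).
$getLinks = function (array $page): Generator {
    yield from $page['links'];
};

// One input (article document) produces one output (extracted data).
$extract = function (array $page): Generator {
    yield ['title' => $page['title']];
};

$results = [];

foreach (runSteps(['https://www.example.com/articles'], [$load, $getLinks, $load, $extract]) as $result) {
    $results[] = $result;
}

print_r($results);
```

The single input URL fans out into three links at the second step, so the chain produces three final results. Because each stage is a generator, only one item per stage is in flight at a time.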
The final crawling results that the run() method returns are wrapped in the Result class. You can get a certain result property via its get() method, or just use the toArray() method to get the result as an array.
foreach ($crawler->run() as $result) {
    $title = $result->get('title');

    // or

    $resultArray = $result->toArray();

    // array(3) {
    //   ["title"]=>
    //   string(10) "Some Title"
    //   ["date"]=>
    //   string(10) "2022-09-30"
    //   ["author"]=>
    //   string(15) "Christian Olear"
    // }
}
For more convenient processing of results, have a look at the stores feature.