The Crawler
As pointed out on the getting started page, the first thing you need to do to build a crawler is to create a class extending the Crawler class, or better the HttpCrawler class.
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return BotUserAgent::make('MyBot');
    }
}
The minimum the HttpCrawler requires you to define is a user agent. The Crawler class also requires you to define a loader; the HttpCrawler uses the HttpLoader by default. You can read more about loaders here.
User Agents
User agents are very simple. The basic UserAgentInterface only defines that implementations need to have a __toString() method. The HttpLoader sends that string as the User-Agent HTTP header with every request.
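So, assuming the interface really only demands that one method as described above, a custom user agent is easy to write yourself. Here is a minimal sketch (the RandomBrowserUserAgent class is just an example name, not part of the package):
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class RandomBrowserUserAgent implements UserAgentInterface
{
    /**
     * @param string[] $userAgentStrings  A pool of browser user-agent strings to pick from.
     */
    public function __construct(private array $userAgentStrings) {}

    // The interface only requires that the instance can be cast to a string.
    public function __toString(): string
    {
        return $this->userAgentStrings[array_rand($this->userAgentStrings)];
    }
}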
Bot User Agents
If you want to be polite and identify as a bot, you can use the BotUserAgent to do so. It can be created with just the name (product token) of your bot, but optionally you can also add a URL where you provide information about your crawler, as well as a version number.
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return new BotUserAgent('MyBot', 'https://www.example.com/my-bot', '1.2');
    }
}
The __toString() method of the BotUserAgent will return this user-agent string:
Mozilla/5.0 (compatible; MyBot/1.2; +https://www.example.com/my-bot)
Non Bot User Agents
If you are, for example, just crawling your own website to check it for broken links, or you want to see what a site returns for a certain browser user agent, use the UserAgent class. You can provide any string as the user agent, and the crawler will ignore the robots.txt file.
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\UserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return new UserAgent(
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:99.0) Gecko/20100101 Firefox/99.0'
        );
    }
}
Simple Crawler Instance Shortcut
It may seem unnecessary to create a class just to define your user agent. Therefore, if you don’t need to customize anything else in your crawler, you can create your crawler instance with a single line of code:
use Crwlr\Crawler\HttpCrawler;
$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');
// or to get an instance with a normal (non bot) user agent
$crawler = HttpCrawler::make()->withUserAgent('Mozilla/5.0 (Macintosh,...) ...');
Loggers
Another dependency of crawlers is a logger. A crawler accepts any implementation of the PSR-3 LoggerInterface and by default uses the CliLogger shipped with the package, which simply echoes the log lines.
To use your own logger, just define the protected logger() method in your crawler:
use Crwlr\Crawler\HttpCrawler;
use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function logger(): LoggerInterface
    {
        return new MyLogger();
    }

    // user agent...
}
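The MyLogger class in the example above is just a placeholder. If you don't already have a PSR-3 logger (such as Monolog) at hand, a minimal sketch of such a class, assuming psr/log v3, could look like this:
use Psr\Log\AbstractLogger;

class MyLogger extends AbstractLogger
{
    // AbstractLogger routes all level methods (info(), error(), ...) to log().
    public function log($level, string|\Stringable $message, array $context = []): void
    {
        // Append every log line to a file instead of echoing it.
        file_put_contents(
            __DIR__ . '/crawler.log',
            '[' . strtoupper((string) $level) . '] ' . $message . PHP_EOL,
            FILE_APPEND
        );
    }
}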
The logger() method is called only once, in the constructor of the Crawler class, and the logger instance is then automatically handed over to every step that you add to the crawler.
Some of the included steps log information about what they are doing, or about problems and errors. In your custom steps, you can use the logger via $this->logger. The same applies in all callbacks that are bound to a step, like updateInputUsingOutput() in groups.
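As a rough sketch of what that can look like in a custom step (assuming a custom step extends the package's abstract Step class and implements the invoke() method, as described on the steps pages; treat the exact signature here as an assumption):
use Crwlr\Crawler\Steps\Step;
use Generator;

class MyStep extends Step
{
    protected function invoke(mixed $input): Generator
    {
        // The logger instance is handed to the step by the crawler.
        $this->logger->info('MyStep received input: ' . var_export($input, true));

        yield $input;
    }
}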