The Crawler
As pointed out on the getting started page, the first thing you need to do to build a crawler is to create a class extending the Crawler (or HttpCrawler) class.
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\UserAgents\BotUserAgent;
use Crwlr\Crawler\UserAgents\UserAgentInterface;
class MyCrawler extends HttpCrawler
{
protected function userAgent(): UserAgentInterface
{
return BotUserAgent::make('MyBot');
}
}
The minimum the HttpCrawler requires you to define is a user agent. You can read more about user agents here.
The Crawler class also requires you to define a loader. The HttpCrawler by default uses the PoliteHttpLoader. Read more about loaders here.
Another dependency for crawlers is a logger. It takes any implementation of the PSR-3 LoggerInterface and by default uses the CliLogger shipped with the package. Read more about loggers here.
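As a rough sketch, assuming the logger is defined by overriding a logger() method on the crawler class (see the loggers documentation for the exact way to do this), swapping in your own PSR-3 logger could look like the following; MyPsrLogger stands in for any LoggerInterface implementation:

use Psr\Log\LoggerInterface;

class MyCrawler extends HttpCrawler
{
    protected function userAgent(): UserAgentInterface
    {
        return BotUserAgent::make('MyBot');
    }

    // Assumption: returning a custom PSR-3 logger here replaces the default CliLogger.
    protected function logger(): LoggerInterface
    {
        return new MyPsrLogger(); // placeholder for your own LoggerInterface implementation
    }
}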
Configuring a Crawler Procedure
When your crawler class is set up, you can instantiate it and start configuring the procedure it should run:
$myCrawler = new MyCrawler();
// Provide initial input, add steps and finally run it.
Provide initial input
You can provide a single initial input by using the input() method:
$myCrawler->input('https://www.crwlr.software/packages');
Or provide multiple initial inputs by calling input() multiple times:
$myCrawler->input('https://www.crwlr.software/packages/url');
$myCrawler->input('https://www.crwlr.software/packages/crawler');
Or provide multiple initial inputs as an array, by using the inputs() method:
$myCrawler->inputs([
'https://www.crwlr.software/packages/url',
'https://www.crwlr.software/packages/crawler',
]);
The inputs() method also adds additional inputs, so you can use both methods multiple times and nothing will get lost.
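For example, the following calls (with placeholder URLs) leave the crawler with three initial inputs:

$myCrawler->input('https://www.example.com/foo');

$myCrawler->inputs([
    'https://www.example.com/bar',
    'https://www.example.com/baz',
]);

// The crawler will now start with all three URLs as initial inputs.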
Add steps
Steps are the central building blocks for your crawlers. To understand how the data flows through the steps of your crawler, read Steps and Data Flow. Check out the Included Steps to see what the steps included in the package can do for you. If you need to build your own custom step, read this.
To add a step to your crawler, simply use the addStep() method:
$myCrawler->addStep(Http::get());
The method returns the crawler instance itself, so you can also chain addStep() calls:
$myCrawler->addStep(Http::get())
->addStep(Html::each('#list .item')->extract(['url' => 'a']))
->addStep(new MyCustomStep());
Getting/Handling Result Data
When you've added the steps that your crawler shall perform, you can finally run it using one of the methods run() or runAndTraverse(). One thing to know is that the Crawler class internally uses generators to be as memory efficient as possible. This means you need to iterate the Generator that the run() method returns, otherwise it won't do anything.
foreach ($myCrawler->run() as $result) {
// $result is an instance of Crwlr\Crawler\Result
}
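If, for example, the extraction step above added a property called url, you can read it from each result. The following is only a sketch: get() and toArray() are assumed accessors here, so check the results documentation for the exact Result API:

foreach ($myCrawler->run() as $result) {
    $url = $result->get('url');    // assumed: read a single property from the result
    $data = $result->toArray();    // assumed: get the whole result as an array
}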
When you actually don't need to receive all the results at the place where you're calling the crawler (e.g. because you defined a store), you can just use runAndTraverse() instead:
$myCrawler->setStore(new MyStore());
$myCrawler->runAndTraverse();
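A minimal sketch of what such a store could look like, assuming a store only needs to implement a store() method that receives each Result (check the stores documentation for the exact base class and method signature):

use Crwlr\Crawler\Result;
use Crwlr\Crawler\Stores\Store;

class MyStore extends Store
{
    public function store(Result $result): void
    {
        // Persist each result however you like; toArray() is assumed to return the result data as an array.
        file_put_contents('results.jsonl', json_encode($result->toArray()) . PHP_EOL, FILE_APPEND);
    }
}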
Memory Usage
Crawlers are typically programs dealing with large amounts of data, which is why the library uses generators wherever possible to keep memory usage low.
If your crawler still needs a bit more memory than your current PHP config allows, the Crawler class contains two convenient helper methods to get the current memory limit and set a higher limit if the PHP installation allows it.
Crawler::getMemoryLimit();
// Wrapper for ini_get('memory_limit'), returns a string, e.g. 512M.

Crawler::setMemoryLimit('1G');
// Wrapper for ini_set('memory_limit', <value>), returns either the previous
// limit as a string, or false on failure.
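As setMemoryLimit() returns false when the limit can't be changed, you can react to a failed attempt, for example like this:

if (Crawler::setMemoryLimit('1G') === false) {
    // Raising the limit was not allowed by the PHP installation, continue with the current one.
    echo 'Could not raise memory limit, staying at ' . Crawler::getMemoryLimit() . PHP_EOL;
}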