Defining a Crawling Procedure
When your crawler class is set up, you can instantiate it and start configuring the procedure it should run:
$myCrawler = new MyCrawler();
// Provide initial input, add steps and finally run it.
Provide Initial Input
You can provide a single initial input by using the input() method:
$myCrawler->input('https://www.crwlr.software/packages');
Or provide multiple initial inputs by calling input() multiple times:
$myCrawler->input('https://www.crwlr.software/packages/url');
$myCrawler->input('https://www.crwlr.software/packages/crawler');
Or provide multiple initial inputs as an array, by using the inputs() method:
$myCrawler->inputs([
    'https://www.crwlr.software/packages/url',
    'https://www.crwlr.software/packages/crawler',
]);
The inputs() method adds to the inputs already provided instead of replacing them, so you can call both methods multiple times and nothing will get lost.
Add Steps
Steps are the central building blocks of your crawlers. To understand how data flows through the steps of your crawler, read Steps and Data Flow. Check out the Included Steps to see what the steps shipped with the package can do for you. If you need to build your own custom step, see the documentation on custom steps.
To add a step to your crawler, use the addStep() method:
use Crwlr\Crawler\Steps\Loading\Http;
$myCrawler->addStep(Http::get());
The method returns the crawler instance itself, so you can also chain addStep() calls:
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
$myCrawler->addStep(Http::get())
    ->addStep(Html::each('#list .item')->extract(['url' => 'a']))
    ->addStep(new MyCustomStep());
Choosing a Key from Array Input
When the output of a previous step is an array but the next step needs only a certain element of that array as its input, you can choose that array key by using the Step::useInputKey() method:
use Crwlr\Crawler\Steps\Dom;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;
$myCrawler->addStep(Http::get())
    ->addStep(
        Html::each('#list .item')
            ->extract([
                'title' => 'a.title',
                'url' => Dom::cssSelector('a.title')->attribute('href')->toAbsoluteUrl(),
            ])
            ->addToResult()
    )
    ->addStep(
        Http::get()->useInputKey('url')
    );
The Html step produces array outputs like ['title' => '...', 'url' => '...'] and the following Http::get() step uses only the url from those arrays as its input.
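To make the effect of choosing an input key concrete, here is a plain PHP sketch that does not use the library at all. The helper pickInputKey() and the sample data are hypothetical and exist only for this illustration. Skipping outputs that lack the key is an assumption of the sketch, not a statement about how the library behaves.

```php
<?php

/**
 * Hypothetical helper (not part of the library): yields only the element
 * stored under $key from each array output of a previous step.
 */
function pickInputKey(iterable $outputs, string $key): Generator
{
    foreach ($outputs as $output) {
        // Assumption of this sketch: outputs without the key are skipped.
        if (is_array($output) && array_key_exists($key, $output)) {
            yield $output[$key];
        }
    }
}

// Made-up outputs, shaped like the ones the Html step produces above.
$outputs = [
    ['title' => 'Url', 'url' => 'https://www.crwlr.software/packages/url'],
    ['title' => 'Crawler', 'url' => 'https://www.crwlr.software/packages/crawler'],
];

// Collect the selected values as a plain list (keys are not preserved).
$inputsForNextStep = iterator_to_array(pickInputKey($outputs, 'url'), false);

print_r($inputsForNextStep);
```

Only the url values reach the next step here; the title values stay behind in the original output arrays.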
Getting/Handling Result Data
Once you've added the steps that your crawler should perform, you can run it using either run() or runAndTraverse(). Note that the Crawler class internally uses generators to be as memory efficient as possible. This means you need to iterate over the Generator that the run() method returns; otherwise it won't do anything.
foreach ($myCrawler->run() as $result) {
    // $result is an instance of Crwlr\Crawler\Result
}
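The reason nothing happens without iteration is a general property of PHP generators: calling a generator function only creates a Generator object, and its body starts running when iteration begins. The following standalone sketch shows this with a hypothetical fakeRun() function that stands in for a crawler run; it is not library code.

```php
<?php

// Hypothetical stand-in for a crawler run: a generator that records
// in $log what it has done so far.
function fakeRun(array &$log): Generator
{
    $log[] = 'started';

    foreach (['a', 'b'] as $item) {
        $log[] = 'yielding ' . $item;

        yield $item;
    }
}

$log = [];

// Calling the function only creates the Generator; its body has not run yet.
$generator = fakeRun($log);

$logBeforeIteration = $log; // still an empty array at this point

$results = [];

// Iterating is what actually executes the generator's code.
foreach ($generator as $result) {
    $results[] = $result;
}

print_r($log);
```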
When you don't need to receive the results where you're calling the crawler (e.g. because you defined a store), you can use runAndTraverse() instead:
$myCrawler->setStore(new MyStore());
$myCrawler->runAndTraverse();
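Conceptually, this pattern means that something still has to walk through the generator, but the caller discards the yielded values because a store has already received them. The plain PHP sketch below illustrates the idea under that assumption. InMemoryStore, runWithStore() and traverse() are hypothetical names and do not show the library's actual implementation.

```php
<?php

// Hypothetical store: keeps every result it is handed in memory.
final class InMemoryStore
{
    public array $stored = [];

    public function store(string $result): void
    {
        $this->stored[] = $result;
    }
}

// Hypothetical run: hands each result to the store, then yields it.
function runWithStore(array $items, InMemoryStore $store): Generator
{
    foreach ($items as $item) {
        $store->store($item);

        yield $item;
    }
}

// Walks through the generator and ignores the yielded values.
function traverse(Generator $generator): void
{
    foreach ($generator as $ignored) {
        // Nothing to do here, the store already has the result.
    }
}

$store = new InMemoryStore();

traverse(runWithStore(['first', 'second'], $store));

print_r($store->stored);
```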
Memory Usage
Crawlers typically deal with large amounts of data, which is why the library uses generators wherever possible to keep memory usage low.
If your crawler still needs a bit more memory than your current PHP config allows, the Crawler class contains two convenient helper methods to get the current memory limit and to set a higher limit, if the PHP installation allows it.
use Crwlr\Crawler\Crawler;
Crawler::getMemoryLimit();
// Wrapper for ini_get('memory_limit'), returns a string such as 512M
Crawler::setMemoryLimit('1G');
// Wrapper for ini_set('memory_limit', <value>), returns either the prev.
// limit as string or false on failure.
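Because the limit comes back as a shorthand string, comparing limits requires converting them to bytes first. The sketch below uses PHP's own ini_get() and ini_set() directly rather than the Crawler wrappers. memoryLimitToBytes() is a hypothetical helper written for this example, and it only handles the K, M and G suffixes and the value -1 (no limit).

```php
<?php

/**
 * Hypothetical helper: converts a memory_limit string such as "512M" or
 * "1G" to bytes. The special value "-1" (no limit) is returned as -1.
 */
function memoryLimitToBytes(string $limit): int
{
    $limit = trim($limit);

    if ($limit === '-1') {
        return -1;
    }

    $number = (int) $limit; // leading digits, e.g. 512 for "512M"
    $unit = strtoupper(substr($limit, -1));

    switch ($unit) {
        case 'G':
            return $number * 1024 ** 3;
        case 'M':
            return $number * 1024 ** 2;
        case 'K':
            return $number * 1024;
        default:
            return $number; // plain number of bytes
    }
}

$currentBytes = memoryLimitToBytes(ini_get('memory_limit'));
$wantedBytes = memoryLimitToBytes('1G');

// Only raise the limit: never lower it, and leave "no limit" untouched.
if ($currentBytes !== -1 && $currentBytes < $wantedBytes) {
    // Returns the previous limit as a string, or false on failure.
    $previous = ini_set('memory_limit', '1G');
}

echo ini_get('memory_limit'), PHP_EOL;
```

Checking before setting avoids accidentally lowering a limit that is already higher than the one you ask for.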
If you think your crawler is consuming too much memory, you can also monitor its memory usage while it's running via log messages:
$myCrawler->monitorMemoryUsage();
The crawler then writes a log message with the current memory usage in bytes (using memory_get_usage()) after each invocation of a step with a single input.
The method also accepts one parameter, a threshold in bytes. When it is set, messages are only logged when the usage exceeds that value:
$myCrawler->monitorMemoryUsage(1000000000);
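The threshold logic can be pictured with a few lines of plain PHP. This is a minimal sketch built on memory_get_usage(), not the library's code; memoryUsageMessage() and the message format are hypothetical.

```php
<?php

/**
 * Hypothetical helper: returns a log line with the current memory usage,
 * or null when a threshold is given and the usage does not exceed it.
 */
function memoryUsageMessage(?int $ifAboveBytes = null): ?string
{
    $usage = memory_get_usage(); // bytes currently allocated to this script

    if ($ifAboveBytes !== null && $usage <= $ifAboveBytes) {
        return null; // below or at the threshold: nothing to log
    }

    return 'memory usage: ' . $usage . ' bytes';
}

// Without a threshold, a message is always produced.
echo memoryUsageMessage(), PHP_EOL;

// With a threshold of 1,000,000,000 bytes, a message is only produced
// when the usage exceeds it.
$message = memoryUsageMessage(1000000000);

if ($message !== null) {
    echo $message, PHP_EOL;
}
```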