Steps and Data Flow
Steps are the fundamental building blocks of your crawlers. There are many ready-to-use steps available, and you can also create custom ones. A crawler accepts any class that implements the StepInterface as a step.
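For a custom step, the usual route is to extend the package's abstract Step base class (which implements the StepInterface) and yield outputs from its invoke() method. A minimal sketch, assuming the Crwlr\Crawler\Steps\Step base class; the step itself is hypothetical:

```php
use Crwlr\Crawler\Steps\Step;

// Hypothetical custom step: takes a comma-separated string as input
// and yields each trimmed part as a separate output.
class SplitCommaSeparated extends Step
{
    protected function invoke(mixed $input): \Generator
    {
        foreach (explode(',', $input) as $part) {
            yield trim($part);
        }
    }
}
```

Because invoke() returns a generator, a single input can naturally lead to zero, one, or many outputs.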
When a crawler is run, it calls one step after another with some input. Typically, you manually define the initial inputs for the first step. Most often, this will be one or more URLs that need to be fetched.
$crawler = new MyCrawler();

$crawler->inputs([
    'https://www.crwlr.software/packages/url',
    'https://www.crwlr.software/packages/robots-txt'
]);
Any step can produce one, zero, or multiple outputs from a single input it receives.
Subsequent steps added to the crawler are called with the outputs of the previous step as their inputs.
So, the data (inputs and outputs) flows through the crawling procedure, cascading from one step to the next.
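This cascade can be pictured in plain PHP, with no crawler package involved: each step is a function turning one input into zero or more outputs, and every output of one step becomes an input of the next. A toy sketch with hypothetical step functions:

```php
// Each "step" maps one input to zero or more outputs, as a generator.
$steps = [
    function (int $n): \Generator {   // one input -> two outputs
        yield $n;
        yield $n + 1;
    },
    function (int $n): \Generator {   // keeps even numbers -> zero or one output
        if ($n % 2 === 0) {
            yield $n * 10;
        }
    },
];

$inputs = [1, 4]; // initial inputs, like those set via $crawler->inputs()

foreach ($steps as $step) {
    $outputs = [];

    foreach ($inputs as $input) {
        foreach ($step($input) as $output) {
            $outputs[] = $output;
        }
    }

    $inputs = $outputs; // outputs of this step become inputs of the next
}

// Step one turns [1, 4] into [1, 2, 4, 5]; step two keeps the even
// numbers and multiplies them by ten, so $inputs is now [20, 40].
```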
Example
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

$crawler = new MyCrawler();

$crawler->inputs([
    'https://www.crwlr.software/packages/url',
    'https://www.crwlr.software/packages/robots-txt'
]);

$crawler->addStep(Http::get())
    ->addStep(Html::getLinks('#versions a'))
    ->addStep(Http::get())
    ->addStep(
        Html::first('article')
            ->extract(['title' => 'h1'])
    );

foreach ($crawler->run() as $result) {
    // do something with the result
}
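Inside the loop, each result holds the properties extracted by the final step. Assuming the Result object's get() and toArray() methods from the package, reading the extracted title could look like this:

```php
foreach ($crawler->run() as $result) {
    // Read a single extracted property by its key...
    $title = $result->get('title');

    // ...or get all extracted properties as an array.
    var_dump($result->toArray());
}
```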