HTTP Steps
The Http step uses the LoadingStep trait and automatically receives the crawler's Loader when it is added to the crawler.
HTTP Requests
There are static methods to get steps for all the different HTTP methods:
use Crwlr\Crawler\Steps\Loading\Http;
Http::get();
Http::post();
Http::put();
Http::patch();
Http::delete();
They all have optional parameters for headers, body (if available for method) and HTTP version:
use Crwlr\Crawler\Steps\Loading\Http;
use Psr\Http\Message\StreamInterface;
Http::get(array $headers = [], string $httpVersion = '1.1');
Http::post(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1',
);
Http::put(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1',
);
Http::patch(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1',
);
Http::delete(
    array $headers = [],
    string|StreamInterface|null $body = null,
    string $httpVersion = '1.1',
);
Getting Headers and/or Body from previous Step
By default, if the step receives array input, it looks for the keys url or uri and uses the value as the request URL. To be as flexible as possible, the Http steps can also receive headers and a body, in addition to the URL, from the outputs of a previous step. Let's say you have a MyCustomStep that produces outputs like:
[
    'link' => 'https://www.example.com',
    'someHeaderValue' => '123abc',
    'queryString' => 'foo=bar&baz=quz',
]
You can use those values as a specific HTTP request header and as the request body, like this:
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
    ->input('...')
    ->addStep(new MyCustomStep())
    ->addStep(
        Http::post()
            ->useInputKeyAsUrl('link')
            ->useInputKeyAsHeader('someHeaderValue', 'x-header-value')
            ->useInputKeyAsBody('queryString')
    );
As you can see, you can even map the output key to a certain header name.
You can also use an array from the output, containing multiple headers. Let's assume the output of MyCustomStep looks like:
[
    'link' => 'https://www.example.com',
    'customHeaders' => [
        'Accept' => 'text/html,application/xhtml+xml,application/xml',
        'Accept-Encoding' => 'gzip, deflate',
    ],
]
In this case you can add those headers to your request like this:
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
    ->input('...')
    ->addStep(new MyCustomStep())
    ->addStep(
        Http::post()
            ->useInputKeyAsUrl('link')
            ->useInputKeyAsHeaders('customHeaders')
    );
If you also define some headers statically when creating the step, dynamic headers from a previous step's outputs are merged with them:
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
    ->input('...')
    ->addStep(new MyCustomStep())
    ->addStep(
        Http::get(headers: ['Accept-Language' => 'de-DE'])
            ->useInputKeyAsUrl('link')
            ->useInputKeyAsHeaders('customHeaders')
    );
Watch out: usually the Http steps receive the request URL as scalar input, or you define which key from array input should be used by calling the step's useInputKey() method. When you also want to get headers and/or a body from the input, you have to use the useInputKeyAsUrl() method instead, because with useInputKey() all other values are thrown away before the step is invoked.
Error Responses
By default, error responses (HTTP status codes 4xx and 5xx) are not passed on to the next step in the crawler. If you also want to cascade error responses down to the next step, call the yieldErrorResponses() method:
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
    ->input('https://www.example.com/broken-link')
    ->addStep(
        Http::get()->yieldErrorResponses()
    )
    ->addStep(...);
Another default behavior is that crawlers keep on crawling after error responses (except for some special behavior in case of a 429 HTTP response, see the Politeness page). If it's important for your crawler that none of the requests fail, call the stopOnErrorResponse() method, and the step will throw a LoadingException if it receives an error response.
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
    ->input('https://www.example.com/broken-link')
    ->addStep(
        Http::get()->stopOnErrorResponse()
    )
    ->addStep(...);
Directly adding Response Data to the Result
After an HTTP request step, you'll usually have some step that extracts data from the response document. If you want to add properties of the response directly to the crawling result, you can use the output keys url, status, headers and body:
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
    ->input('https://www.example.com')
    ->addStep(
        Http::get()
            ->keep(['url', 'status', 'headers', 'body'])
    );
Paginating List Pages
A typical challenge when crawling is listings with items spread over multiple pages. A convenient way to solve this is to use paginators. Here's a simple example of how to use one:
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
    ->input('https://www.example.com/some/listing')
    ->addStep(
        Http::get()->paginate('#pages')
    );
As the first argument, the paginate() method takes either a CSS selector string or an instance of the Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator class. With a CSS selector, the method creates an instance of the SimpleWebsitePaginator class. You can use a CSS selector that selects either the link to the next page or the element containing all the pagination links. The SimpleWebsitePaginator remembers all the URLs it has already loaded, so it won't load any link twice. But keep in mind that pages may not be loaded in the correct order when you select a pagination wrapper element.
As the second argument, the paginate() method takes the maximum number of pages it will load. If you don't provide a value yourself, the default is 1000.
Query Params Paginator
Another paginator implementation shipped with the package is the Crwlr\Crawler\Steps\Loading\Http\Paginators\QueryParamsPaginator. It automatically increases or decreases the values of query parameters, either in the URL or in the request body (e.g. with POST requests).
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Loading\Http\Paginator;
$crawler = new MyCrawler();
$crawler
    ->input('https://www.example.com/list')
    ->addStep(
        Http::post(body: 'page=1&offset=0')
            ->paginate(
                Paginator::queryParams()
                    ->inBody() // or ->inUrl() when working with URL query params
                    ->increase('page')
                    ->increase('offset', 20)
            )
    );
In this example, the page query parameter is increased by one (the default increase value) after each request, and the offset parameter is increased by 20. You can also decrease parameter values using the decrease() method.
If you're dealing with a nested query string like pagination[page]=1&pagination[size]=25, you can use dot notation to define the query param to increase or decrease:
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Loading\Http\Paginator;
$crawler = new MyCrawler();
$crawler
    ->input('https://www.example.com/list?pagination[page]=1&pagination[size]=25')
    ->addStep(
        Http::get()
            ->paginate(
                Paginator::queryParams()
                    ->inUrl()
                    ->increase('pagination.page')
            )
    );
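To make the dot notation more tangible, here is a minimal plain PHP sketch of how a key like pagination.page can be resolved against a nested query string. The function increaseQueryParam() is a hypothetical helper written for this illustration; it is not part of the package and not necessarily how the QueryParamsPaginator works internally.

```php
<?php

/**
 * Hypothetical helper (not part of the crwlr API): increases the query
 * parameter addressed in dot notation, e.g. "pagination.page", by $by.
 */
function increaseQueryParam(string $query, string $dotKey, int $by = 1): string
{
    // "pagination[page]=1" becomes ['pagination' => ['page' => '1']].
    parse_str($query, $params);

    $target = &$params;

    // Walk down the nested array, one dot-separated segment at a time.
    foreach (explode('.', $dotKey) as $segment) {
        if (!is_array($target) || !array_key_exists($segment, $target)) {
            return $query; // Parameter not found: leave the query untouched.
        }

        $target = &$target[$segment];
    }

    $target = (int) $target + $by;

    unset($target); // Break the reference before building the new query string.

    return http_build_query($params);
}

$next = increaseQueryParam('pagination[page]=1&pagination[size]=25', 'pagination.page');

// Parse the result again, so the (URL encoded) brackets are easier to inspect.
parse_str($next, $parsed);
```

After running this, $parsed contains the page value 2, while the size value is unchanged.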
However, the issue with a paginator configured like this is that it keeps sending requests until it reaches the default limit of 1000 requests (you can customize this limit by passing it as a method argument: Paginator::queryParams(300)). What we want to do here is provide the paginator with a rule that determines, based on the received responses, when it should stop loading further pages.
Paginator Stop Rules
Suppose the example.com/list endpoint returns a JSON list of books, with book items stored in data.books. When we reach the end of the list, data.books becomes empty, and we want to stop loading any further pages. To achieve this, we can do:
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Loading\Http\Paginator;
use Crwlr\Crawler\Steps\Loading\Http\Paginators\StopRules\PaginatorStopRules;
$crawler = new MyCrawler();
$crawler
    ->input('https://www.example.com/list?page=1')
    ->addStep(
        Http::get()
            ->paginate(
                Paginator::queryParams()
                    ->inUrl()
                    ->increase('page')
                    ->stopWhen(PaginatorStopRules::isEmptyInJson('data.books'))
            )
    );
As you can see, you can define a so-called stop rule through the stopWhen() method. These stop rules are applicable to any paginator, as they are implemented in the AbstractPaginator class. The package includes several pre-defined stop rules, such as:
PaginatorStopRules::isEmptyResponse()
// Paginator stops when response body is empty.
PaginatorStopRules::isEmptyInJson('data.items')
// Paginator stops when response is empty, or data.items doesn't exist or is empty in JSON response.
PaginatorStopRules::isEmptyInHtml('#search .list .item')
// Paginator stops when response is empty, or the CSS selector `#search .list .item` does not select any nodes.
PaginatorStopRules::isEmptyInXml('channel item')
// Paginator stops when response is empty, or the CSS selector `channel item` does not select any nodes.
PaginatorStopRules::contains('a specific string')
// Paginator stops when response is empty, or the response body contains a specific string.
PaginatorStopRules::notContains('a specific string')
// Paginator stops when response is empty, or the response body does not contain a specific string.
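To illustrate what a rule like isEmptyInJson('data.items') checks, here is a minimal plain PHP sketch of the decision described above. The function isEmptyInJsonBody() is a hypothetical stand-in, not the library's implementation, and treating invalid JSON like an empty list is an assumption of this sketch.

```php
<?php

/**
 * Hypothetical helper (not the library's implementation): returns true if the
 * body is empty, or if the value at the dot notation path is missing or empty.
 */
function isEmptyInJsonBody(string $body, string $dotPath): bool
{
    if (trim($body) === '') {
        return true; // Empty response body.
    }

    $data = json_decode($body, true);

    if (!is_array($data)) {
        return true; // Assumption: invalid JSON is treated like an empty list.
    }

    // Follow the dot notation path, e.g. "data.books".
    foreach (explode('.', $dotPath) as $segment) {
        if (!is_array($data) || !array_key_exists($segment, $data)) {
            return true; // The key doesn't exist.
        }

        $data = $data[$segment];
    }

    return empty($data);
}

$shouldStopOnFilledPage = isEmptyInJsonBody('{"data":{"books":[{"title":"A"}]}}', 'data.books');
$shouldStopOnLastPage = isEmptyInJsonBody('{"data":{"books":[]}}', 'data.books');
```

With these inputs, the first call reports that pagination should continue and the second that it should stop.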
If your use case requires unique criteria, you can also supply a custom Closure.
use Crwlr\Crawler\Loader\Http\Messages\RespondedRequest;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Loading\Http\Paginator;
use Psr\Http\Message\RequestInterface;
$crawler = new MyCrawler();
$crawler
    ->input('https://www.example.com/list?page=1')
    ->addStep(
        Http::get()
            ->paginate(
                Paginator::queryParams()
                    ->inUrl()
                    ->increase('page')
                    ->stopWhen(function (RequestInterface $request, ?RespondedRequest $respondedRequest) {
                        // Decide, based on the $request and $respondedRequest objects provided to the
                        // callback, whether the paginator should stop. Return true to stop.
                        return true;
                    })
            )
    );
Custom Paginators
If the paginators shipped with the package don't fit your needs, you can write your own. A paginator has to extend the Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator class and implement at least a custom getNextRequest() method. Optionally, you can also implement custom versions of the methods processLoaded(), hasFinished() and logWhenFinished().
use Crwlr\Crawler\Loader\Http\Messages\RespondedRequest;
use Crwlr\Crawler\Steps\Loading\Http\AbstractPaginator;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\UriInterface;
use Psr\Log\LoggerInterface;
class CustomPaginator extends AbstractPaginator
{
    public function getNextRequest(): ?RequestInterface
    {
        // Let's say we paginate URLs with a path like this: /list-of-things/<pageNumber>
        $latestRequestUrlPath = $this->latestRequest->getUri()->getPath();
        $prevPageNumber = explode('/list-of-things/', $latestRequestUrlPath);
        // If the path doesn't contain the expected prefix, there is no next page.
        if (count($prevPageNumber) < 2) {
            return null;
        }
        $nextPageNumber = ((int) $prevPageNumber[1]) + 1;
        return $this->latestRequest->withUri(
            $this->latestRequest->getUri()->withPath('/list-of-things/' . $nextPageNumber)
        );
    }
}
You can then use your Paginator class like this:
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Loading\Http\Paginators\StopRules\PaginatorStopRules;
$crawler
    ->input('https://www.example.com/list-of-things/1')
    ->addStep(
        Http::get()
            ->paginate(
                // stopWhen() is implemented in AbstractPaginator, so it's called on the paginator.
                (new CustomPaginator())
                    ->stopWhen(PaginatorStopRules::isEmptyInHtml('#results .item'))
            )
    );
Crawling (whole Websites)
If you want to crawl a whole website, the Http::crawl() step is for you. By default, it just follows all the links it finds until everything on the same host is loaded.
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
    ->input('https://www.example.com/')
    ->addStep(Http::crawl());
Depth
You can also tell it to only follow links to a certain depth.
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->depth(2)
    );
This means it will load all the URLs it finds on the page from the initial input (in this case https://www.example.com/), then all the links it finds on those pages, and then it'll stop. With a depth of 3, it will load another level of newly found links.
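The depth logic can be sketched in plain PHP as a level-by-level traversal. This is only an illustration of the idea, not the library's implementation: crawlToDepth() and the in-memory $linkGraph (page URL => links found on that page) are hypothetical and stand in for actually loading pages.

```php
<?php

/**
 * Sketch of depth-limited link following over a hypothetical in-memory link
 * graph (page URL => links found on that page).
 *
 * @param array<string, string[]> $linkGraph
 * @return string[] URLs in the order they would be loaded
 */
function crawlToDepth(array $linkGraph, string $startUrl, int $maxDepth): array
{
    $loaded = [$startUrl => true];
    $currentLevel = [$startUrl];

    for ($depth = 1; $depth <= $maxDepth; $depth++) {
        $nextLevel = [];

        foreach ($currentLevel as $url) {
            foreach ($linkGraph[$url] ?? [] as $link) {
                if (!isset($loaded[$link])) {
                    $loaded[$link] = true; // Load each URL only once.
                    $nextLevel[] = $link;
                }
            }
        }

        // The links found on this level are the pages of the next level.
        $currentLevel = $nextLevel;
    }

    return array_keys($loaded);
}

$linkGraph = [
    '/' => ['/a', '/b'],
    '/a' => ['/a/1'],
    '/b' => ['/'],
    '/a/1' => ['/a/1/x'],
];

$depthTwo = crawlToDepth($linkGraph, '/', 2);
$depthThree = crawlToDepth($linkGraph, '/', 3);
```

In this graph, a depth of 2 stops after /a/1, and only a depth of 3 also reaches /a/1/x.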
Start with a sitemap
By using the inputIsSitemap() method, you can start crawling with a sitemap.
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
    ->input('https://www.example.com/sitemap.xml')
    ->addStep(
        Http::crawl()
            ->inputIsSitemap()
    );
The crawl step usually assumes that all input URLs will deliver HTML documents, so if you want to start crawling with a sitemap, the call to this method is necessary.
Load URLs on the same domain (instead of host)
As mentioned, by default the step loads only pages on the same host, for example www.example.com. If there's a link to https://jobs.example.com/foo, it won't follow that link, as it is on jobs.example.com. But you can tell it to also load all URLs on the same domain, using the sameDomain() method:
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->sameDomain()
    );
Only load URLs matching path criteria
There are two methods you can use to tell it to only load URLs with certain paths:
pathStartsWith()
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->pathStartsWith('/foo/')
    );
In this case it will only load found URLs where the path starts with /foo/, so for example https://www.example.com/foo/bar, but not https://www.example.com/other/bar.
pathMatches()
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->pathMatches('/\/bar\//')
    );
The pathMatches() method takes a regex to match the paths of found URLs. So in this case it will load all URLs containing /bar/ anywhere in the path.
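If you're unsure which paths a criterion covers, it can help to try it on a few sample paths first. The following plain PHP snippet is just such a check, using the same prefix and regex as above on some made-up paths; it does not use the library and makes no claim about how the step applies the criteria internally.

```php
<?php

// Made-up sample paths, only for this illustration.
$paths = ['/foo/bar', '/other/bar/baz', '/bar', '/foo/'];

// Same prefix as in the pathStartsWith() example.
$startsWithFoo = array_values(
    array_filter($paths, fn (string $path): bool => str_starts_with($path, '/foo/'))
);

// Same regex as in the pathMatches() example.
$containsBar = array_values(
    array_filter($paths, fn (string $path): bool => preg_match('/\/bar\//', $path) === 1)
);
```

Note that the regex requires a slash after bar, so of these sample paths only /other/bar/baz matches it, while /foo/bar and /bar do not.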
Custom Filtering based on URL or Link Element
The customFilter() method allows you to define your own callback function that will be called with any found URL or link:
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Url\Url;
$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->customFilter(function (Url $url) {
                return $url->scheme() === 'https';
            })
    );
So, this example will only load URLs where the URL scheme is https.
In case the URL was found in an HTML document (not in a sitemap), the Closure also receives the link element as a Crwlr\Crawler\Steps\Dom\HtmlElement instance as the second argument:
use Crwlr\Crawler\Steps\Dom\HtmlElement;
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Url\Url;
$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->customFilter(function (Url $url, ?HtmlElement $linkElement) {
                return $linkElement && str_contains($linkElement->text(), 'Foo');
            })
    );
So, this example will only load links when the link text contains Foo.
Load all URLs but yield only matching
When restricting crawling, e.g. to paths starting with /foo/, the step only loads matching URLs (after the initial input URL). So if some page /some/page contains a link to /foo/quz, that link won't be found, because /some/page is never loaded. If you want to find all links matching your criteria on the whole website, but yield only the responses of the matching URLs, you can use the loadAllButYieldOnlyMatching() method.
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
    ->input('https://www.example.com/')
    ->addStep(
        Http::crawl()
            ->pathStartsWith('/foo/')
            ->loadAllButYieldOnlyMatching()
    );
This works for restrictions defined using the path methods (pathStartsWith() and pathMatches()) and also for the customFilter() method. Of course, it doesn't affect depth or staying on the same host or domain.
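The difference between the two modes can be sketched in plain PHP over a small in-memory link graph. Everything here is hypothetical and only for illustration: crawlWithFilter() is not the library's implementation, and the sketch assumes that the start URL is yielded only if it matches the filter, too.

```php
<?php

/**
 * Sketch over a hypothetical in-memory link graph (page URL => links on it).
 *
 * @param array<string, string[]> $linkGraph
 * @return array{loaded: string[], yielded: string[]}
 */
function crawlWithFilter(array $linkGraph, string $startUrl, callable $matches, bool $loadAll): array
{
    $loaded = [];
    $yielded = [];
    $queue = [$startUrl];
    $seen = [$startUrl => true];

    while ($queue !== []) {
        $url = array_shift($queue);
        $loaded[] = $url;

        // Assumption of this sketch: only matching URLs are yielded.
        if ($matches($url)) {
            $yielded[] = $url;
        }

        foreach ($linkGraph[$url] ?? [] as $link) {
            // Without $loadAll, links that don't match are never loaded,
            // so links on those pages are never discovered.
            if (!isset($seen[$link]) && ($loadAll || $matches($link))) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }

    return ['loaded' => $loaded, 'yielded' => $yielded];
}

$linkGraph = [
    '/' => ['/foo/bar', '/some/page'],
    '/some/page' => ['/foo/quz'],
];

$startsWithFoo = fn (string $url): bool => str_starts_with($url, '/foo/');

$restricted = crawlWithFilter($linkGraph, '/', $startsWithFoo, false);
$unrestricted = crawlWithFilter($linkGraph, '/', $startsWithFoo, true);
```

In the restricted run /foo/quz is never discovered, because /some/page isn't loaded. In the second run /some/page is loaded but not yielded, and /foo/quz ends up in the yielded list.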
Use Canonical Links
If a website delivers the same content via multiple URLs (for example example.com/products?productId=123 and example.com/products/123), it can use canonical links to tell crawlers whether a page is a duplicate of another one and which one is the main URL. If you want to avoid loading the same document multiple times, you can tell the Http::crawl() step to use canonical links by calling its useCanonicalLinks() method.
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
    ->input('https://www.example.com/foo')
    ->addStep(
        Http::crawl()
            ->useCanonicalLinks()
    );
After calling that method, the step will not yield a response if its canonical link URL was already yielded before. If it discovers a link, and some document pointing to that URL via canonical link was already loaded, the newly discovered link is treated as if it was already loaded. Furthermore, this feature sets the canonical link URL as the effectiveUri of the response.
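As a rough illustration of the deduplication idea, here is a plain PHP sketch that reads the canonical link from HTML documents and yields only one document per canonical URL. The functions canonicalUrl() and yieldUniqueByCanonical() are hypothetical helpers, not the library's implementation, and the sketch assumes PHP's DOM extension is available.

```php
<?php

/**
 * Hypothetical helper: extracts the canonical link URL from an HTML document,
 * or returns null if the document doesn't define one.
 */
function canonicalUrl(string $html): ?string
{
    $document = new DOMDocument();

    $previous = libxml_use_internal_errors(true); // Tolerate real-world HTML.
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = (new DOMXPath($document))->query('//link[@rel="canonical"]/@href');

    if ($links === false || $links->length === 0) {
        return null;
    }

    return $links->item(0)->nodeValue;
}

/**
 * Yields only the first document per canonical URL.
 *
 * @param array<string, string> $documents loaded URL => HTML source
 * @return string[] URLs under which documents were yielded
 */
function yieldUniqueByCanonical(array $documents): array
{
    $yielded = [];

    foreach ($documents as $url => $html) {
        // Fall back to the loaded URL if there is no canonical link.
        $effectiveUrl = canonicalUrl($html) ?? $url;

        $yielded[$effectiveUrl] = true; // Duplicates collapse onto the same key.
    }

    return array_keys($yielded);
}

$product = '<html><head><link rel="canonical" href="https://example.com/products/123"></head><body></body></html>';

$documents = [
    'https://example.com/products?productId=123' => $product,
    'https://example.com/products/123' => $product,
    'https://example.com/about' => '<html><head></head><body></body></html>',
];

$yieldedUrls = yieldUniqueByCanonical($documents);
```

Both product URLs collapse onto the canonical URL, so the product document is yielded once, next to the about page.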
Keep URL Fragments
By default, the Http::crawl() step throws away the fragment part of all discovered URLs (example.com/path#fragment => example.com/path), because websites only very rarely respond with different content based on the fragment part. If a site that you're crawling does so, you can tell the step to keep the URL fragment by calling the keepUrlFragment() method.
use Crwlr\Crawler\Steps\Loading\Http;
$crawler
    ->input('https://www.example.com/something')
    ->addStep(
        Http::crawl()
            ->keepUrlFragment()
    );
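The effect of dropping fragments can be shown with a few lines of plain PHP. The helper normalizeDiscoveredUrl() is hypothetical and only sketches the idea; it is not the library's code.

```php
<?php

/**
 * Hypothetical helper: removes the fragment part of a URL,
 * unless $keepFragment is true.
 */
function normalizeDiscoveredUrl(string $url, bool $keepFragment = false): string
{
    if ($keepFragment) {
        return $url;
    }

    $fragmentStart = strpos($url, '#');

    return $fragmentStart === false ? $url : substr($url, 0, $fragmentStart);
}

$discovered = [
    'https://example.com/path#intro',
    'https://example.com/path#details',
];

// Without fragments, both links point to the same document.
$unique = array_values(array_unique(
    array_map(fn (string $url): string => normalizeDiscoveredUrl($url), $discovered)
));
```

Here $unique contains a single URL, so the document would be loaded only once. When fragments are kept, both URLs stay distinct.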
Post Browser Navigate Hooks
If your crawler’s loader uses the headless browser, the postBrowserNavigateHook() method allows you to define a callback function that runs immediately after navigating to the target URL but before reading the HTML source code. This allows you to interact with the loaded page before reading the state of the source code.
use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Loading\Http;
use HeadlessChromium\Page;
$crawler = HttpCrawler::make()->withBotUserAgent('MyBot');
// Switch the loader to the headless browser.
$crawler->getLoader()->useHeadlessBrowser();
$crawler
    ->input('https://www.example.com/foo')
    ->addStep(
        Http::get()
            ->postBrowserNavigateHook(function (Page $page) {
                $page->mouse()->find('#some_element')->click();
            }),
    );
This example returns a response containing the state of the HTML source code after programmatically clicking the element that matches the CSS selector #some_element. The Page object is part of the chrome-php/chrome library; see that library's documentation for details.
To make things even easier, the Crwlr\Crawler\Steps\Loading\Http\Browser\BrowserAction class provides several pre-built callback functions for this purpose.
use Crwlr\Crawler\Steps\Loading\Http;
use Crwlr\Crawler\Steps\Loading\Http\Browser\BrowserAction;
// Wait until the page contains an element matching the selector.
Http::get()
    ->postBrowserNavigateHook(
        BrowserAction::waitUntilDocumentContainsElement('#some_element'),
    );
// Click an element matching the selector.
Http::get()
    ->postBrowserNavigateHook(
        BrowserAction::clickElement('#some_element'),
    );
// Click an element matching the selector and wait for a page reload
// (e.g., if the clicked element is a link).
Http::get()
    ->postBrowserNavigateHook(
        BrowserAction::clickElementAndWaitForReload('#some_element'),
    );
// Run some JS code on the loaded page.
Http::get()
    ->postBrowserNavigateHook(
        BrowserAction::evaluate('document.getElementById("some_element").innerHTML = \'Hello\''),
    );
// Run some JS code on the loaded page and wait for a page reload.
Http::get()
    ->postBrowserNavigateHook(
        BrowserAction::evaluateAndWaitForReload('document.location.href = \'https://www.example.com/bar\''),
    );
// Wait for a specified number of seconds.
Http::get()
    ->postBrowserNavigateHook(
        BrowserAction::wait(2.5),
    );