Getting Started
The robots-txt package provides a parser for the Robots Exclusion Standard/Protocol. You can use this library in crawler/scraper programs to parse robots.txt files and check whether your crawler's user-agent is allowed to load certain paths. It also lets you get all the sitemap URLs listed in the file.
Requirements
Requires PHP version 8.0 or above.
Installation
Install the latest version with:
composer require crwlr/robots-txt
Usage
use Crwlr\RobotsTxt\RobotsTxt;
$robotsTxtContent = file_get_contents('https://www.crwlr.software/robots.txt');
$robotsTxt = RobotsTxt::parse($robotsTxtContent);
$robotsTxt->isAllowed('/packages', 'MyBotName');
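For example, a crawler might check each path before requesting it. This is just a minimal sketch; the paths, the bot name and the use of file_get_contents() for fetching are example assumptions, only parse() and isAllowed() come from the library as shown above.
$pathsToCrawl = ['/packages', '/blog', '/contact'];

foreach ($pathsToCrawl as $path) {
    if ($robotsTxt->isAllowed($path, 'MyBotName')) {
        // Fetch the page, e.g. with file_get_contents() or your preferred HTTP client.
        $html = file_get_contents('https://www.crwlr.software' . $path);
    }
}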
You can also check with an absolute URL. Note, however, that the library won't (and can't) verify that the host of your absolute URL matches the host the robots.txt file was loaded from, because it only receives the file's content and doesn't know where it came from.
$robotsTxt->isAllowed('https://www.crwlr.software/packages', 'MyBotName');
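Because of that, when working with absolute URLs you may want to verify the host yourself first. A minimal sketch using PHP's parse_url(); the expected host string here is just an example:
$url = 'https://www.crwlr.software/packages';

// Only ask the parser about the URL if it is on the same host the
// robots.txt file was loaded from (the host value is an example).
if (parse_url($url, PHP_URL_HOST) === 'www.crwlr.software') {
    $isAllowed = $robotsTxt->isAllowed($url, 'MyBotName');
}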
In robots.txt files, site owners can use a wildcard (*) user-agent when a rule should apply to any user-agent. If you want to know whether a page is explicitly disallowed for your user-agent, use the isExplicitlyNotAllowedFor() method.
$robotsTxt->isExplicitlyNotAllowedFor('/some/path', 'MyBotName');
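To illustrate the difference between the two checks, here is a small sketch against a hand-written robots.txt string containing only a wildcard group. The rules and the expected outcomes are assumptions based on the description above, not verified output:
$content = "User-agent: *\nDisallow: /private";

$robotsTxt = RobotsTxt::parse($content);

// Disallowed for any user-agent via the wildcard group.
$robotsTxt->isAllowed('/private', 'MyBotName');

// Presumably not "explicitly" disallowed here, because /private is only
// disallowed via the wildcard group, not in a group naming MyBotName.
$robotsTxt->isExplicitlyNotAllowedFor('/private', 'MyBotName');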
Sitemaps
To get all the sitemap URLs referenced in the robots.txt file, just call the sitemaps() method:
$robotsTxt->sitemaps();
// array(1) {
// [0]=>
// string(38) "https://www.crwlr.software/sitemap.xml"
// }
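If you then want to load the listed sitemaps, a minimal sketch could look like this; fetching with file_get_contents() is just for brevity, and parsing the sitemap XML is up to you:
foreach ($robotsTxt->sitemaps() as $sitemapUrl) {
    // Load each sitemap URL referenced in the robots.txt file.
    $sitemapXml = file_get_contents($sitemapUrl);
}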