PHP Composer Packages for Crawler and Scraper Development
crwlr.software is a collection of open-source PHP Composer packages that provide the tools you need to build web crawlers and scrapers. The crawler package contains everything and helps you build crawlers as fast as possible. There are also sub-packages that you can use standalone.
Packages
crwlr / crawler
The main package of this collection. It provides a kind of framework and many ready-to-use building blocks, so-called steps, for building your own web crawlers and scrapers.
crwlr / url
The Swiss Army knife for URLs. Parses URLs into components (scheme, host, domain, path, ...). You can access and modify URL components, compare components of different URLs, and resolve relative URLs to absolute ones. Also supports internationalized domain names.
crwlr / query-string
This library provides a very convenient API to create, access and manipulate query strings used in HTTP GET (as part of the URL) or POST (as part of the body) requests.
crwlr / robots-txt
Use this library within crawler and scraper programs to parse robots.txt files and check if your crawler user-agent is allowed to load certain paths.
crwlr / schema-org
This library finds schema.org structured data in JSON-LD format in HTML documents and converts it into PHP class instances representing those schema.org objects.
crwlr / html-2-text
This very easy-to-use package helps you convert HTML to well-formatted plain text.
Crawler Extension Packages
crwlr / crawler-ext-browser
This extension package for the crwlr/crawler library lets you use a headless browser for advanced functionality that goes beyond loading pages and getting the HTML after rendering.
Latest Blog Posts
Crawler v1.8: Paving the Way to a Better v2.0
2024-06-05
Version 1.8 of the crwlr/crawler package is out now, featuring important new functions that will replace existing ones in v2.0. There was one problem that I sometimes received negative feedback about and that I was unhappy with myself: the way composing crawling result data worked. I have now found a solution that I am quite happy with. The new functionality will lead to better performance, further minimized memory usage, and will hopefully be a lot easier to understand.
A Quickstart Tutorial on PHP Generators
2024-06-05
To optimize memory usage, the crwlr/crawler library leverages PHP's Generators. If you want to write a custom step for your crawler, the step must return a Generator. Since working with generators can be a bit tricky if you are new to them, this post offers an intro on how to use them and highlights common pitfalls to avoid.
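As a rough illustration of the idea, here is a minimal sketch in plain PHP. The function `nonEmptyLines()` is a hypothetical helper made up for this example; it is not part of crwlr/crawler and does not show the library's actual step API. It only demonstrates how a function that contains `yield` returns a Generator and produces values lazily, one at a time.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of crwlr/crawler): lazily yields one
 * trimmed, non-empty line at a time instead of building a full array.
 *
 * @param iterable<string> $lines
 * @return Generator<int, string>
 */
function nonEmptyLines(iterable $lines): Generator
{
    foreach ($lines as $line) {
        $trimmed = trim($line);

        if ($trimmed === '') {
            continue; // Blank lines are skipped, nothing is yielded for them.
        }

        yield $trimmed;
    }
}

// Calling the function only creates the Generator object.
// The function body does not run until iteration starts.
$generator = nonEmptyLines(["  first ", "", "second\n"]);

foreach ($generator as $line) {
    echo $line, PHP_EOL;
}
```

One pitfall worth knowing: a Generator can only be traversed once. If you need the values a second time, call the function again or collect them with `iterator_to_array()`, at the cost of holding everything in memory.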
Crwlr Recipes: How to Scan any Website for schema.org Structured Data Objects
2023-11-16
This is the first article of our "Crwlr Recipes" series, providing a collection of thoroughly explained code examples for specific crawling and scraping use-cases. This first article describes how you can crawl any website fully (all pages) and extract the data of schema.org structured data objects from all its pages, with just a few lines of code.
10 good Reasons to use the crwlr Library
2023-02-08
I'm very proud to announce that version 1.0 of the crawler package is finally released. This article gives you an overview of why you should use this library for your web crawling and scraping jobs.
What's new in crwlr / crawler v0.6?
2022-10-03
Version 0.6 is probably the biggest update so far, with a lot of new features and steps, from crawling whole websites and sitemaps to extracting metadata and schema.org structured data from HTML. Here is an overview of all the new stuff.
What's new in crwlr / crawler v0.5?
2022-09-03
We're already at v0.5 of the crawler package and this version comes with a lot of new features and improvements. Here's a quick overview of what's new.
Dealing with HTTP (Url) Query Strings in PHP
2022-06-02
There is a new package in town called query-string. It lets you create, access, and manipulate query strings for HTTP requests in a very convenient way. Here's a quick overview of what you can do with it and how it can be used via the url package.
What's new in crwlr / crawler v0.4
2022-05-10
Last Friday, version 0.4 of the crawler package was released with some pretty useful improvements. Read what shipped with this new minor update.
What's new in crwlr / crawler v0.2 and v0.3
2022-04-30
There are already two new 0.x versions of the crawler package. Here is a quick summary of what's new in versions 0.2 and 0.3.
Release of crwlr / crawler v0.1.0
2022-04-18
After months of hard work, today I'm finally releasing the first version (v0.1.0) of the crwlr / crawler package. Here is some information on what it is, its current state, and its current and future features.
Prevent Homograph Attacks using the crwlr / url Package
2022-01-19
Homograph attacks use internationalized domain names (IDNs) to create malicious links with domains that look like those of trusted organizations. You can use the crwlr Url class to detect and monitor URLs containing IDNs in your users' input.
Why I start crwlr.software
2018-04-15
This is just a short introduction to what crwlr.software is and will become in the future and why you may like it.