A Quickstart Tutorial on PHP Generators

2024-06-05

To optimize memory usage, the crwlr/crawler library leverages PHP's Generators. If you want to write a custom step for your crawler, the step must return a Generator. Since working with generators can be a bit tricky if you're new to them, this post offers an intro on how to use them and highlights common pitfalls to avoid.

When to use Generators?

In most PHP code, generators aren't necessary and only add complexity. However, when your application may have to deal with large amounts of data, as crawlers do, generators can help you stay within PHP's memory limit.

The Basics

When a function returns an array like this:

function foo(): array
{
    return ['one', 'two', 'three'];
}

The array consists of three elements, but they are returned as one single return value, so the entire array is in memory at once. If the array is very large, this can become problematic.

Instead, the function could return a Generator like this:

function foo(): Generator
{
    yield 'one';

    yield 'two';

    yield 'three';
}

Here, each value is yielded one by one. This allows loops to iterate over the returned (yielded) items while keeping only one item in memory at a time.

Iterating over the items of the array or Generator looks the same in both cases:

foreach (foo() as $index => $item) {
    // process $item
}

However, the memory usage is different.
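
To make the difference visible, here is a rough sketch that measures how much memory is in use while iterating over an array versus a generator. The helper names (numbersAsArray(), numbersAsGenerator(), extraMemoryWhileIterating()) and the item count are made up for this illustration, and the exact byte counts depend on your PHP version and configuration.

```php
<?php

// Hypothetical helper: builds the complete array before returning it.
function numbersAsArray(int $count): array
{
    $numbers = [];

    for ($i = 0; $i < $count; $i++) {
        $numbers[] = $i;
    }

    return $numbers;
}

// Hypothetical helper: yields the same numbers one at a time.
function numbersAsGenerator(int $count): Generator
{
    for ($i = 0; $i < $count; $i++) {
        yield $i;
    }
}

// Returns the highest memory usage seen during the loop, relative to $baseline.
function extraMemoryWhileIterating(iterable $items, int $baseline): int
{
    $peak = $baseline;

    foreach ($items as $item) {
        $peak = max($peak, memory_get_usage());
    }

    return $peak - $baseline;
}

$baseline = memory_get_usage();
$arrayBytes = extraMemoryWhileIterating(numbersAsArray(100000), $baseline);

$baseline = memory_get_usage();
$generatorBytes = extraMemoryWhileIterating(numbersAsGenerator(100000), $baseline);

printf("array: %d bytes, generator: %d bytes\n", $arrayBytes, $generatorBytes);
```

On a typical setup, the array variant should report a value in the megabyte range, while the generator variant stays far below that, because it never holds more than the current number.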

A minor downside: Because only one item is in memory at a time, you can only use the function's return value by iterating over it. You can't work with foo()'s return value as if it were an array. For example, you can't access an item by index, like foo()[1].
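
If you do need array-style access, PHP's built-in iterator_to_array() collects everything a generator yields into an array. This gives up the memory benefit, so treat it as a sketch for small result sets only. The example repeats foo() from above so that it runs on its own, and it also shows that a generator can only be traversed once.

```php
<?php

function foo(): Generator
{
    yield 'one';
    yield 'two';
    yield 'three';
}

// Collect all yielded items into an array (all of them are in memory now).
$items = iterator_to_array(foo());

var_dump($items[1]); // string(3) "two"

// A generator is single-pass: once it has finished, it cannot be iterated again.
$generator = foo();

foreach ($generator as $item) {
    // The first pass works as usual.
}

try {
    foreach ($generator as $item) {
        // Never reached.
    }

    $secondPassWorked = true;
} catch (Exception $exception) {
    $secondPassWorked = false;
}

var_dump($secondPassWorked); // bool(false)
```

If you need to iterate a second time, call the generator function again to get a fresh Generator.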

That's the basic theory. Now, let's look at a more real-life example.

An Example That Can Consume a Lot of Memory

Imagine having something like this:

foreach (loadAllPages('https://www.example.com/list?page=1') as $loadedPage) {
    // Process the loaded page.
}

Here, https://www.example.com/list?page=1 is the first page of a paginated product listing, with pagination links to further list pages containing more products. The loadAllPages() function loads all of those list pages, which could be just a few or even thousands. It might look something like this:

function loadAllPages(string $firstPageUrl): array
{
    $pages = [];

    $pages[] = $currentPage = file_get_contents($firstPageUrl);

    $nextPageLink = getNextPageLink($currentPage); // Returns the URL from the "next page" link, or null if there is none.

    while ($nextPageLink) {
        $pages[] = $currentPage = file_get_contents($nextPageLink);

        $nextPageLink = getNextPageLink($currentPage);
    }

    return $pages;
}

As you can see, we are filling the $pages array with all the responses until all pages are loaded, resulting in a very large array that we return at the end. This approach will use a lot of memory if there are more than just a few pages.

You could rewrite the code to load and process each page individually, but as a codebase grows more complex, that isn't always feasible.

Rewriting It to Use Generators

Instead, we can change the function to return a Generator:

function loadAllPages(string $firstPageUrl): Generator
{
    $currentPage = file_get_contents($firstPageUrl);

    yield $currentPage;

    while ($nextPageLink = getNextPageLink($currentPage)) {
        $currentPage = file_get_contents($nextPageLink);

        yield $currentPage;
    }
}

Instead of collecting all items in an array and returning the whole array at the end, we immediately return (or yield) each item with the yield keyword.
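
One consequence worth seeing in action: the generator only loads the next page when the consuming loop asks for it. The sketch below swaps the HTTP requests for a hypothetical in-memory array of pages ($fakePages, with a 'next' key standing in for getNextPageLink()) so it runs without network access, and records the order of events in $log. The function name loadFakePages() is likewise invented for this example.

```php
<?php

// Hypothetical stand-in for the remote site: URL => page body and next URL.
$fakePages = [
    '/list?page=1' => ['body' => 'page 1', 'next' => '/list?page=2'],
    '/list?page=2' => ['body' => 'page 2', 'next' => '/list?page=3'],
    '/list?page=3' => ['body' => 'page 3', 'next' => null],
];

function loadFakePages(array $pages, string $firstPageUrl, ArrayObject $log): Generator
{
    $url = $firstPageUrl;

    while ($url !== null) {
        $log[] = 'load ' . $url;

        yield $pages[$url]['body'];

        $url = $pages[$url]['next'];
    }
}

$log = new ArrayObject();

foreach (loadFakePages($fakePages, '/list?page=1', $log) as $body) {
    $log[] = 'process ' . $body;

    if ($body === 'page 2') {
        break; // Stop early: page 3 is never loaded.
    }
}

print_r($log->getArrayCopy());
```

The log alternates between "load" and "process" entries and contains no entry for page 3, which shows that loading and processing are interleaved and that work not requested by the loop is never done.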

A Common Pitfall: Yield Does Not Stop Code Execution!

One important piece of advice when working with generators:
Forget the return-early pattern! Or at least be aware that yield doesn't stop code execution.

Imagine this function with return:

function foo(bool $foo = true): string
{
    if ($foo) {
        return 'foo';
    }

    return 'bar';
}

var_dump(foo());     // string(3) "foo"
var_dump(foo(false)); // string(3) "bar"

This function returns either foo or bar when called. If you simply replace return with yield and change the return type to Generator:

function foo(bool $foo = true): Generator
{
    if ($foo) {
        yield 'foo';
    }

    yield 'bar';
}

foreach (foo() as $returnValue) {
    var_dump($returnValue);
}

// Output:
// string(3) "foo"
// string(3) "bar"

It yields both foo and bar, because yield does not stop the function's execution.
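
Stepping through the generator by hand makes this visible. The following sketch uses the Generator methods current(), next(), and valid(); the function is renamed to fooThenBar() here only to keep the example self-contained.

```php
<?php

function fooThenBar(bool $foo = true): Generator
{
    if ($foo) {
        yield 'foo'; // Execution pauses here ...
    }

    yield 'bar'; // ... and resumes here when the next value is requested.
}

$generator = fooThenBar();

$first = $generator->current(); // Runs the function up to the first yield.
$generator->next();             // Resumes after the first yield.
$second = $generator->current();
$generator->next();             // Resumes again; the function body ends.

var_dump($first);              // string(3) "foo"
var_dump($second);             // string(3) "bar"
var_dump($generator->valid()); // bool(false)
```

A foreach loop performs the same steps for you, which is why it receives both values.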

Instead, either use else:

function foo(bool $foo = true): Generator
{
    if ($foo) {
        yield 'foo';
    } else {
        yield 'bar';
    }
}

Or, if the function is more complex than a simple if/else, add a bare return; after yielding a value to stop the generator:

function getAboveFiveAndStopWhenAboveTen(array $numbers): Generator
{
    foreach ($numbers as $number) {
        if ($number > 10) {
            yield $number;

            return;
        } elseif ($number > 5) {
            yield $number;
        }
    }
}

foreach (getAboveFiveAndStopWhenAboveTen([3, 6, 4, 13, 7]) as $returnValue) {
    var_dump($returnValue);
}

// Output:
// int(6)
// int(13)
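
Note that a return with a value does not add an item to the iteration either. Since PHP 7, a generator can return a final value, but foreach never sees it; it is only available through getReturn() once the generator has finished. A small sketch, with a made-up function name:

```php
<?php

function yieldFooReturnBar(): Generator
{
    yield 'foo';

    return 'bar'; // Not yielded: foreach will not receive this value.
}

$generator = yieldFooReturnBar();
$yielded = [];

foreach ($generator as $value) {
    $yielded[] = $value;
}

var_dump($yielded);                // Contains only "foo".
var_dump($generator->getReturn()); // string(3) "bar"
```

So if a value must reach the consuming loop, yield it first and then use a bare return; to stop.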

That's it!

In the end, generators aren't rocket science and are quite easy to use once you are aware of the pitfalls. Now, you can apply what you've learned and write your own custom steps to extend the crawler library for your purposes.