© Copyright 1997 The Perl Journal. Reprinted with permission.
Many of the objects we deal with in programming are at least conceptually infinite---the input from the Associated Press newswire, for example, or the log output from a web server, or the digits of pi. There's a general principle in programming that you should model things as simply and as straightforwardly as possible, so that if an object is infinite, you should model it as being infinite, with an infinite data structure.
Of course, you can't have an infinite data structure, can you? After all, the computer only has a finite amount of memory. But that doesn't matter. We're all mortal, and so we, and our programs, wouldn't really know an infinite data structure if we saw one. All that's really necessary is to have a data structure that behaves as if it were infinite.
A Unix pipe is a great example of such an object---think of a pipe that happens to be connected to the standard output of the yes program. From the man page:
yes prints the command line arguments, separated by spaces and followed by a newline, forever until it is killed.
The output of yes might not be infinite, but it's a credible imitation. So is the output of tail -f /var/log/syslog.
In this article I'll demonstrate a Perl data structure, the Stream, that behaves as if it were infinite. You can keep pulling data out of this data structure, and it might never run out. Streams can be filtered, just like Unix data streams can be filtered with grep, and they can be transformed and merged, just like Unix streams. Programming with streams is a lot like programming with pipelines in the shell---you can construct a simple stream, then transform and filter it to get the stream you really want. This means that if you're used to programming with pipelines, programming with streams can feel very familiar.
As an example of a problem that's easy to solve with streams, we'll look at:
Hamming wants an efficient algorithm that generates the list, in ascending order, of all numbers of the form 2i3j5k for i,j,k at least 0. This list is called the Hamming sequence. The list begins like this:
1 2 3 4 5 6 8 9 10 12 15 16 18 ...
Just for concreteness, let's say we want the first three thousand of these. This problem was popularized by Edsger Dijkstra.
There's an obvious brute force technique: Take the first number you haven't checked yet, divide it by 2's, 3's and 5's until you can't do that any more, and if you're left with 1, then the number should go on the list; otherwise throw it away and try the next number. So:
This obvious technique has one problem: it's unbelievably slow. The problem is that most numbers aren't on the list, and you waste an immense amount of time discovering that. Although the numbers at the beginning of the list are pretty close together, the 2,999th number in the list is 278,628,139,008. Even if you had enough time to wait for the brute-force algorithm to check all the numbers up to 278,628,139,008, think how much longer you'd have to wait for it to finally find the 3,000 number in the sequence, which is 278,942,752,080.
It can be surprisingly difficult to solve this problem efficiently with conventional programming techniques. But it turns out to be easy with the techniques in this article.
A stream is like the stream that comes out of a garden hose, except that instead of water coming out, data items come out, one after the other. The stream is like a source for data. Whenever you need another data item, you can pull one out of the stream, which will keep producing data on demand forever, or until it runs out. The key point is that unlike an array, which has all the data items stored away somewhere, the stream computes the data just as they're needed, at the moment your program asks for them, so that it never takes any more space or time than necessary. You can't have an array of all the odd integers, because it would have to be infinitely long and consume an infinite amount of memory. But you can have a stream of all the odd integers, and pull as many odd integers out of it as you need, because it only computes the odd numbers one at a time as you ask for them.
We'll return to Hamming's problem a little later, when we've seen streams in more detail.
Now, unlike a Perl list, a stream is more like a linked list, which means that it is made of nodes. Each node has two parts: The head, which contains a data item at the front of the stream, and the tail, which points to the next node in the stream. In Perl, we'll implement this as a hash with two members. If $node is such a hash, then $node{h} will be the head, and $node{t} will be the tail. The tail will usually be a reference to another such node. A stream will be a long linked list of these nodes, like this:
The stream ('foo', 3, 'bar', ...).
Now we still have the problem of how to have an infinite stream, because clearly we can't construct an infinite number of these nodes. But here's the secret: a stream node might not have a tail---the tail might not have been computed yet. If a stream doesn't have a tail, it has a promise instead. The promise is a promise from the program to you. The program promises to compute the next node if you ever need the data item that would be in the head of the next node:
How can we program a promise? Perl doesn't have promises, right? But it has something like them. Here's how to make a promise to compute an expression:
$promise = sub { EXPRESSION };
Perl doesn't compute the value of the expression right away; instead it constructs an anonymous function which will compute the expression and return the value when we call the function:
$value = &$promise; # Evaluate EXPRESSION
That's just what we want. When we want to promise to compute something without computing it, we'll just wrap it up in an anonymous function, and then when we want to collect on the promise, we'll call the function.
How can we tell when a value is a promise? In our simple examples, we'll just look to see if it's a reference to a function:
if (ref $something eq CODE) { # It's a promise... }
In a real project, we might do something a little more elaborate, like inventing a Promise package with Promise objects, but in this article, we'll just stick with plain vanilla CODE refs.
Here's a simple function to construct a stream node. It expects two arguments, a head and a tail. The tail argument should either be another stream, or it should be a promise to compute one. It then takes the head and the tail, puts them into an anonymous hash with h and t members, and blesses the hash into the Stream package:
package Stream; sub new { my ($package, $head, $tail) = @_; bless { h => $head, t => $tail } => $package; }
The head method to return the head of a stream is easy to implement now. We just return the h member from the hash:
sub head { $_[0]{h} }
The tail method for returning the tail of a stream is a little more complicated because it has to deal with two possibilities: If the tail of the stream is another stream, tail can return it right away. But if the tail is a promise, then the tail function must collect on the promise and compute the real tail before it can return it.
sub tail { my $tail = $_[0]{t}; if (ref $tail eq CODE) { # It's a promise $_[0]{t} = &$tail(); # Collect on the promise } $_[0]{t}; }
We should also have a notation for an empty stream, or for a stream that has run out of data, just in case we want finite streams as well as infinite ones. If a stream is empty, we'll represent it with a node that is missing the usual h and t members, and which instead has an e member, to show that it's empty. Here's a function to construct an empty stream:
sub empty { my $pack = ref(shift()) || Stream; bless {e => 'I am empty.'} => $pack; }
And here's a function that tells you whether a stream is empty or not:
sub is_empty { exists $_[0]{e} }
These functions, and all the other functions in this article, are available in http://www.plover.com/~mjd/perl/Stream/Stream.pm.
Let's see an example of how to use this. Here is a function that constructs an interesting stream: You give it a reference to a function, $f, and a number, $n, and it constructs the stream of all numbers of the form f(n), f(n+1), f(n+2), ...
sub tabulate { my $f = shift; my $n = shift; Stream->new(&$f($n), sub { &tabulate($f, $n+1) } ) }
How does it work? The first element of the stream is just f(n), which in Perl notation is &$f($n).
Rather than computing all the rest of the elements of the table (there are an infinite number of them, after all) this function promises to compute more if we want them. The promise is the
sub { &tabulate($f, $n+1) }
part; it's a function, which, if invoked, will call tabulate again, to compute all the values from $n+1 on up. Of course, it won't really compute all the values from $n+1 on up; it'll just compute f(n+1), and give back a promise to compute f(n+2) and the rest if they're needed.
Now we can do an example:
sub square { $_[0] * $[0] } $squares = &tabulate( \&square, 1);
The show utility, supplied in Stream.pm, prints out the first few elements of a stream---the first ten, if you don't say otherwise:
$squares->show; 1 4 9 16 25 36 49 64 81 100
Let's add a little debugging to tabulate so we can see better what's going on. This version of tabulate is the same as the one above, except that it prints an extra line of output just before it calls the function f:
sub tabulate { my $f = shift; my $n = shift; print STDERR "-- Computing f($n)\n"; # For debugging Stream->new(&$f($n), sub { &tabulate($f, $n+1) } ) } $squares = &tabulate( \&square, 1); -- Computing f(1) $squares->show(5); 1 -- Computing f(2) 4 -- Computing f(3) 9 -- Computing f(4) 16 -- Computing f(5) 25 -- Computing f(6) $squares->show(6); 1 4 9 16 25 36 -- Computing f(7) $squares->show(5); 1 4 9 16 25
Something interesting happened when we did show(6) up there---the stream object only called the tabulate function once, to compute the square of 7. The other 6 elements had already been computed and saved, so it didn't need to compute them again. Similarly, the second time we did show(5), the program didn't need to call tabulate at all; it had already computed and saved the first five squares and it just printed them out. Saving computed function values in this way is called memoization.
Someday, we could come along and do
$squares->show(1_000_000_000);
and the stream would compute 999,999,993 squares for us, but until we ask for them, it won't, and that saves space and time. That's called lazy evaluation.
To solve Hamming's problem, we need only one more tool, called merge. merge is a function which takes two streams of numbers in ascending order and merges them together into one stream of numbers in ascending order, eliminating duplicates. For example, merging
1 3 5 7 9 11 13 15 17 ...with
1 4 9 16 25 36 ...yields
1 3 4 5 7 9 11 13 15 16 17 19 ...
sub merge { my $s1 = shift; my $s2 = shift; return $s2 if $s1->is_empty; return $s1 if $s2->is_empty; my $h1 = $s1->head; my $h2 = $s2->head; if ($h1 > $h2) { Stream->new($h2, sub { &merge($s1, $s2->tail) }); } elsif ($h1 < $h2) { Stream->new($h1, sub { &merge($s1->tail, $s2) }); } else { # heads are equal Stream->new($h1, sub { &merge($s1->tail, $s2->tail) }); } }
Now we have enough tools to solve Hamming's problem! Here's how we'll do it. We're going to construct a stream which has the numbers we want in it. How can we do that?
We know that the first element of the Hamming sequence is 1. That's easy. The rest of the sequence is made up of multiples of 2, multiples of 3, and multiples of 5.
Let's think about the multiples of 2 for a minute. Here's the Hamming sequence, with multiples of 2 in red:
1 2 3 4 5 6 8 9 10 12 15 16 18 ...
Now here's the Hamming sequence again, with every element multiplied by 2:
2 4 6 8 10 12 16 18 20 24 30 32 36 ...
Notice how the second row of numbers contains all of the red numbers from the first row---If a number is even, and it's a Hamming number, then it's two times some other Hamming number. That means that if we had the Hamming sequence hanging around, we could multiply every number in it by 2, and that would give us all the even Hamming numbers. We could do the same thing with 3 and 5 instead of 2. By multiplying the Hamming sequence by 2, by 3, and by 5, and merging those three sequences together, we'd get a sequence that contained all the Hamming numbers that were multiples of 2, 3, and 5. That's all of them, except for 1, which we could just tack on the front. This is how we'll solve our problem.
Let's build a function that takes a stream and multiplies every element in it by a constant:
# Multiply every number in a stream `$self' by a constant factor `$n' sub scale { my $self = shift; my $n = shift; return &empty if $self->is_empty; Stream->new($self->head * $n, sub { $self->tail->scale($n) }); }
Here's the solution to the Hamming sequence problem: We use scale to scale the Hamming sequence by 2, by 3, and by 5, we merge those three streams together, and we tack a 1 on the front, and the result is the Hamming sequence:
# Construct the stream of Hamming's numbers. sub hamming { 1 my $href = \1; # Dummy reference 2 my $hamming = Stream->new( 3 1, 4 sub { &merge($$href->scale(2), 5 &merge($$href->scale(3), 6 $$href->scale(5))) }); 7 $href = \$hamming; # Reference is no longer a dummy 8 $hamming; }
Line 1 creates a reference to the scalar 1. We're not interested in this 1, but we need a reference variable around to use to refer to $hamming so that we can include it in the calls to merge. After we've defined the anonymous subroutine (lines 4--6) which uses $href, we pull a switcheroo and make $href refer to $hamming (line 7) instead of to the irrelevant 1 value.
This function works, and it's efficient:
&hamming()->show(20); 1 2 3 4 5 6 8 9 10 12 15 16 18 20 24 25 30 32 36 40
It only takes a few minutes to compute three thousand Hamming numbers, even on my dinky P75 computer.
We could make this more efficient by fixing up merge to merge three streams instead of two, but that's left as an exercise for Our Most Assiduous Reader.
The great thing about streams is that you can treat them as sources of data, and you can compute with these sources by merging and filtering data streams; these is called a data flow paradigm. If you're a Unix programmer, you're probably already familiar with the data flow paradigm, because programming with pipelines in the shell is the same thing.
Here's an example of a function, filter, that accepts one stream as an argument, filters out all the elements from it that we don't want, and returns a stream of the elements we do want---it does for streams what the Unix grep program does for pipes, or what the Perl grep function does for lists.
filters second argument is a predicate function that returns true or false depending on whether it's applied to an argument we do or don't want:
# Return a stream on only the interesting elements of $arg. sub filter { my $stream = shift; # Second argument is a predicate function that returns true # only when passed an interesting element of $stream. my $predicate = shift; # Look for next interesting element while (! $stream->is_empty && ! &$predicate($stream->head)) { $stream = $stream->tail; } # If we ran out of stream, return the empty stream. return &empty if $stream->is_empty; # Construct new stream with the interesting element at its head # and the rest of the stream, appropriately filtered, # at its tail. Stream->new($stream->head, sub { $stream->tail->filter($predicate) } ); }
Let's find perfect squares that are multiples of 5:
sub is_multiple_of_5 { $_[0] % 5 == 0 } $squares->filter(\&is_multiple_of_5)->show(6); 25 100 225 400 625 900
You could do all sorts of clever things with this:
$input->filter(sub {$_[0] =~ /PATTERN/}),the stream of input lines that matched a certain pattern.
Here's a particularly clever application: We can use filtering to compute a stream of prime numbers:
sub prime_filter { my $s = shift; my $h = $s->head; Stream->new($h, sub { $s->tail ->filter(sub { $_[0] % $h }) ->prime_filter() }); }
To use this, you apply it to the stream of integers starting at 2:
2 3 4 5 6 7 8 9 ...
The first thing it does is to pull the 2 off the front and returns that, but it also filters the tail of the stream and throws away all the elements that are divisible by 2. Then, it gets the next available element, that's 3, and returns that, and filters the rest of the stream (which was already missing the even numbers) to throw away the elements that are divisible by 3. Then it pulls the next element off the front, that's 5... and so on.
If we're going to have fun with this, we need to start it off with that stream of numbers that begins at 2:
$iota2 = &tabulate(sub {$_[0]}, 2); $iota2->show; 2 3 4 5 6 7 8 9 10 11 $primes = $iota2->prime_filter $primes->show; 2 3 5 7 11 13 17 19 23 29
This isn't the best algorithm for computing primes, but it is the oldest---it's called the Sieve of Eratosthenes and it was invented about 2,300 years ago.
Exercise for mathematically inclined readers: What's interesting about this stream:
&tabulate(sub {$_[0] * 3 + 1}, 1)->prime_filter
There are a very few basic tools that we need to make good use of streams. filter was one; it filters uninteresting elements out of a stream. Similarly, transform takes one stream and turns it into another. If you think of filter as a stream version of Perl's grep function, you should think of transform as the stream version of Perl's map function:
sub transform { my $self = shift; return &empty if $self->is_empty; my $map_function = shift; Stream->new(&$map_function($self->head), sub { $self->tail->transform($map_function) } ); }
If we'd known about transform when we wrote hamming above, we would never have built a separate scale function; instead of $s->scale(2) we might have written $s->transform(sub { $_[0] * 2 }).
$squares->transform(sub { $_[0] * 2 })->show(5) 2 8 18 32 50
We'll see a more useful use of this a little further down.
Here are a couple of very Perlish streams, presented without discussion:
# Stream of key-value pairs in a hash sub eachpair { my $hr = shift; my @pair = each %$hr; if (@pair) { Stream->new([@pair], sub {&eachpair($hr)}); } else { # There aren't any more ∅ } } # Stream of input lines from a filehandle sub input { my $fh = shift; my $line = <$fh>; if ($line eq '') { ∅ } else { Stream->new($line, sub {&input($fh)}); } } # Get first 3 lines of standard input that contain `hello' @hellos = &input(STDIN)->filter(sub {$_[0] =~ /hello/i})->take(3);
iterate takes a function and applies it to an argument, then applies the function to the result, then the the new result, and so on:
# compute n, f(n), f(f(n)), f(f(f(n))), ... sub iterate { my $f = shift; my $n = shift; Stream->new($n, sub { &iterate($f, &$f($n)) }); }
One use for iterate is to build a stream of pseudo-random numbers:
# This is the RNG from the ANSI C standard sub next_rand { int(($_[0] * 1103515245 + 12345) / 65536) % 32768 } sub rand { my $seed = shift; &iterate(\&next_rand, &next_rand($seed)); } &rand(1)->show; 16838 14666 10953 11665 7451 26316 27974 27550 31532 5572 &rand(1)->show; 16838 14666 10953 11665 7451 26316 27974 27550 31532 5572 &rand(time)->show 28034 22040 18672 28664 13341 15205 10064 17387 18320 32588 &rand(time)->show 13922 629 7230 7835 4162 23047 1022 5549 14194 25896
Some people in comp.lang.perl.misc pointed out that Perl's built-in random number generator doesn't have a good interface, because it should be seeded once, but there's no way for two modules written by different authors to agree on which one should provide the seed. Also, two or more independent modules drawing random numbers from the same source may reduce the randomness of the numbers that each of them gets. But with random numbers from streams, you can manufacture as many independent random number generators as you want, and each part of your program can have its own, and use it without interfering with the random numbers generated by other parts of your program.
Suppose you want random numbers between 1 and 10 only? Just use transform:
$rand = &rand(time)->transform(sub {$_[0] % 10 + 1}); $rand->show(20); 1 5 8 2 8 10 4 7 3 10 3 6 3 8 8 9 7 7 8 8
Of course, if we do $rand->show(20) again, we'll get exactly the same numbers. There are an infinite number of random numbers in $rand, but the first 20 are always the same. We can get to some fresh elements with drop:
$rand = $rand->drop(10);
This is such a common operation, that we have a shorthand for it:
$rand->discard(10);
We can also use iterate to investigate the hailstone numbers, which star in a famous unsolved mathematical problem, the Collatz conjecture. The hailstone question is this: Start with any number, say n. If n is odd, multiply it by 3 and add 1; if it's even, divide it by 2. Repeat forever. Depending on where you start, one of three things will happen:
The unsolved question is: Are there any numbers that don't fall into the 4-2-1 loop?
# Next number in hailstone sequence sub next_hail { my $n = shift; ($n % 2 == 0) ? $n/2 : 3*$n + 1; } # Hailstone sequence starting with $n sub hailstones { my $n = shift; &iterate(\&next_hail, $n); } &hailstones(15)->show(23); 15 46 23 70 35 106 53 160 80 40 20 10 5 16 8 4 2 1 4 2 1 4 2
iterate_chop takes the infinite stream produced by iterate, and chops off the tail before the sequence starts to repeat itself.
&hailstones(15)->iterate_chop->show(ALL); 15 46 23 70 35 106 53 160 80 40 20 10 5 16 8 4 2
By counting the length of the resulting stream, we can see how long it took the hailstone sequence to start repeating:
print &hailstones(15)->iterate_chop->length; 17
Of course, you need to be careful not to ask for the length of an infinite stream!
Clearly, you could solve these same problems without streams, but oftentimes it's simpler to express your problem in terms of filtering and merging of data streams, as it was with Hamming's problem. With streams, you get a convenient notation for powerful data flow ideas, and you can apply your experience in programming Unix shell pipelines.
The implementation of streams in Stream.pm is wasteful of space and time, because it uses an entire two-element hash to store each element of the stream, and because finding the n'th element of a stream requires following a chain of n references. A better implementation would cache all the memoized stream elements in a single array where they could be accessed conveniently. Our Most assiduous Reader might like to construct such an implementation.
A better programming interface for streams would be to tie the Stream package to a list with the tie function, so that the stream could be treated like a regular Perl array. Unfortunately, as the man page says:
WARNING: Tied arrays are incomplete.
ML for the Working Programmer, L.C. Paulson, Cambridge University Press, 1991, pp. 166--185.
Structure and Interpretation of Computer Programs, Harold Abelson and Gerald Jay Sussman, MIT Press, 1985, pp. 242--286.
Return to: Universe of Discourse main page | What's new page | Perl Paraphernalia| Infinite Lists