How Unix pipes work
A few years ago I gave a series of talks on how Unix works, discussing
Unix's concepts of the file and the process. In the process talk I
implemented some Unix commands in Perl, as examples, and one of the
largest of these examples was a shell. The shell examples started
very simple, and then added more shell features: first file
redirections, then built-in commands like cd , and eventually pipes.
Unfortunately the talk wasn't long enough to explain how the pipes
worked, so I'm going to do it now.
What is a pipe?
In the shell, we write something like
ls | rev
and this runs the ls and rev commands, and arranges that the
output of ls goes into the input of rev . The output-input
redirection passes through a kernel construct called a pipe.
At the bottom, the pipe is nothing more than a buffer in kernel
memory, typically 64 kilobytes. (On older systems the buffer was 8 or
even 4 kilobytes, but for the rest of this article we'll assume 64.)
The buffer can be read from or written to. When the pipe is created,
the kernel allocates two open file pointers for it, one for reading
and one for writing, and these file pointers become the access points
for processes to store data into the buffer and retrieve the stored
data again.
In Perl, it looks like this:
my ($rd, $wr);
pipe($rd, $wr) or die "pipe: $!";
syswrite $wr, "I like pie\n";
my $line = <$rd>;
print ">> $line";
The output is:
>> I like pie\n
The pipe call allocates the buffer and the two filehandles, which
are stored in $rd (for reading from the buffer) and $wr (for
writing to the buffer). These handles are traditionally called the
“reading end” and the “writing end” of the pipe.
I'm using syswrite to write "I like pie\n" into the pipe, rather
than print , because data printed with print is buffered by the
standard I/O library and would be an unnecessary confusion in this
example program. (For more details, see Suffering from
Buffering?.) Skipping
the standard I/O buffering will expose the kernel's basic behavior and
make it easier to observe. Once the data is in the pipe, we can use
the regular Perl <…> operator to read it back out again. (Later
we'll switch from <…> to sysread to get rid of the standard I/O
library here also.)
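To see the kind of confusion I mean, here's the same example with
print in place of syswrite. (This sketch assumes Perl's usual
behavior of fully buffering output to anything that isn't a
terminal.)
my ($rd, $wr);
pipe($rd, $wr) or die "pipe: $!";
print $wr "I like pie\n";   # buffered by the standard I/O library
my $line = <$rd>;           # hangs: the eleven bytes are still in the
                            # stdio buffer, not yet in the pipe
Nothing was ever actually written into the pipe, so the read blocks
forever.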
The original example is not especially useful, because we could just as easily have
used $line = "I like pie\n" and left the kernel out of the
procedure. But the advantage of the pipe is that it can be shared
among several processes, which can then use it for interprocess
communication. We will see this shortly.
Pipes are just buffers
But before we do, please take note of two important points. First:
bytes are read out of the pipe in the order they went in. The I was
first in, and it was first to be read out again; the jargon for this
is that pipes are FIFO (“first-in first-out”), and in some contexts
they are even called FIFOs.
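Here's a minimal demonstration of the first point: the line written
first is the line read first.
my ($rd, $wr);
pipe($rd, $wr) or die "pipe: $!";
syswrite $wr, "first\n";
syswrite $wr, "second\n";
print scalar <$rd>;    # prints "first"  -- it went in first
print scalar <$rd>;    # prints "second"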
The second point is more subtle: Pipes are nothing but byte buffers.
Any structure on the messages written into them must be imposed by the
application.
Here's an example just like the previous one, but with one more write:
my ($rd, $wr);
pipe($rd, $wr) or die "pipe: $!";
syswrite $wr, "I like pie...";
syswrite $wr, "Especially raspberry\n";
my $line = <$rd>;
print ">> $line";
The output is:
>> I like pie...Especially raspberry\n
Here we wrote two messages into the pipe. Will the <$rd> extract
the first message separately? No. The Perl <…> operator always
reads characters up to the next newline (or whatever $/ is set to).
Here it reads all the way up to the newline after the word
raspberry .
We can see the lack of structure more clearly if we use Perl's
sysread operator, which reads a fixed number of bytes:
my ($rd, $wr);
pipe($rd, $wr) or die "pipe: $!";
syswrite $wr, "I like pie...";
syswrite $wr, "Especially raspberry!";
my $bytes;
while (sysread $rd, $bytes, 4) {
    print ">> '$bytes'\n";
}
Now the output is:
>> 'I li'
>> 'ke p'
>> 'ie..'
>> '.Esp'
>> 'ecia'
>> 'lly '
>> 'rasp'
>> 'berr'
>> 'y!'
and then it hangs. (We'll see why it hangs in the next section.)
Unix is happy to give our reader four bytes at a time, even though
the bytes were written in groups of 13 and 21. Or at least, it's
happy to do that up until there are only two bytes left; then the
program asks for 4 but gets only 2. And the read after that one
hangs. Why?
Semantics of reading pipes
At the bottommost level, you read a pipe with the Unix
read call, which corresponds approximately with Perl's sysread
function. (Perl's read and <…> introduce the standard I/O
library, which is an additional complication we'll consider later.)
At the C level, the call looks like this:
int fd;
char buffer[65536];
size_t bytes_to_read;
int bytes_read = read(fd, buffer, bytes_to_read);
The fd variable is a file descriptor, which is the kernel's
low-level version of a filehandle; it is simply an integer that
identifies what to read from. The buffer tells the kernel where to
store the data once it's read. And bytes_to_read is a non-negative
integer that tells the kernel how many bytes we want to read.
The short description of what this does is: it reads at most
bytes_to_read out of the pipe and stores the data in the buffer;
then it returns the number of bytes it actually read. But there are a
number of fine points and exceptions:
- An error might occur. For example, the caller might have
provided a bad file descriptor or buffer pointer. In this case
read returns -1 and sets the kernel error indicator, errno,
to indicate what the problem was. In Perl, errno shows
up in the special variable $!.
- (Case 1) If there are at least bytes_to_read bytes in the pipe, then
that is how many are read and stored into the buffer, and that is
the number returned. The buffer had better be big enough to hold
the data, or the kernel will cheerfully overwrite the process’s
memory with the extra!
- (Case 2) If there is at least one byte in the pipe, but fewer than
bytes_to_read, all of them are read, and their number is returned.
- (Case 3) However, if there are no bytes in the pipe, then:
- (Case 3a) If the writing end of the pipe has been closed, the request
returns 0; the process should interpret this as an end-of-file
condition.
- (Case 3b) If the writing end of the pipe is still open, the request
blocks: the kernel puts the process to sleep until data
becomes available or the writing end is closed.
- (Exception for advanced users: the file descriptor can be
marked non-blocking, in which case the read
call never blocks; instead the blocking call turns into an
error: it immediately returns -1 and sets errno to
EWOULDBLOCK (“Operation would block” or sometimes
“Resource temporarily unavailable”). There's a sketch of this
just after the list.)
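Here's the promised sketch of the non-blocking exception. The fcntl
calls are standard POSIX; note that on most systems EWOULDBLOCK and
EAGAIN are the same error, so the test below uses EAGAIN:
use Fcntl;              # for F_GETFL, F_SETFL, O_NONBLOCK
use POSIX qw(EAGAIN);

my ($rd, $wr);
pipe($rd, $wr) or die "pipe: $!";

# Mark the reading end non-blocking.
my $flags = fcntl($rd, F_GETFL, 0);
fcntl($rd, F_SETFL, $flags | O_NONBLOCK) or die "fcntl: $!";

# The pipe is empty and the writing end is still open, so a plain
# read would block; this one returns an error immediately instead.
my $buf;
my $n = sysread $rd, $buf, 4;
if (!defined $n && $! == EAGAIN) {
    print "the read would have blocked\n";
}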
Now we can see why the last example program hung. There were 34 bytes
in the pipe. The program issued read calls with bytes_to_read set
to 4, and the first eight such calls read 4 bytes each, returning the
number 4 each time. (That's case 1.) The ninth read found only 2 bytes in the pipe
and read them, returning 2. (That's case 2.) And the tenth read found no bytes in
the pipe. But the program still had the writing end open, so the call
blocked. (That's case 3b.) And since no other process held the writing end of the pipe,
the process could never wake up!
We can fix this:
my ($rd, $wr);
pipe($rd, $wr) or die "pipe: $!";
syswrite $wr, "I like pie...";
syswrite $wr, "Especially raspberry!";
close $wr;
my $bytes;
while (sysread $rd, $bytes, 4) {
    print ">> '$bytes'\n";
}
Now after the program prints >> 'y!' it exits. Why?
After the process has done its writing, it closes the writing end of
the pipe. Now the reading goes as before, up through the ninth read
of y! . The tenth read finds an empty pipe as before. But this time
the writing end is closed, so instead of blocking the read call
immediately returns 0. (This time it's case 3a instead of 3b.) Perl
passes the 0 value into the script as the value returned by sysread ,
which terminates the while loop and ends the program.
Interprocess communication with pipes
Having the same process be both the reader and the writer is a little
strange and not very useful. It's also tricky to pull off correctly,
because pipes were not really designed to be used in this way. The
normal use case is that one process reads and another writes. To do
that, one process needs to hold the writing end of the pipe and the
other needs to hold the reading end.
Typically the way this is done is as follows. One process creates the
pipe, and then forks a child process. File descriptors are inherited
from parent to child after a fork, so both processes have both
ends of the pipe. Let's consider a typical scenario, where
the parent runs a command and wants to read its output and then
continue.
The child will be the writer and
the parent will be the reader, so
the child closes the reading end of the pipe, and
the parent closes the writing end. The child then uses the writing
end to write data into the pipe; the parent uses the reading end to
read the data back out.
If the parent gets ahead of the child, it tries to read the empty
pipe, and blocks until the child writes more data.
Eventually, the child has nothing more to say and closes the writing
end of the pipe, either with an explicit close call or more likely by exiting.
After the parent reads the remaining data, the pipe is empty. Its next
read returns 0, signalling end-of-file, and it can close the reading
end and proceed as
appropriate.
Here's complete code for a demo:
my ($rd, $wr);
pipe($rd, $wr) or die "pipe: $!";
my $pid = fork();
die "Couldn't fork: $!" unless defined $pid;
if ($pid == 0) { # child (writer) process
    close $rd;
    print $wr "abcdefghijklmnopqrstuvwxyz\n";
} else { # parent (reader) process
    close $wr;
    my $buf;
    my $line = 0;
    while (sysread $rd, $buf, 1) {
        print "Reader: ", ++$line, ": $buf\n";
    }
    print "Reader: End of file\n";
}
The process forks and the two resulting processes take different paths
through the if -else block. The child process takes the if
part, closing the reading end of the pipe, writing data into the pipe,
and exiting immediately afterward. The written data is safe in the
kernel and will survive the death of the process that wrote it.
The parent takes the else clause. It closes the writing end of the
pipe to avoid a deadlock just like the one we saw in the previous
section, and then loops on sysread as before, reading one character
at a time. It transforms the input that the child wrote, prints out
the result, and then exits.
It is quite simple to have the parent be the writer and the child the
reader; just change the if ($pid == 0) test to if ($pid != 0).
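For concreteness, here's a sketch of that swapped version, with an
explicit close added so the child sees end-of-file promptly, and a
wait to collect the child afterward:
my ($rd, $wr);
pipe($rd, $wr) or die "pipe: $!";
my $pid = fork();
die "Couldn't fork: $!" unless defined $pid;
if ($pid != 0) { # parent (writer) process
    close $rd;
    print $wr "abcdefghijklmnopqrstuvwxyz\n";
    close $wr;   # without this the child could block forever
    wait;        # collect the child's exit status
} else { # child (reader) process
    close $wr;
    my $buf;
    my $line = 0;
    while (sysread $rd, $buf, 1) {
        print "Reader: ", ++$line, ": $buf\n";
    }
    print "Reader: End of file\n";
}
The explicit close $wr matters here: the parent goes on living after
the print, and as long as it holds the writing end open the child's
final sysread can never return 0.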
Attaching the pipe to a command
The example of the previous section isn't quite typical: what's the
point of forking a process and creating a pipe just to get the string
abcdefghijklmnopqrstuvwxyz\n ? Instead, we'll have the child run the
ls -l command so that the parent gets the command output.
The ls command writes to standard output, which is inherited from
the parent and is attached to the terminal. We want to arrange that
the child's standard output is attached to the writing end of the pipe
instead. Then when ls runs it will write into the pipe.
my ($rd, $wr);
pipe($rd, $wr) or die "pipe: $!";
my $pid = fork();
die "Couldn't fork: $!" unless defined $pid;
if ($pid == 0) { # child (writer) process
    close $rd;
    my $fd = fileno($wr);
    open STDOUT, ">&=$fd"
        or die "Couldn't dup pipe descriptor: $!";
    exec "ls", "-l";
    die "Couldn't exec ls: $!";
} else { # parent (reader) process
    close $wr;
    my $buf;
    my $line = 0;
    while (sysread $rd, $buf, 1) {
        print "Reader: ", ++$line, ": $buf\n";
    }
    print "Reader: End of file\n";
}
There's a lot of Perl weirdness here; oddly, the code is simpler in C!
The child process needs to attach the writing end of the pipe, $wr ,
to its standard output. The way it does this is by obtaining the file
descriptor number of the writing end with fileno and then using this
number in the odd-looking “filename” >&=$fd in the open call. If that
succeeds, it runs the ls command with exec . A successful exec
does not return—or rather, it returns inside ls rather than inside
our Perl script—and ls takes over from there, writing its usual data
into the pipe for the parent to read.
In C, this is somewhat less weird-looking (some details, such as
error checking, are omitted):
int fd[2];
pipe(fd);   /* fd[0] is the reading end; fd[1] is the writing end */
int rd = fd[0], wr = fd[1];
...
if (pid == 0) { /* child (writer) */
    close(rd);
    dup2(wr, 1); /* stdout is always descriptor 1 */
    execlp("ls", "ls", "-l", (char *) 0);
} else { /* parent (reader) */
    ...
}
The dup2 call is doing the heavy lifting for Perl's bizarre
open STDOUT, ">&=$fd" thing. It says to take whatever is attached
to file descriptor wr and attach it to file descriptor 1 also.
Standard output is file descriptor 1 by definition, because commands like
ls are written to write to file descriptor 1, whatever it is
attached to. (Descriptor 0 is standard input, and descriptor 2 is
standard error.)
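Incidentally, Perl can call dup2 directly through the POSIX module,
which makes the child's setup look more like the C version. Here's a
sketch of a possible replacement for the fileno/open dance in the
child above:
use POSIX ();

# In the child: make descriptor 1 (standard output) another name
# for the writing end of the pipe, then run the command.
defined POSIX::dup2(fileno($wr), 1) or die "dup2: $!";
exec "ls", "-l";
die "Couldn't exec ls: $!";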
Semantics of writing pipes
In a C program, pipes are written to with the kernel's write call,
and every function you use in Perl, including print , syswrite ,
say , and so forth, eventually turns into write down in
the kernel. The call looks just like the read call that was used
for reading:
int fd;
char buffer[65536];
size_t bytes_to_write;
int bytes_written = write(fd, buffer, bytes_to_write);
Again, the fd variable is a file descriptor, the buffer tells the
kernel where the data is coming from, and bytes_to_write says how
many bytes to copy out of the buffer.
The basic semantics are almost exactly opposite to those of read : a
successful call copies at most bytes_to_write bytes out of the
buffer and puts them into the pipe. But again there are a number of
fine points and exceptions:
- An error might occur. The behavior is the same as for read:
the call returns -1 and sets errno.
- (Case 1) If the reading end of the pipe has been closed, the
kernel sends the process the SIGPIPE signal, which normally
kills it instantly. (A sketch of this case follows the list.)
- However, the process can arrange beforehand to
catch the signal, in which case its signal handler function is called
instead.
- Or it can ignore the signal, in which case the write
call returns -1 and errno is set to EPIPE (“broken
pipe”).
- (Case 2) If the reading end of the pipe is still open, all the data is
copied from the buffer into the pipe, and
bytes_to_write is returned. In contrast to the
read case, partial writes don't happen.
What if there is not enough room in the pipe for
bytes_to_write bytes? For example, what if the
buffer is so big that it can't fit into the pipe all at once?
Then the write call blocks until all the data has been
written and still returns bytes_to_write. That is,
the process goes to sleep while the write is in
progress. The kernel copies as much data as will fit, and the
process stays asleep. When some space is freed up in the pipe by
data being read out, the kernel copies more data into the pipe. The
writing process does not wake up until the last byte is copied,
whereupon the write call finally returns
bytes_to_write.
- (Exception for advanced users: sometimes partial writes do
happen; for example if a signal arrives in the middle of a long
write.)
- (Or the writing end of the pipe can be marked non-blocking,
which converts the block into an immediate error result, just
as in the read case.)
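Here's the promised sketch of case 1. We ignore the signal so that
the failed write shows up as an EPIPE error instead of killing the
process:
use POSIX qw(EPIPE);

my ($rd, $wr);
pipe($rd, $wr) or die "pipe: $!";
close $rd;                    # now nobody holds a reading end

$SIG{PIPE} = 'IGNORE';        # otherwise the write would kill us
my $n = syswrite $wr, "is anyone listening?\n";
if (!defined $n && $! == EPIPE) {
    print "broken pipe\n";
}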
The upshot of all this is that you can attach the standard output of
ls -l to your process’s pipe, and then you can ignore it. It will
do its thing and write whenever it finds it convenient. If there is
room in the pipe, it will continue writing until the pipe is full,
whereupon it will go to sleep (case 2) and wait for your process to
empty the pipe again. When it exits, the writing end will close, and
your process will read the rest of the data and then get an
end-of-file indication. If your process dies prematurely, then the
next time ls tries to write to the pipe the kernel will kill it
(case 1).
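Finally, if you want to watch a pipe fill up, a sketch like this one
marks the writing end non-blocking and counts how many bytes fit
before a write would have blocked. (The exact figure depends on the
system; 65536 is typical of modern Linux.)
use Fcntl;   # for F_GETFL, F_SETFL, O_NONBLOCK

my ($rd, $wr);
pipe($rd, $wr) or die "pipe: $!";

# Mark the writing end non-blocking, so that instead of sleeping
# when the pipe fills, the write fails immediately.
my $flags = fcntl($wr, F_GETFL, 0);
fcntl($wr, F_SETFL, $flags | O_NONBLOCK) or die "fcntl: $!";

my $total = 0;
while (defined(my $n = syswrite $wr, "x" x 4096)) {
    $total += $n;
}
print "the pipe held $total bytes\n";   # typically 65536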