Nimble Computer Corporation

16231 Meadow Ridge Way, Encino, CA 91436 (818) 501-4956 (818) 986-1360 (FAX)

Copyright (c) 1993 by Nimble Computer Corporation

This material is based upon work supported by the National Science Foundation under Grant No. III-9261682.

Girard's linear logic can be used to model programming languages in
which each bound variable name has exactly one "occurrence"--i.e., no
variable can have implicit "fan-out"; multiple uses require explicit
duplication. Among other nice properties, "linear" languages need no
garbage collector, yet have no dangling reference problems. We show a
natural equivalence between a "linear" programming language and a
stack machine in which the top items can undergo arbitrary
permutations. Such permutation stack machines can be considered
combinator abstractions of Moore's *Forth* programming
language.

For 40 years, programmers have tried to utilize mathematical
expressions to program computers. Indeed, "FORTRAN" is a contraction
of "FORmula TRANslator", because Fortran's selling point was its close
relationship to mathematical formulae. For example, one solution of a
quadratic equation x=(-b+sqrt(b^2-4ac)/2a in Fortran is
`X=(-B+SQRT(B**2-4*A*C))/(2*A)`

.

Although this one-dimensional form is not as pretty as the mathematician's two-dimensional form, the relationship is clear, and user can look forward to Fortran-0X, which we expect will utilize 2-D input notation pioneered (on punched cards!) by NASA's HAL/S language [Ryer78].

Unfortunately, after 40 years of hard work, computer scientists have not been that successful at efficiently exploiting this mathematical expression metaphor. Furthermore, the single-minded pursuit of this goal has blinded us to the reason for pursuing it in the first place--to find a simple, easy-to-understand metaphor that can be used to efficiently program a variety of applications on a variety of computers.

Fortran has been quite naive about the nature of mathematical expressions, and Fortran novitiates learn to "put that mathematical metaphor up on the shelf next to the Easter Bunny", and use a more implementation-oriented model which includes "storage maps", "by-reference parameters", and the like.

*Functional* programming languages [Backus78] [Hudak89]
[Hughes89] have gone the furthest with the mathematical expression
metaphor, and have been extended to cover many programming tasks.
Unfortunately, functional programs have been considered "inefficient",
and their higher-order functions, "lazy evaluation" and
"multiple-return continuations" leave many cold.

The "standard", non-functional programming languages like Fortran,
Ada, and C are the bastard progeny of the coupling between a
pseudo-mathematical notation and a von Neumann-style random access
memory (RAM). *There has never been a simple or particularly
effective mathematical model to aid in the deep understanding or
compiling of programs in these languages.* A modern optimizing
Fortran compiler is still a baroque technical *tour de force,*
yet it has difficulty understanding simple Fortran programs. The real
Achilles' heel for these languages, however, is their innate inability
to deal with parallel or distributed computation. Neither the
mathematical expression metaphor nor the von Neumann RAM is up to this
task.

"The Emperor has no clothes", you say. "Cut the Gordion knot", you say. "This many smart computer scientists can't be wrong," we say, so the problem--as posed--must be essentially insoluble. In classical logical fashion, computer science has achieved a proof by contradiction. It has proposed the equivalence "programs = mathematical expressions", and has derived a contradiction. Perhaps it is time to move on to the next theorem.

*Simula* [Dahl66] started a revolution in computer languages
which is not yet finished. It proposed the equivalence "programs =
physical objects with behavior", but forgot to throw away the previous
equivalence. *Smalltalk* [Goldberg83] picked up the ball, but
the mathematical expressionists then converted the metaphor of
physical objects into polymorphic mathematical type systems--i.e., C++
[Stroustrup86]. The elegance of the mechanical metaphor was thus
buried underneath mathematical mysticism.

*Linear logic* [Girard87] can be viewed as the latest
attempt to bring back the physical object metaphor, but stripped of
its polymorphic pretensions. For the first time in 50 years of
computer science, a metaphor of programming has been proposed that
most people can relate to--objects have true identity, and *objects
are conserved.* As in the real world, *an object cannot be
copied or destroyed* without first filling out a lot of forms, but
on the other hand, *the transmission of objects is relatively
painless.* An object is localized in space, and can move from
place to place. (Only computer-literate people must be told that the
transmission of these objects does *not* create copies.)
Linear logic finally makes precise the high school notion that a
function is like a box into which one puts argument values and
receives result values, and that a truly "functional" box does not
remember its previous arguments or results.

The mathematical expression metaphor must be sacrificed to make way for this linear/conservative object-oriented metaphor, which--as we have seen--is not a great loss, as computer programs today deal with non-mathematical objects most of the time. [footnote 1] The few remaining mathematical expressions--e.g., the quadratic formula--can be computed, but require a more process-oriented description. In such a description, we must first make copies of b and a, since they are both used twice, and then we can compute the result. The requirement for making explicit copies of b and a is obvious to any 12-year-old, but computer scientists have spent 40 years eliminating this one trivial task while greatly increasing the costs for everything else.

A linear logic language achieves its elegance through a starkly
simple rule--a bound variable name can be "used" only once. Thus,
variable *reads are destructive* and hence variables are
"read-once". Any attempt to use a name twice (or not at all) is
flagged by the compiler. A use of the variable name as an argument in
a function call means that the object referred to by the name has been
given to the function. Unless the function returns the object as one
of its results, the object is gone, and cannot be referenced by the
caller. Therefore, many operations on linear objects have the policy
of returning them when they are done--e.g., the `length`

function returns the length of a list as well as the list itself. A
function like "+", which accepts two values and returns their sum, can
be thought of as "consuming" its argument values and constructing a
result value. Not only is this metaphor appealing, but it can be very
efficient in practise--e.g., in multiple-precision arithmetic, the
storage utilized by the arguments to "+" can be reused to construct
the result.

A linear logic computer language avoids the need for garbage collection by explicit construction and destruction of objects. The beloved "constructors" and "destructors" of C++ can therefore be used, as before. However, the "dangling reference problem" cannot happen in a linear language because 1) the only name occurrence for an object is used to invoke its destructor, and 2) the destructor doesn't return the object. Furthermore, if an object does not wish to be copied, the programmer cannot obtain a copy; thus, linearity achieves the goals that "limited" types in the Ada programming language [AdaLRM83] [Baker91SP] [Baker93Steal] sought in vain.

A linear logic language avoids the need for most synchronization
because there is only one "path" to an object at any given time, and
hence only one operation can be performed on the object at a time.
The calling program cannot even talk about the object while an
operation on it is being performed, but must wait for the object to be
returned. On the other hand, complex subexpressions can always be
computed in parallel, because there is no possibility of read or write
conflict between sub-expressions (linearity allows the environment to
be *factored* into disjoint factors that bind the names in each
subexpression.). A linear language thus makes explicit any
"space-time" tradeoff: space can be minimized by always working with
only one copy of an object, and performing each operation serially,
while time can be minimized by duplicating the object at some cost in
space.

Linear Lisp is a "linear" style of Lisp--i.e., every bound name is
referenced exactly once. Thus, each function parameter occurs just
once, as do the names introduced via other binding constructs--e.g.,
`let`

, `let*`

, etc.

Linear Lisp requires the programmer to make explicit any
duplication and deletion, but he is repaid through better error
checking during compilation and better utilization of resources (time,
space) at run-time. Furthermore, distributed and/or parallel
execution becomes quite palatable in *Linear* Lisp.

We demonstrate the natural mapping of Linear Lisp onto a
permutation stack machine with some examples. The
`identity`

function is already linear:

(defun identity (x) x)

Linear Lisp is implemented with a stack on top of which all
argument values are placed and all result values are returned. This
stack is "unframed"--i.e., there is no Algol-like frame pointer. The
elimination of the frame pointer simplifies the model and eliminates
the cost of building and disassembling stack frames. To call the
`identity`

function, an argument is placed on the top of
the stack, the `identity`

function is called, it
*moves* its argument to the top of the stack as its result, and
the function then returns. Of course, as the argument is already on
the top of the stack, no movement is required, so there is *no*
code "`[]`

" actually needed for the `identity`

function.

We now consider a function which always returns the constant 5:

(defun five (x) x 5) ; x is explicitly killed.

`five`

is called with its argument on the top of the
stack, but it must be destroyed before 5 can be pushed onto the stack.
The argument is destroyed by a `drop`

operation, which pops
the top item and destroys it. The primitive constant 5 is then copied
and pushed to finish the `five`

function: ```
[drop
'5]
```

.

`square`

requires *two* uses of its argument. A
second copy can be obtained by use of the `dup`

operation,
which accepts one argument and returns *two* values--i.e., two
copies of its argument. The `square`

function follows:

(defun square (x)
(let* ((x x-prime (dup x))) ; use Dylan syntax [Shalit92].
(* x x-prime)))

Although the source code for `square`

is a bit long, its
implementation on our stack machine is very efficient. One argument
is passed to `square`

on the stack, `dup`

replaces this value by two copies on the top of the stack, and * then
multiplies them together.
[footnote 2]
The code for `square`

is thus "`[dup *]`

".

The quadratic equation function is as follows:

(defun quadratic (a b c) ; stack is now ...abc
(let* ((a a-prime (dup a)) (b b-prime (dup b))) ; [footnote 3]
(/ (+ (- b-prime) (sqrt (- (square b) (* 4 (* a c)))))
(* 2 a-prime))))

`quadratic`

requires variable *movement* on the
stack. We must duplicate a, but a is not on top of the stack.
Therefore, we first permute the top three items of the stack to get a
on top, and then duplicate a. The stack then looks like ...bcaa'. We
must then duplicate b, but b is not on top of the stack. Therefore,
we permute the top *four* items to get b on top, and then
duplicate b. The stack then looks like ...caa'bb'. The next
operation negates b', which we arranged to be on the top of the stack.
The next operation after that permutes b to the top, so that
`square`

can be called. We continue in this fashion,
alternately permuting the stack and calling functions. The code for
the quadratic function is therefore: ```
[roll3 dup roll4 dup neg
roll2 square '4 roll5 roll6 * * - sqrt + '2 roll3 * /]
```

. We can
see in operation the basic variable accessing mechanism
`roll<n>`

, which permutes the top n items of the
stack by moving the n'th item to the top and pushing the other n-1
items down by one. A compiler can easily keep track of where
everything is.

(The reader may feel that permuting an item to the top of the stack
simply for the purpose of copying it is wasteful, as nearly all stack
machines have the ability to *copy* an item directly to the
top. We believe this intuition to be incorrect. First, while it may
be easy to copy objects like small integers from deep in the stack, it
is unlikely that the duplication operation for an arbitrary precision
integer or an abstract data type is a primitive operation, and when a
non-primitive duplication function must be called, the argument and
results should be on top of the stack. Second, a key assumption of
linear languages is that permuting items to the top of the stack is
relatively inexpensive (relative to copying or destroying them), and
hence the percolation of the second copy *down* after the
`dup`

will happen *automatically*, as a side-effect
of rolling the other items *up* to the top of the stack.)

Every iterative/recursive function requires some sort of
conditional execution. Conditional expressions--e.g.,
`if`

-expressions--in Linear Lisp require sophistication
beyond the simple linear rules we have considered so far. Since only
one "arm" of the conditional can be executed at any given time, we
relax the "one-occurrence" linearity condition to allow a reference in
both arms, so long as they are not executed concurrently or
speculatively. There is an occurrence in one arm *if and only
if* there is an occurrence in the other.

The proper treatment of the boolean expression part of an
`if`

-expression requires more sophistication. Strict
linearity requires that any name used in the boolean part be counted
as an occurrence. However, many predicates are "shallow", in that
they examine only a small (i.e., shallow) portion of their arguments
(e.g., `null`

, `zerop`

), and therefore a relaxed
policy is required. We have not settled on the best syntax to solve
this problem, and currently use several `if`

-like
expressions: `if-atom`

, `if-null`

,
`if-zerop`

, etc. These `if`

-like expressions
require that the boolean part be a simple name, which does not count
towards the "occur-once" linearity condition. This modified rule
allows a shallow condition to be tested, and the name to be reused
within the arms of the conditional.
[footnote 4]
The predicate could alternatively return *two* values: the
truth value (on top of the stack) and its unmodified argument(s).

The absolute value function `abs`

demonstrates
conditional execution:

(defun abs (x)
(let* ((x truth (minusp2 x))) ; leave x & truth value.
(if truth (- x) x)))

A more perspicuous version is:

(defun abs (x)
(if-minusp x (- x) x)) ; check x, but don't consume it.

The code for `abs`

is simply ```
[minusp2 [neg] []
ifelse]
```

.

To use Lisp-like cons cells with their car and cdr components, we
need a mechanism to *linearly* extract these components, since
any use of `(car x)`

precludes the use of
`(cdr x)`

, and vice versa, due to the linearity of x. We
therefore utilize a function `carcdr`

, which takes a cons
cell and returns both components. The `carcdr`

function is
the stack machine *inverse* of `cons`

. Using these
constructs, we can program the `append`

function:

(defun append (x y) ; concatenate lists x and y.
(if-atom x (progn x y) ; examine x, kill x, return y.
(let* ((carx cdrx (carcdr x)))
(cons carx (append cdrx y)))))

`append`

is ```
[roll2 atom2 [drop] [carcdr roll3
append cons] ifelse]
```

.

Decomposing list structure is common in Linear Lisp, that we
provide a "destructuring let" form `dlet*`

in which
patterns with variable names are matched/unified against a value.
Since `dlet*`

is linear, `dlet*`

patterns
*consume* the portions of the values that they match--i.e., the
parts that are not bound to new names in the process of matching. Of
course, linearity requires that a pattern bind a name only once. We
can also use a *backquote* list *composing* syntax
[Steele90].
A prettier version of `append`

follows:

(defun append (x y)
(if-atom x (progn x y)
(dlet* (((carx . cdrx) x))
`(,carx ,@(append cdrx y)))))

In these `append`

functions, we have swept something
important under the rug--the code for `append`

is itself
non-linear, because it references itself in its own body! This is a
problem with all iterative and recursive constructs, and we have the
same solution for this problem that the lambda calculus has--the Y
combinator (and various optimizations of it), which does "lazy"
duplication. In other words, we abstract the `append`

call
out of its own code, and pass it to the kernel of `append`

as another argument (the `funcall`

special form takes its
first argument from the *top* of the stack):

(defun append-kernel (x y f) ; f will be append-kernel, itself
(if-atom x (progn x f y)
(let* ((carx cdrx (carcdr x)) (f f-prime (dup f)))
(cons carx (funcall f-prime cdrx y f)))))

The code for `append-kernel`

is: ```
[roll3 atom2
[drop drop] [carcdr roll4-2 dup funcall-3 check-1 cons]
ifelse]
```

. The primitive `roll4-2`

rolls/rotates the
top 4 items of the stack by 2, and is equivalent to ```
[roll4
roll4]
```

.

We note that the sequence `[funcall-3 check-1]`

calls a
function with 3 arguments and 1 result. The information about the
number of arguments and results is used to check for the case of a
function which is called with the wrong number of arguments or
results. The implementation of this may be a separate implicitly
addressed register which keeps track of the number of arguments passed
on a call and the number of results on a return. Such numbers could
easily be passed on the stack, but this would be inefficient for the
vast majority of calls which have fixed numbers of arguments and
returned values. To check for 3 arguments to
`append-kernel`

, use `check-3`

.

(defun append (x y)
(let* ((f f-prime (dup #'append-kernel)))
(funcall f-prime x y f)))

The code for `append`

itself is now:
`[#'append-kernel dup funcall-3]`

.

Using these same ideas, the program for `factorial`

is:

(defun fact (n f)
(if-zerop n (progn f (1+ n))
(let* ((n n-prime (dup n)) (f f-prime (dup f)))
(* n (funcall f-prime (1- n-prime) f)))))
(defun factorial (n)
(let* ((f f-prime (dup #'fact)))
(funcall f-prime n f)))

The code for `fact`

: ```
[roll2 zerop2 [roll2 drop 1+]
[dup roll3 dup roll3 1- roll3-2 funcall-2 *] ifelse]
```

. The code
for `factorial`

is `[#'fact dup funcall-2]`

.

We now examine an "iterative" factorial function. An
*iterative* factorial carries all of its "state" in its
arguments, and can therefore make all recursive calls
"tail-recursive". Tail-recursive calls are interesting, because they
do not require unbounded amounts of either argument stack or return
stack space. Furthermore, the complexity of a tail-recursive loop
function is independent of the number of its arguments. Below is a
linear iterative factorial function.

(defun ifact1 (r n f)
(if-zerop n (progn n f r)
(let* ((n n-prime (dup n)) (r (* n-prime r))
(n (1- n)) (f f-prime (dup f)))
(funcall f-prime r n f))))
(defun ifactorial (n)
(let* ((f f-prime (dup #'ifact1)))
(funcall f-prime 1 n f)))

The code for `ifact1`

: ```
[roll2 zerop2 [drop drop]
[dup roll4 * roll2 1- roll3 dup funcall-3] ifelse]
```

.
`ifactorial`

is ```
['1 roll2 #'ifact1 dup
funcall-3]
```

.

This iterative factorial is interesting because 1) its
tail-recursive property can be trivially recognized by syntactic
means, and 2) it can be implemented in a particularly efficient
manner. In fact, since `iterative-fact1`

references no
"free variables", no closures must be allocated or destroyed during
its execution. The entire state of `iterative-fact1`

lives
on the stack, and consists of the values of the variables r, n, and f.
The computations have the effect of permuting these values, but the
stack as a whole neither expands or contracts. Finally, the
permutation of these items necessary to execute the next iteration is
automatically achieved; no special tail-recursion optimization is
required [Steele78] [Hanson90]! In short, the execution of
tail-recursive loops on our permutation stack machine is not only just
as efficient as that achieved with a special iteration construct, it
is *exactly the same as* that achieved with the special
iteration construct, but without "syntactic sugar".

We eventually get tired of programming each recursive function with
its own driver function and its own function duplication code, in
which case we would like to program a true Y combinator [Gabriel88].
This Y combinator incorporates all of the lazy duplication machinery
necessary to implement iteration and recursion in a linear language.
In order to utilize Y, we will use a slightly different form of
recursive kernel, in which the function is abstracted separately.
Consider again `factorial`

:

(defun fact-kernel (f)
#'(lambda (n)
(if-zerop n (progn n f 1)
(let* ((n n-prime (dup n)))
(* n (funcall f (1- n-prime)))))))

A *closure* must be compiled within
`fact-kernel`

. The anonymous inner function of
`fact-kernel`

is the function which will receive the actual
numerical argument. However, this inner function has a free variable
f which must be carried along with the anonymous function so that when
that function is called, the value of this free variable can be
obtained. The structure which carries this free value is a
*closure*, and consists of a 3-element vector whose last
element is the indicator `'funarg`

, whose second element is
the anonymous inner function code, and whose first element is the
value of the variable f. When the closure is invoked, the anonymous
code will be called with a stack which looks like ...nf. In other
words, the entire closure vector is "spread" onto the stack (about the
same effort as reading an entire cache line), the indicator 'funarg is
dropped, and the code sequence is loaded into the instruction buffer,
leaving only the free variables themselves on the stack. The effect
of this strategy is to allow the anonymous inner function to be
compiled as if its argument list were (n f).

`fact-kernel`

is ```
[#'fact-kernel1 'funarg
vec3]
```

, while the inner function `fact-kernel1`

is
```
[roll2 zerop2 [drop drop '1] [dup 1- roll3 funcall-1 *]
ifelse]
```

.

We need a version of the Y combinator which handles recursive
kernels like `fact-kernel`

which take 1 argument:

(defun y1 (k)
(let*
((g g-prime
(dup #'(lambda (g)
#'(lambda (n)
(let* ((g g-prime (dup g)))
(funcall (funcall k (funcall g-prime g))
n)))))))
(funcall g-prime g)))

Using `y1`

, we can define non-recursive
`factorial`

itself:

(defun factorial (n) (funcall (y1 #'fact-kernel) n))

The code for `y1`

is straight-forward: ```
[#'y1-1
'funarg vec3 dup funcall-1]
```

. The code for `y1-1`

is
`[#'y1-2 'funarg vec4]`

. Finally, the code for
`y1-2`

is ```
[roll2 dup funcall-1 roll2 funcall-1
funcall-1]
```

.

Interestingly, while the source for `y1`

handles only
kernels of one argument, the compiled code handles kernels of
*any* number of arguments, by simply removing the setting and
checking of the register keeping track of the number of arguments!
This is because `y1`

doesn't really deal with the arguments
to the kernel itself, but only with the function closure which is
lazily copied during the recursion.

With additional effort, a general Y combinator can be programmed
which handles mutually recursive kernels
[Baker92MC].
If the duplication of short closure vectors and vectors of mutually
recursive functions can be done quickly--i.e., these vectors are
"copied" by-reference (and thus *reference-counted* [Collins60]
[Baker92LLL]
), rather than truly copied, then recursion and iteration can be as
efficient (O(1)) as in environments having cyclic code. Furthermore,
it is well-known how to optimize closure creation and destructuring
[Steele78], so our machine need not pay this cost for linearity.

Since a generic Y operator for a mutually-recursive system of functions must constantly recirculate and lazily copy a vector of routines in a strictly linear system, the accessing of the elements in these vectors is an excellent place for implicit copying, if such copying is going to be used anywhere. That efficient mutual recursion requires copying is not surprising, but the explicit copying makes more obvious the problem of implementing mutual recursion in a distributed system. Mutual recursion can be performed as a distributed computation, but the expense of explicit copying may make it unpalatable. Nevertheless, linearity will keep precise the meaning of this distributed computation, and guarantee that a system with similar recursions taking place simultaneously on different machines will have the same result as if all of the computations take place on a single machine.

*Combinatory logic* [Curry58] is a logical structure which
is closely related to the *lambda calculus* [Church41]. The
lambda calculus talks about names and substitutions in expression
trees, while combinatory logic achieves the same "computations", but
without needing any names. Backus's speech on the advantages of
functional programming [Backus78] considers the ability of combinatory
logic to eliminate names to be one of its major advantages. Most APL
operators are combinators on array-type objects, and the absence of
names from APL "one-liners" is quite characteristic of
combinators.

When translating from lambda calculus expressions into
*combinators*--as the primitives of combinatory logic are
called, one must replace *random access* to a value by means of
a name with *steering logic* to propagate values to the
locations in an expression tree where they will be used. One of the
simplest translations involves distributing two *copies* of a
value down both branches of a binary tree (the **S**
combinator), followed by the *killing* of any copies at leaves
where they are not used (the **K** combinator). Obvious
optimizations involve sending values only down branches where they
will be used by means of the non-copying **B** and
**C** combinators, which simply steer the values down
either the left or right branch of the binary tree, respectively.

A linear version of the lambda calculus normally translates into
only **B** and **C**-type combinators which
neither copy nor kill values. Since any copying and killing is done
explicitly by the programmer, these operations require
"interpretation"--i.e., they are not built into any control
structures, but require type-specific code for their implementation.
An obvious optimization for a set of linear combinators is to provide
more permutation combinators.

Most Forth operators take their operands from the top of the stack
and return their values to the top of the stack. A perusal of this
Forth code reveals the absence of variable names which is
characteristic of combinators. The programming of Forth operators can
therefore be seen as the construction of larger combinators from
smaller ones. A Forth which incorporates only stack permutation
operations like `swap`

, `rotate`

, and
`roll`

must be linear, because it has no copying or killing
operators.

The usual implementation of Forth utilizes *two* stacks--an
operand stack and a return stack. Many Forth implementations allow
the programmer to make temporary use of the return stack to perform
more complex permutations of the operand stack. If a relatively
complete set of permutation operations for the operand stack is
provided, then the user will rarely need to "roll his own".

Since Forth is usually implemented on a traditional von Neumann machine, one assumes that "rolling a stack" is an expensive operation. The von Neumann bottleneck limits the speed with which the information can be rearranged in the stack, because only one word can be moved at a time. However, the RISC revolution makes the basic instruction cycle as fast as possible--i.e., one instruction per clock, and then uses pipelining and compiler technology to remove any impediments to achieving this speed. Unfortunately, a RISC clock cycle is limited by the need to access both a register bank and an instruction cache once per cycle. While these can be done in parallel and overlapped, the basic fact remains that the cycle time of these small RAM's is the limiting factor in RISC performance. The limiting factor in RAM access time is the logn gate delays that are used to address these RAM's. Therefore, the only way to make the fastest RISC architecture is to limit the size of these RAM's, which necessarily limits the amount of data that can be stored in the registers or the number of instructions that can be stored in the instruction cache.

If we examine Forth instruction streams, however, we notice that
they are punctuated into two kinds of operations--a number of
permutation operations followed by a number of computing operations.
These permutation operations are considered "non-productive", while
computing operations are "productive". However, stack permutation
operations are no less productive than register loads, stores and
transfers in a RISC architecture, and due to their ability to move
more than one word at a time, there is evidence that permutations are
*more* productive than register loads and stores. Furthermore,
although permutations are probably more time-consuming than
"computing" instructions, a true stack architecture could execute
"computing" instructions blindingly fast. For example, AND'ing the
top two items of the stack should take only 4-5 gate delays including
reading from the top two items and storing back to one of them.
Accessing a dual-ported register bank of 64-128 registers should take
significantly longer, not counting the time to store the result. If
one considers "round-trip time", the result may not be usable in the
next clock cycle. Thus, a stack architecture could utilize a clock
period for non-permutation instructions which would be a small
fraction of that for a register machine, and a permutation unit should
take about the same time for its operation as a RAM of the same
size.

Most people describe the top several positions of the Forth stack as "locations", but it is more productive to think of them as "busses", since no addressing is required to read from them at all--the ALU is directly connected to these busses. The permutation operations necessary to get operands onto these busses can now be seen as the same sorts of operations that are used to extract values from registers. If one generalizes this idea, then one can conceive of multiple arithmetic operations being performed simultaneously on a number of the top items of the "stack". For example, the top 4 (or 8) items of the stack may be busses which are directly connected to an FFT butterfly network (itself a generalized permutation generator), in which case a radix-4 (or radix-8) FFT could be computed in parallel. Similarly, other operations on independent busses (top items of the stack) could be performed in parallel--e.g., computing an element-wise addition or multiplication of the top 2n items of the stack, considered in pairs. Of course, parallel operations of this sort must conserve the stack size, but any machine with parallel units of this sort would also have a general permutation network capable of "squeezing out" in parallel empty positions from the stack.

A traditional stack cache utilizes its space on the chip and memory
bandwidth better than a register bank of the same capacity. The stack
cache for a linear stack machine should be even more efficient, since
1) there are no stack frames or frame pointers; 2) all temporaries are
handled the same--both named and unnamed temporaries; and 3) the space
for each temporary is reclaimed immediately upon the use of its value
(i.e., the cache is "self-cleaning"). This last characteristic
guarantees that *all* of the data held in the stack cache is
live data and is not just tying up space.

Since Forth is usually implemented on a traditional von Neumann machine, one thinks of the return stack as holding "return addresses". However, in these days of large instruction caches, in which entire cache lines are read from the main memory in one transaction, this view should be updated. It is well-known that non-scientific programs have a very high rate of conditional branches, with the mean number of instructions between branches being on the order of 10 or less. Forth programs are also very short, with "straight-line" (non-branching) sequences averaging 10 items or less. In these environments, it makes more sense to view the return stack itself as the instruction buffer cache! In other words, the return stack doesn't hold "return addresses" at all, but the instructions themselves! When a routine is entered, the entire routine is dumped onto the top of the return stack, and execution proceeds with the top item of this stack. Since routines are generally very short, the transfer of an entire routine is about the same amount of work as transferring a complete cache line in present architectures. Furthermore, an instruction stack-cache-buffer is normally accessed sequentially, and therefore can be implemented using shift register technology. Since a shift register can be shifted faster than a RAM can be accessed, the "access time" of this instruction stack-cache-buffer will not be a limiting factor in a machine's speed. Executing a loop in an instruction stack-cache-buffer is essentially the making of connections necessary to create a cyclic shift register which literally cycles the instructions of the loop around the cyclic shift register.

We have shown a Lisp called *Linear Lisp* which implements a
"linear style" of programming suggested by *linear logic*
[Girard87]. We have shown this linear style to be competitive with
traditional Lisp styles for sparse polynomials
[Baker94SP]
and Quicksort
[Baker94QS].
Furthermore, the mapping of Linear Lisp onto a stack architecture
[footnote 5]
utilizes a frameless Forth-style stack, in which the basic stack
operations are *permutations*, rather than loads and stores. A
specific kind of permutation--`roll<n>`

--cycles the
top n items of the stack so that the n-th item becomes the top item.
These `roll`

instructions are perfectly matched to the
needs of linear logic to conserve values on the stack until they are
utilized exactly once. Furthermore, any duplication or deletion of
values are explicit operations that take place at the top of the
stack. Interestingly, the consistent use of these `roll`

(move-to-front) and `dup`

operators makes the stack into an
LRU buffer exactly like those used in "stack algorithms" for
simulating LRU caches [Mattson70]!

The idea of permuting the top several items of a stack by means of
a "rolling" operation seems to have occurred independently to a number
of people as a generalization of the `swap`

instruction
from the Burroughs stack machines [Lonergan61] [Hauck68]. Although
the `swap`

instruction was apparently introduced to
optimize the code for non-commutative operators (subtraction,
division) in arithmetic expressions, general rolling operators were
first used in HP calculators [Osborne93] and Forth implementations
[Moore93] to maximize the accessibility of operands while minimizing
real and conceptual space requirements. Although stack architectures
have been extensively written about, we have found *no*
references to the more general use of stack permutations to access
local variables
[footnote 6]
without the use of frame pointers. The literature on compiler
optimizations for stack machines is particularly sparse--probably
because the proponents of stack architectures touted trivial
translation. With the modern notion of an optimizing compiler as a
"partial evaluator", it is time to revisit the stack v. register
controversy.

[AdaLRM83] Reference Manual for the Ada(R) Programming Language. ANSI/MIL-STD-1815A-1983, U.S. Gov't Print. Off., Wash., DC, 1983.

Aho, A.V., and Johnson, S.C. "Optimal Code Generation for Expression Trees". J. ACM 23, 3 (July 1976), 488-501.

Aho, A.V., *et al.* "Code Generation for Expressions with
Common Subexpressions". J. ACM 24, 1 (Jan. 1977),
146-160.

Aho, A.V., *et al.* Compilers: Principles, Techniques
and Tools. Addison-Wesley, Reading, MA, 1986.

Allmark, R.H., & Lucking, J.R. "Design of an Arithmetic Unit Incorporating a Nesting Store". Proc. IFIP 1962, 694-698.

Backus, J. "Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs". Comm. ACM 21, 8 (1978), 613-641.

Baer, J.-L. Computer Systems Architecture. Computer Science Press, Potomac, MD, 1980.

[Baker91SP] Baker, H.G. "Structured Programming with Limited Private Types in Ada: Nesting is for the Soaring Eagles". ACM Ada Letters XI, 5 (1991), 79-90.

[Baker92LLL] Baker, H.G. "Lively Linear Lisp -- 'Look Ma, No Garbage!'". ACM Sigplan Notices 27, 8 (Aug. 1992), 89-98.

[Baker92] Baker, H.G. "NREVERSAL of Fortune--The Thermodynamics of Garbage Collection". Int'l W/S Mem. Mgmt, St Malo, 1992, Springer LNCS 637.

[Baker92MC] Baker, H.G. "Metacircular Semantics for Common Lisp Special Forms". ACM Lisp Pointers V, 4 (Oct-Dec 1992), 11-20.

[Baker93Steal] Baker, H.G. "How to Steal from a Limited Private Account". ACM Ada Letters XIII, 3 (May/June 1993), 91-95.

[Baker93ER] Baker, H.G. "Equal Rights for Functional Objects or, The More Things Change, The More They Are the Same". ACM OOPS Messenger 4, 4 (Oct. 1993), 2-27.

[Baker94QS] Baker, H.G. "A 'Linear Logic' Quicksort". ACM Sigplan Notices 29, 2 (Feb. 1994), 13-18.

[Baker94SP] Baker, H.G. "Sparse Polynomials and Linear Logic". ACM Sigsam Bulletin 27, 4 (Dec. 1993), 3-10.

Barton, R.S. "A New Approach to the Functional Design of a Digital Computer". AFIPS West. Jt. Computer Conf. 19 (1961), 393-396.

Barton, R.S. "Ideas for Computer Systems Organization: A Personal Survey". In Tou, J.T. Software Engineering, Vol. 1. Acad. Press, 1970.

Bell, C.G., *et al.* Computer Engineering: A DEC View of
Hardware Systems Design. Digital Press, Bedford, MA, 1978.

Blake, R.P. "Exploring a Stack Architecture". IEEE Comp. 10, 5 (1977), 30-39.

Brooks, F.P. "Recent Developments in Computer Organization". In Advances in Electronics and Electronic Devices. Acad. Press, NY, 1963.

Bruno, J., and Sethi, R. "Code Generation for a One-Generation Machine". J. ACM 23, 3 (July 1976), 502-510.

Bullman, D.M., *ed.* "Stack Computers". IEEE Comp.
10, 5 (May 1977), 14-52.

Carriero, N., & Gelernter, D. "Linda in Context". Comm. ACM 32, 4 (1989), 444-458.

Case, R.P., & Padegs, A. "Architecture of the IBM System/370". Comm. ACM 21, 1 (Jan. 1978), 73-96.

Castan, M., *et al.* "MaRS: A parallel graph reduction
multiprocessor". ACM Comput. Arch. News 16, 3
(June 1988), 17-24.

Cheese, A. "Combinatory Code and a Parallel Packet Based Computational Model". ACM Sigplan Not. 22, 4 (April 1987), 49-58.

Chirimar, J., *et al.* "Proving Memory Management Invariants
for a Language Based on Linear Logic". Proc. ACM Conf. Lisp &
Funct. Prog., San Fran., CA, June, 1992, ACM Lisp
Pointers V, 1 (Jan.-Mar. 1992), 139.

Church, A. The Calculi of Lambda-Conversion. Princeton U. Press, 1941.

Coffman, E.G., Jr., & Sethi, R. "Instruction Sets for Evaluating Arithmetic Expressions". J. ACM 30, 3 (July 1983), 457-478.

Collins, G.E. "A method for overlapping and erasure of lists". Comm. ACM 3, 12 (Dec. 1960), 655-657.

Cragon, H.G. "An Evaluation of Code Space Requirements and Performance of Various Architectures". ACM Comput. Arch. News 7, 5 (Feb. 1979), 5-21.

Curry, H., and Feys, R. Combinatory Logic. North-Holland, 1958.

Dahl, O.-J., Nygaard. "SimulaÐan Algol-Based Simulation Language". Comm. ACM 9, 9 (1966), 671-678.

Darlington, J., & Reeve, M. "ALICE--A Multiprocessor Reduction Machine for the Evaluation of Applicative Languages". Proc. FPCA'81.

Denning, P.J. "A Question of Semantics". ACM Comput. Arch. News 6, 8 (April 1978), 16-18. Discusses stack architectures.

Deutsch, L.P. "A Lisp Machine with Very Compact Programs". Proc. IJCAI 3, Stanford, CA, 1973.

Doran, R.W. "Architecture of Stack Machines". In Chu, Y.,
*ed.* High-Level Language Computer Architecture.
Acad. Press, NY, 1975, pp. 63-108.

Fasel, J.H., *et al, eds.* "Sigplan Notices Special Issue on
the Functional Programming Language Haskell". ACM Sigplan Not.
27, 5 (May 1992).

Francis, R.S. "Containment Defines a Class of Recursive Data Structures". ACM Sigplan Not. 18, 4 (Apr. 1983), 58-64.

Friedman, D.P., & Wise, D.S. "Aspects of applicative programming for parallel processing". IEEE Trans. Comput. C-27, 4 (Apr. 1978), 289-296.

Gabriel, R.P. "The Why of Y". ACM Lisp Pointers 2, 2 (Oct.-Dec. 1988), 15-25.

Girard, J.-Y. "Linear Logic". Theoretical Computer Sci. 50 (1987),1-102.

Glass, H. "Threaded Interpretive Systems and Functional Programming Environments". ACM Sigplan Not. 20, 4 (Apr. 1985), 24-32.

Goldberg, A., and Robson, D. Smalltalk-80: The Language and Its Implementation. Addison-Wesley, Reading, MA 1983.

Grabienski, P. "FLIP-FLOP: A Stack-Oriented Multiprocessing System". Proc. 2nd Symp. on Parallel Algs. and Archs., Crete, July, 1990, 161-168.

Halstead, R.H. "Overview of Concert Multilisp: A Multiprocessor Symbolic Computing System". ACM Comput. Arch. News 15, 1 (Mar. 1987), 5-14.

Hanson, C. "Efficient Stack Allocation for Tail-Recursive Languages". ACM Conf. Lisp & Funct. Progr., 1990, Nice, France, 106-118.

Harms, D.E., & Weide, B.W. "Copying and Swapping: Influences on the Design of Reusable Software Components". IEEE Trans. SE-17, 5 (1991), 424-435.

Harrison, P.G., & Khoshnevisan, H. "Efficient Compilation of Linear Recursive Functions into Object Level Loops". Symp. on Compiler Constr. '86, Sigplan Not. 21, 7 (July 1986), 207-218.

Hauck, E.A., and Dent, B.A. "Burroughs B6500/B7500 Stack Mechanism". AFIPS Spring Jt. Computer Conf. 32 (1968), 245-251.

Hayes, J.R., *et al.* "An Architecture for the Direct
Execution of the Forth Programming Language". Proc. ASPLOS II,
ACM Sigplan Not. 22, 10 (1987), 42-49.

Herlihy, M. "Wait-Free Synchronization". ACM Trans. on Progr. Lang. & Sys. 11, 1 (1991), 124-149.

Hesselink, W.H. "Axioms and Models of Linear Logic". Formal Aspects of Comput. 2, 2 (Apr-June 1990), 139-166.

Hewitt, C., & Baker, H. "Laws for Communicating Parallel Processes". IFIP 77 Proceedings, North-Holland.

Hill, D.D. "An Analysis of C Machine Support for Other Block-Structured Languages". ACM Comput. Arch. News 11, 4 (Sept. 1983), 6-16.

Hudak, P. "Conception, evolution and application of functional programming languages". ACM Comput. Surv. 21, 3 (Sept. 1989), 359-411.

Hughes, J. "Why functional programming matters". The Computer J. 32, 2 (April 1989), 98-107.

Ito, T., and Halstead, Jr., R.H., *Eds.* Parallel Lisp:
Languages and Systems. Springer LNCS 441, 1990.

James, J.S. "FORTH for Microcomputers". ACM Sigplan Not. 13, 10 (1978), 33-39.

Keedy, J.L. "On the Use of Stacks in the Evaluation of Expressions". ACM Comput. Arch. News 6, 6 (Feb. 1978), 22-28.

Keedy, J.L. "On the Evaluation of Expressions using Accumulators, Stacks and Store-to-Store Instructions". Comput. Arch. News 7, 4 (1978), 24-27.

Keedy, J.L. "More on the Use of Stacks in the Evaluations of Expressions". Comput. Arch. News 7, 8 (June 1979), 18-21.

Keller, R.M., and Lin, F.C. "Simulated Performance of a Reduction-Based Multiprocessor. IEEE Computer (July 1984), 70-82.

Keown, W.F., Jr., *et al.* "Performance of the Harris RTX 2000
Stack Architecture versus the Sun4 SPARC and the Sun3 M68020
Architecture". ACM Comput. Arch. News 20, 3
(June 1992).

Kieburtz, R.B. "Programming without pointer variables". Proc. Conf. on Data: Abstr., Defn. and Struct., Sigplan Not. 11 (special issue 1976), 95-107.

Kieburtz, R.B. "The G-machine: a fast, graph-reduction evaluator". Proc. IFIP FPCA, Nancy, France, 1985.

Kingdon, H, *et al.* "The HDG-machine: a highly distributed
graph-reducer for a transputer network". The Computer J.
34, 4 (1991), 290-301.

Koopman, P. Stack Computers, The New Wave. Ellis Horwood, Chichester, W. Sussex, England, 1989.

Koopman, P., & Lee, P. "A Fresh Look at Combinator Graph Reduction". ACM PLDI'89, Sigplan Not. 24, 7 (1989), 110-119.

Koopman, P.J., Jr., *et al.* "Cache Behavior of Combinator
Graph Reduction". ACM TOPLAS 14, 2 (April 1992),
265-297.

Lafont, Y. "The Linear Abstract Machine". Th. Comp. Sci. 59 (1988), 157-180.

Lawson, H.W., Jr. "Programming-Language-Oriented Instruction Streams". IEEE Trans. Comput. C-17, 5 (1968), 476-485. Discusses stacks.

Lieu, C.T., & Sarni, D. "O.Lisp Theoretical Basis and Description". ACM Sigplan Not. 25, 12 (Dec. 1990), 37-44.

Lipovski, G.J. "Just a Few More Words on Microprocessors of the Future". ACM Comput. Arch. News 6, 6 (Feb. 1978), 18-21.

Lonergan, W., & King, P. "Design of the B5000 System". Datamation 7, 5 (1961), 28-32.

Lunde, A. "Empirical Evaluation of Some Features of Instruction Set Processor Architectures". Comm. ACM 20, 3 (Mar. 1977), 143-153.

Marti-Oliet, N., & Meseguer, J. "From Petri nets to linear logic". Math. Struct. in Comp. Sci. 1, 1 (Mar. 1991), 69-101.

Mason, I.A. The Semantics of Destructive Lisp. CSLI LN 5, Stanford, CA, 1986.

Mattson, R.L., *et al.* "Evaluation Techniques for Storage
Hierarchies". IBM Sys. J. 9, 2 (1970), 78-117.

Maurer, W.D. "A Theory of Computer Instructions". J. ACM 13, 2 (1966), 226-235.

McKeeman, W.M. "Stack Processors". In Stone, H., *ed.*
Introduction to Computer Architecture. SRA, Chicago, 1975,
pp. 281-317.

Moore, C.H. "The Evolution of Forth, An Unusual Language". Byte 5, 8 (Aug. 1980), 76-92.

Moore, C.H. Personal communication, Aug.18, 1993.

Myers, G.J. "The Case Against Stack-Oriented Instruction Sets". ACM Sigarch News 6, 3 (Aug. 1977), 7-10.

Myers, G.J. "The Evaluation of Expressions in a Storage-to-Storage Architecture". Comput. Arch. News 6, 9 (June 1978), 20-23.

Nordstrom, D.J. "Threading Lisp". ACM Sigplan Not. 25, 2 (Feb. 1990), 17-24.

Omondi, A.R. "On function languages and parallel computers". Fut. Gen. Comput. Sys. 6, 4 (Sept. 1991), 355-372.

Osborne, Tom. Personal communication, Aug. 27, 1993.

Passia, J., and Lohr, K.-P. "Fips: A Functional-Imperative Language for Exploratory Programming". ACM Sigplan Not. 28, 5 (May 1993), 39-48.

Patterson, D.A., and Ditzel, D.R. "The Case for the Reduced Instruction Set Computer". ACM Comput. Arch. News 8, 6 (Oct. 1980), 25-32.

Peyton-Jones, S.L., & Salkild, J. "The Spineless Tagless G-machine". ACM FPCA 4, (1989), 184-201.

Ponder, C.G., *et al.* "Are Applicative Languages
Inefficient?" ACM Sigplan Not. 23, 6 (June 1988),
135-139.

Prabhala, B., & Sethi, R. "Efficient Computation of Expressions with Common Subexpressions". J. ACM 27, 1 (Jan. 1980), 146-163.

Randell, B., & Russell, L. Algol 60 Implementation. Acad. Press., 1964.

Rather, E.D., *et al.* "The Evolution of Forth". Proc.
HOPL-II, Sigplan Not. 28, 3 (Mar. 1993), 177-199.

Robison, A.D. "The Illinois Functional Programming Interpreter". ACM Symp. on Interp. and Interpretive Techs., Sigplan Not. 22, 7 (JulyÊ1987), 64-73.

Ryer, M.J. Programming in HAL/S. Intermetrics, Inc., Camb., MA, 1978.

Sarwar, S.M., *et al.* "Implementing Functional Languages on a
Combinator-Based Reduction Machine". ACM Sigplan Not.
23, 4 (Apr. 1988), 65-70.

Shalit, A. Dylan(tm): An object-oriented dynamic language. Apple Computer, Camb., MA, 1992.

Sites, R.L. "A Combined Register-Stack Architecture". ACM Comput. Arch. News 6, 8 (April 1978), 19.

Smith. A.J. "Cache Memories". ACM Comput. Surv. (Sept. 1982), 473-530.

Steele, G.L. Rabbit: A Compiler for SCHEME (A Study in Compiler Optimization). AI-TR-474, MIT AI Lab., May 1978.

[Steele90] Steele, G.L. Common Lisp, The Language; 2nd Ed. Digital Press, 1990.

Stevenson, J.W. "Efficient Encoding of Machine Instructions". ACM Comput. Arch. News 7, 8 (June 1979), 10-17.

Strom, R.E. "Mechanisms for Compile-Time Enforcement of Security". Proc. ACM POPL 10, Jan. 1983.

Stroustrup, B. The C++ Programming Language. Addison-Wesley, 1986.

Suzuki, N. "Analysis of Pointer 'Rotation'". Comm. ACM 25, 5 (1982), 330-335.

Tanenbaum, A.S. "Implications of Structured Programming for Machine Architecture". Comm. ACM 21, 3 (Mar. 1978), 237-246.

Vegdahl, S.R. "A Survey of Proposed Architectures for the Execution of Functional Languages". IEEE Trans. Comput. C-33, 12 (1984), 1050-1071.

Wadler, P. "Linear types can change the world!". Proc. IFIP TC2 Conf. on Progr. Concepts & Meths., April 1990.

Wadler, P. "Is there a use for linear logic?". Proc. ACM PEPM'91, New Haven, June 1991, 255-273.

Wakeling, D., & Runciman, C. "Linearity and Laziness". Proc. Funct. Progr. & Comp. Arch., LNCS 523, Springer-Verlag, Aug. 1991, 215-240.

Wirth, N. "Stack vs. Multiregister Computers". ACM Sigplan Not. (Mar. 1968), 13-19.

Below are the Linear Lisp examples as compiled into Postscript(tm)--a widely available Forth-like language--by our Linear Lisp compiler.

We first define various primitive stack operations.

/roll1 {} def /roll2 {exch} def
/roll3 {3 -1 roll} def /roll4 {4 -1 roll} def
/roll5 {5 -1 roll} def /roll6 {6 -1 roll} def
/roll7 {7 -1 roll} def /roll8 {8 -1 roll} def
/minusp2 {dup 0 lt} def
/cons {2 array astore} def
/carcdr {aload pop} def
/atom2
{dup type /arraytype eq {dup length 2 eq not} {true} ifelse}
def
/zerop2 {dup 0 eq} def
/sub1 {1 sub} def /add1 {1 add} def
/function {userdict exch get} def
/arrayp2 {dup type /arraytype eq} def
/last {dup length sub1 get} def
/closure {/funarg exch 2 add array astore} def
/closurep2
{arrayp2 not {false}
{dup length 0 eq {false}
{dup last /funarg eq}
ifelse}
ifelse}
def
/apply {closurep2 {aload pop pop exec} {exec} ifelse} def

We now show actual compiled examples.

/identity {} def
/five {pop 5} def
/square {dup mul} def
/quadratic
{roll3 dup roll4 dup neg roll2 square 4 roll5 roll6 mul mul
sub sqrt add 2 roll3 mul div}
def
/my-abs {minusp2 {neg} {} ifelse} def
/append
{roll2 atom2 {pop} {carcdr roll3 append cons} ifelse}
def
/append-kernel
{roll3 atom2 {pop pop}
{carcdr roll4 roll4 dup apply cons}
ifelse}
def
/append {/append-kernel function dup apply} def
/fact
{roll2 zerop2 {roll2 pop add1}
{dup roll3 dup roll3 sub1 roll3 roll3 apply mul}
ifelse}
def
/factorial {/fact function dup apply} def
/ifactorial
{1 roll2 {roll2 zerop2 {pop pop}
{dup roll4 mul roll2 sub1 roll3 dup
apply}
ifelse} dup apply}
def
/fact-kernel
{{roll2 zerop2 {pop pop 1} {dup sub1 roll3 apply mul} ifelse}
1 closure}
def
/y1
{{{roll2 dup apply roll2 apply apply} 2 closure}
1 closure dup apply}
def
/factorial {/fact-kernel function y1 apply} def

[Footnote 1] Even mathematical APL dispensed with traditional "operator precedences" when Iverson discovered that 1) most programs dealt with very few numerical expressions, and 2) people couldn't remember the precedence table.

[Footnote 2] We utilize the Postscript(tm) convention that the first argument is pushed first, and is therefore deepest in the stack--i.e., the stack grows to the right.

[Footnote 3]
This heavy use of explicit copying cries for `'`

as a
non-quoting character.

[Footnote 4]
*Typestates* [Strom83] can check linearity in complex control
structures.

[Footnote 5]
Linear languages also map elegantly onto *dataflow*
architectures.

[Footnote 6] [Harms91] [Suzuki82] [Baker92LLL] argue for swapping and permuting program variables. [Herlihy91] argues for swapping lock variables.