We show how to organize the wide variety of bit-vector operations in Common Lisp into a few generic types of processing, and show how word-wide parallelism can be utilized for almost all of these operations to achieve speedups over similar character operations of approximately the size of the word. The major exception to our presentation will be Common Lisp's "SEARCH" function for bit-vectors, which can also be speeded up [Baker89], but the subtlety and complexity of its implementation is outside the scope of this paper.
Symbolic processing can be roughly compared with numeric processing, where there are dense matrix techniques and sparse matrix techniques. Dense matrix techniques are best suited for relatively small matrices, where the underlying physical model is very closely coupled. Sparse matrix techniques are best suited for larger matrices where the underlying physical model is very loosely coupled. Dense matrix techniques assume that most matrix elements are non-zero, and hence that most operations involving those matrix elements are non-trivial. Sparse matrix techniques try to take advantage of the fact that a large fraction of the matrix elements are zero, and hence operations involving these elements are trivial. However, sparse matrices may still have relatively complex structure, even though a large portion of their elements are zero, so flexible methods must be used to represent and manipulate these sparse matrices. Most sparse matrix techniques, therefore, make use of pointers and linked structures which are much more reminiscent of symbolic than of numeric processing.
In symbolic processing, the underlying relationships are almost always sparse, hence the predominance of pointers and linked structures. Nevertheless, there are occasions where the relationships are so unstructured or so dense that pointer techniques are not optimal for symbolic processing. For example, the transitive closure of a binary relation is often dense, even if the original relation is not. In cases such as these, bit-vectors and bit-arrays can often be useful, if their implementation is efficient enough. ([Aït-Kaci89] performs transitive closures of object-oriented hierarchies using bit-vector techniques.)
Below is a straight-forward implementation in Common Lisp of the "or-and" multiplication of a bit-matrix by a bit-vector, which is useful in computing the image set resulting from the application of a binary relation to a domain set.
(defun mult-mat-vec (a v)
"Post-multiply bit-matrix a by bit-vector v."
(declare (type (array bit 2) a) (bit-vector v))
(assert (= (array-dimension a 1) (length v))
(let* ((m (array-dimension a 0))
(result (make-sequence 'bit-vector m :initial-element 0)))
(dotimes (i m)
(when (intersection-test (nmatrix-row a i) v)
(setf (elt result i) 1)))
result))
(defun nmatrix-row (a i)
"Returns a vector displaced to the i'th row of a matrix a."
(declare (type (array * 2) a))
(let ((n (array-dimension a 1)))
(make-array n :displaced-to a :displaced-index-offset (* i n))))
(defun intersection-test (v1 v2)
"Returns true if bit-vectors intersect one another."
(declare (bit-vector v1 v2))
(assert (= (length v1) (length v2))
(some #'logtest v1 v2))
Below is an implementation of Warshall's famous "in-place" algorithm for
computing the transitive closure of a bit-matrix in Common Lisp.
(Unfortunately, due to bugs in the implementation of displaced bit-arrays, this
elegant program will not run on some Common Lisp implementations.)
(defun nwarshall (a)
"Implements Warshall's algorithm for the transitive closure."
"Transitively closes the square bit matrix a in place."
"[Baase, p.223]"
(declare (type (array bit 2) a))
(assert (= (array-dimension a 0) (array-dimension a 1)))
(let ((n (array-dimension a 0)))
(dotimes (k n)
(let ((ak (nmatrix-row a k)))
(dotimes (i n)
(let ((ai (nmatrix-row a i)))
(unless (zerop (bit a i k))
(bit-ior ai ak t)))))))
a)
Common Lisp provides a number of representations and operations for bit vectors
and arrays. The most obvious is the "BIT-VECTOR" datatype, which is a subtype
of "ARRAY" in which the elements are restricted to "BIT"'s, i.e., the integers
zero and one, and in which the array rank (number of dimensions) is one.
Bit-vectors of this type (which we will call "array" bit-vectors) are also
Common Lisp "sequences" because they are one-dimensional arrays, so they
inherit all of the operations of the Common Lisp sequence functions. However,
there are a number of additional Common Lisp functions which operate on all
"bit-arrays"--i.e., arrays of bits. Curiously, even though Common Lisp has an
abbreviation for "(ARRAY BIT (*))" -- "BIT-VECTOR" -- and an abbreviation for
"(SIMPLE-ARRAY BIT (*))" -- "SIMPLE-BIT-VECTOR" -- it does not provide a
standard abbreviation for "(ARRAY BIT *)", for which the obvious abbreviation
would be "BIT-ARRAY". Finally, there is an extensive treatment of integers as
bit-vectors (which we will call "integer" bit-vectors) through their
2's-complement notation, including "logand", "integer-length", etc. (Logical
operations on Maclisp bignums were first implemented to support LINGOL
[Pratt73]
through the BBOOLE package
[Baker75],
and eventually found their way into the MIT Lisp Machine and Common
Lisp.)
Unfortunately for a potential user of bit-vectors in a Common Lisp program, these different representations and function suites have not been rationalized, so it may be difficult to decide a priori which representation to use in an application. To make things worse, some Common Lisp vendors provide such inefficient implementations of some of the bit-vector functions that they are useless for real programming.
Before delving into the efficient implementation of the various bit-vector
operations, we first give some perspective on the relationships among these
different operations through the following chart, which compares the operations
available for "integer" bit-vectors with those available for "array"
bit-vectors. (See [Eliot89] for a discussion of various representations for
sets in Common Lisp; note particularly the chart in his Figure 3.)
Operation Integer Version Array Version Sequence Version
Copy v unnecessary make-array :initial-contents v copy-seq
Make-all-zeros 0 make-array :initial-element 0 make-sequence :initial-element 0
Make-all-ones -1 make-array :initial-element 1 make-sequence :initial-element 1
Test all zeros zerop, not ldb-test N/A not find 1
Test all ones eql -1 N/A not find 0
Size integer-length array-dimension(s) length
Access i'th bit logbitp bit, sbit elt
Boolean fns boole N/A N/A
logand bit-and map #'logand
logior bit-ior map #'logior
logxor bit-xor map #'logxor
logeqv bit-eqv map #'logeqv
lognand bit-nand map #'lognand
lognor bit-nor map #'lognor
logandc1 bit-andc1 map #'logandc1
logandc2 bit-andc2 map #'logandc2
logorc1 bit-orc1 map #'logorc1
logorc2 bit-orc2 map #'logorc2
lognot bit-not map #'lognot
Intersect test logtest N/A some #'logtest
Subset test N/A N/A every #'<=
Select portion ldb,ash make-array :displaced subseq
Count bits logcount N/A count
Leftmost 1 integer-length N/A position
Mask a field mask-field N/A :start, :end
Reverse N/A N/A reverse, nreverse
Concatenate logior+ash N/A concatenate
Compare =, equal, equalp equal, equalp mismatch
Search subseq N/A N/A search
Block init N/A :initial-element ? fill
"BLT" blk trans dpb :initial-contents ? replace, setf-subseq
Chart of analogous operations for integers, bit-arrays and bit sequences.
We can see by this chart that there is a great deal of overlap between the
types of operations available for the various forms of bit-vectors. However,
there are also some obvious holes, such as the lack of a quick intersection
test for "array" bit-vectors, the lack of a subset test, and the lack of a
reverse operation on integers. (A reverse operation on integers would come in
handy for the implementation of FFT's, and would also aid in certain other
recursive decomposition tasks.)
The largest difference, however, between array-type bit vectors and integer-type bit vectors is in the fact that integer bit-vectors are functional, while array bit-vectors allow for side-effects. Thus, if n different pointers exist to an integer bit-vector, and one applies the "lognot" function to that bit-vector, then this operation only affects the result, and not any of the n existing pointers to the integer. On the other hand, "bit-not" with a second argument of "t" will flip all of the bits in the bit-vector given as its first argument--not only for the result bit-vector, but also for every other holder of a pointer to that argument bit-vector. Therefore, integer bit-vectors are better for functional programming, while array bit-vectors are better for state-oriented programming.
(Those in the functional programming camp may argue that the functional integer bit-vectors are good enough for all purposes. However, I am not aware of any Lisp compilers which are smart enough to solve the "aggregate update problem" for integer bit-vectors [Schwartz75][Hudak86], and therefore the manipulation of integer bit-vectors results in quite a bit more cons'ing than might be used when programming in a more imperative style utilizing array bit-vectors. Without a generational garbage collector [Moon84], the excellent efficiency of the integer bit-vector operations is nullified by the large amount of garbage collection that results from their use.)
Given that we would like a complete complement of bit-vector operations for either style of programming, there are two ways to fill the holes in our chart--add additional operations for both types of bit-vectors and/or add a conversion function which offers the efficient conversion of one type into the other. Since we want to allow for the widest possible flexibility in the use of the Common Lisp language, we suggest doing both--filling the holes in the chart with new functions and providing for the interconvertability of integers and array bit-vectors (see also [Eliot89] for a similar proposal).
(There are often other differences between integer bit-vectors and array bit-vectors due to the vagaries of an implementation. E.g., because Coral Common Lisp on the Macintosh allocates temporary space from the run-time stack for all integer operations, its operations are very fast, but its integers are also limited to 32K bytes, which is quite large enough for most integer calculations, but not large enough for many multi-dimensional bit-array applications.)
Some languages other than Common Lisp offer bit-vector facilities for bit-vectors which exceed the word length of the underlying hardware. The original Pascal implementation offered "sets" up to size 60, thanks to Control Data. The Ada language [AdaLRM] offers the equivalent of Common Lisp's bit-not, bit-and, bit-ior, bit-xor, equal, subseq, copy, concatenate, fill, and replace as builtin capabilities for arbitrary-length bit-vectors. Interestingly enough, efficient Ada compilers must deal with the full complexity of packed bit-vectors which can occur with any bit-offset to a word boundary, just as we describe below. Ada also offers one builtin capability not available in Common Lisp--a lexicographic comparison, although such a comparison can easily be emulated using mismatch. Of course, the APL language--as the inspiration for many Common Lisp functions--offers many interesting and useful functions for manipulating bit-vectors, some of which are missing in Common Lisp--scan and bit-vector-reduce--which can often be quite useful.
By efficient, we mean that we would like to take advantage of the SIMD-type "word parallelism" that exists on all modern serial computers. In other words, if we can operate on 16-bit words at a time (called "Whiz-Along-By-Words" in [White86]) instead of single bits, we would like to achieve a 16-fold speedup over the fully serial operation operating on only a single bit at a time. Clearly, the larger the word size that we can utilize, the faster these operations should go.
There are some complications, however. Very few computers offer addressing capabilities down to the individual bit, and demand that words be addressed on "word boundaries", so there are the possibilities that incomplete words need to be processed, and that two bits which are to be combined may reside at different bit locations within a word. Both of these complications will require additional processing time, and add substantial complexity to the task of efficient implementation. Yet the payoff of 16-64 X speedups is far too important to ignore.
Virtually all processing of bit-vectors in Common Lisp is stream-like, involving one or more input streams of bits, and zero or one output streams of bits. (One can easily conceive of interesting operations involving multiple output streams of bits, such as a dual complement/reverse complement operation which treats the bit-vectors as sets, but none of the standard Common Lisp bit-vector operations requires more than one output bit stream.) Depending upon the operation, these bit-streams are derived from the bit-vector by processing from the start of the vector to the end, or from the end of the vector to the start. The bit-stream may also need to be inverted in the process of being read or written.
The key to the efficient processing of bit-vectors is in the ability to process more than one bit at a time. Therefore, these bit streams need to be able to read or write more than one bit at a time (White calls these chunks "bigits" [White86]). Ideally, one would like to specify the number of bits to be processed--or "byte size"--perhaps 8 bits at a time for one operation, or 32 bits at a time for another operation. The complications of word boundaries occurs when processing multiple bits--sometimes one must process less than the full word width, and sometimes the word to be processed straddles a word boundary.
(By word, we mean the smallest unit which can be efficiently read and written to memory. This does not necessarily mean the smallest addressable unit, as many modern "CISC" chips can address a 32-bit word occurring on any 8-bit boundary, but the maximum efficiency is reached when 32-bit words are read and written on 32-bit boundaries. By byte, we mean some number of bits which has been chosen as convenient for a particular type of processing; our bytes do not necessarily have 8 bits.)
Vanilla Common Lisp already has an excellent model for bit streams which could theoretically be used for these purposes--the "binary" file. Binary files in Common Lisp are homogeneous sequences of "bytes", but since the byte size need not remain the same for all uses of the file, the only portable implementation is as a sequence of bits. One could conceive of setting up a binary file associated with each bit-vector which needed to be processed, which would in turn set up a "byte-stream" which could be read or written, queried for end-of-file, etc. Such a stream could easily be set up to read either from the beginning or the end, and to perform on-the-fly complementation.
A true Common Lisp "stream" implementation of the bit-vector operations would be very easy to write code for, and the code would be quite readable. However, such an implementation would be even more inefficient than a straight-forward serial implementation of the same functions. This is because most implementations of streams in Common Lisp systems implement stream operations as function calls, and expect to perform quite a bit of processing per byte read or written, so the relative overhead of the stream operations is not normally objectionable. However, in the implementation of bit-vector operations, such overhead would be intolerable, since the cost of a bit-operation is essentially free compared with the cost of reading and writing the data.
However, rather than "throw out the baby with the bath water", we can keep the "byte stream" model of processing, but implement it with some extremely efficient macros instead of true Common Lisp streams, and thereby achieve intellectual economy together with high execution speed.
(A problem occurs in trying to implement these byte stream macros in portable Common Lisp--there is no efficient way to access the representation of a bit-vector where the actual bits are stored. Coral Common Lisp offers an escape--some open-coded "sub-primitives" which enable us to access 8-bit, 16-bit and 32-bit "bytes" at any offset within an object--irrespective of the type of the object. While using these operations involves delving unnecessarily deeply into the gory details of the representation of bit-vectors, Common Lisp provides no other intermediate level access to the actual bits. The basic nature of these routines is roughly equivalent to ldb for a particular subset of sizes and positions, except that we operate on bit-vectors instead of integers. Perhaps a future Common Lisp revision can incorporate such an operation.)
At the same time that we implement a new set of specialized and efficient byte-stream macros, we can also handle one of the complications of word addressing at the same time. While it is obvious that the last byte of a sequence may sometimes not be a full byte, but only a partial byte, there are times when it is more efficient to also process the first byte as only a partial byte. This is for "synchronization" reasons, when the processing of the middle (neither first nor last) bytes is thereby speeded up. Since we expect that bit-vectors will sometimes be quite long, it is important that the middle bytes be processed in the most efficient manner. Therefore, in the setting up of such a byte stream, we must separately specify both the offset within a byte of the beginning of the first byte as well as the length of the first byte.
What different capabilities will be required of our "byte-streams"? We will obviously need to read and write them, but less obvious is the need to "update" them in place. Due to the possibility of the "t" argument in the "bit-xxx" bit-array logical operations, it will be necessary to be able to read, then write back into the same location, the result of an operation. We will also have to read backwards (although we will keep the order of the bits within the byte the same as the forwards order), and we may also want to be able to complement a byte after reading or before writing.
Due to the fact that during any builtin Common Lisp bit-vector operation, at most one bit stream will be written (or updated), we can always "synchronize" all of the bit streams being read to the one being written. Once this synchronization has been performed, we can always write words on word boundaries, and thereby avoid a large amount of complexity and inefficiency, because writing across word boundaries entails more inefficiency than reading across word boundaries. Therefore, only the first and last byte on a write (or update) stream will be less than a full byte, and even that byte will never cross a word boundary. Thus, for implementing just the standard Common Lisp bit-vector functions, we need not implement the most general possible write/update stream.
(Note that even if a computer offered hardware addressing to the single bit, and allowed the hardware reading and writing of words at any bit boundary, it would still be much more efficient for the software to operate on full-word boundaries. This is because while the hardware reduces the software complexity in terms of the number of instructions executed, it is still slower, due to the larger number of cache misses and memory accesses.)
Synchronization of different byte streams is the process of choosing a stream which will be processed (except for its first and last bytes) on a word boundary, and then computing the appropriate offsets for the other streams so that the corresponding bits become aligned for processing. With a single stream, we can always synchronize the stream to itself, while if there are two streams, we can synchronize one to the other. (As we pointed out above, we will always synchronize to the write stream because writing unaligned streams is more expensive than reading unaligned streams.) With the possibility of up to two streams for reading and one stream for writing required to implement some Common Lisp bit-vector operations, we have the possibility that all three streams will have different alignments, so we force both read streams to synchronize to the write stream. We call the case of similar alignments "aligned" in our benchmarks, and the case of different alignments "unaligned".
Many implementors who first approach Common Lisp bit-arrays assume that because the definition of the "bit-xxx" functions (e.g., "bit-and", "bit-ior") require the same rank and the same dimensions, that they will always be "synchronized" in the sense that corresponding bits from the different arrays will occur at the same location within a word. However, this is not the case when one or more of the arrays is "displaced", as the "index-offset" of a displaced array can be any non-negative integer, which can therefore position the first element of the array at any bit within a word. In addition, these functions must also be careful not to disturb bits earlier in the word before the first element or later in the last word, because the underlying simple-array may also be accessible to the user, and bits which are not participating in the "bit-xxx" function should not be affected. The "nwarshall" algorithm shown above depends critically upon the correct implementation of "bit-xxx" functions on displaced arrays.
(We note that most modern workstations already have extremely efficient implementations of the "bit-xxx-t" functions under the guise of the "bitblt" function used to manipulate the two-dimensional bitmap of the screen. In fact, Symbolics Common Lisp utilizes these functions to implement the "bit-xxx" functions, and they are very fast; curiously, Symbolics sequence functions on bit-vectors are abysmally slow (ca.Rel. 7.0). However, bitblt (at least as originally envisioned by the Xerox Alto) is a two-dimensional operation involving only two arguments, and translating its use to a single dimension and three arguments for use within Common Lisp can require some care. Coral Common Lisp, for example, does not handle the case of unaligned arguments correctly. When special purpose bitblt hardware is available for use within normal memory, its could provide the effect of an array processor for these Common Lisp functions. One must be careful, however, of potentially long startup overheads for these 2-dimensional operations; Common Lisp bit-vectors which are short also want to be efficiently manipulated.)
We note that several proposals [Moon] have been made to modify the nature of the standard Common Lisp bit-vector functions. We feel that our techniques will not be affected in any major way by these proposals, so we have retained the original 1984 Common Lisp semantics [Steele84].
We will now go through the Common Lisp bit-vector functions in turn and indicate how they can be efficiently implemented. All times given are for Coral Common Lisp 1.2 running on a Macintosh Plus with a 16MHz Radius 68020 Accelerator card and 4Mbytes of main memory.
Our implementation of find/position in Coral Common Lisp utilizes a byte-size of 16 bits (hence processes the bit-vector 16 bits at a time), and operates in about .5usec/bit processed.
Our implementation of "find1" in Coral Common Lisp utilizes a byte-size of 16 bits (hence processes the bit-vector 16 bits at a time), and operates in .46usec/bit processed, while "find0" requires .62usec/bit processed.
Equal/mismatch requires two read streams, with the second being synchronized to the first, so that at least one stream is processed on word boundaries. When a non-equality is found, then mismatch calls "position1" on the logxor of the two words to find the position of mismatch.
Our implementation of equal/mismatch requires .82usec/bit for aligned streams, and 1.41usec/bit for unaligned streams.
Our implementation of fill requires .45usec/bit processed.
Replace is the closest operation Common Lisp has to a "bitblt" or "bit-block-transfer" operation. Since Common Lisp requires that replace "work correctly" even when source and destination overlap, we will require both a normal and a "from-end" version of replace, since when we need to move a subsequence of a bit-vector up within the same vector, and the destination overlaps the source, we will lose information if we transfer information from the start rather than from the end.
Whether replace-ing from the start or from the end, we always synchronize the source to the destination.
Both versions of replace take the same amount of time in our implementation: .78usec/bit when source and destination are aligned, and 1.3usec/bit when unaligned.
Bit-not-t takes .89usec/bit, while bit-not-2 takes .99usec/bit aligned and 1.57usec/bit unaligned.
The bit-xxx-3 functions have two read streams and a write stream, and come in both from-start and from-end versions. If a source overlaps with the destination, and both sources are higher than the destination, then we use the from-end version. The worst case occurs when the destination overlaps both sources, and yet lies in between them. In this case, we can neither process from the beginning or from the end without getting into trouble. Therefore, we either allocate a temporary on the stack or on the heap (depending upon the size), and move the offending source to the temporary location (using replace) and then utilize the bit-xxx-3 function on the now-tractable situation.
(We note that it is theoretically possible to perform bit-xxx-3 in-place without using additional storage even in the worst case of overlap of both sources and the destination when the destination is between the two sources. However, this algorithm requires a group-theoretic approach to performing the updates in certain cycles, in a manner similar to algorithms which have been proposed for transposing non-square matrices in place without additional storage [Brenner]. Such a method is far too complicated (and slow, since it tends to operate on a single element at a time) for use in a real Common Lisp implementation.)
Bit-and-t takes 1.04usec/bit for aligned sources and 1.55usec/bit for unaligned sources. Bit-and-3 takes 1.1usec/bit for aligned sources and destination and 2.16usec/bit for all three unaligned. Add an additional .78usec/bit for the additional copy for the worst case.
Despite their similar names, a different technique is required in order to perform reverse and nreverse. We actually denote the functions "reverse-2" and "reverse-t". Reverse-2 is called to move and reverse a bit-vector to a destination known not to overlap with the source. One "from-end" stream is used to read, while a normal "from-start" stream is used to write the destination. The read stream is synchronized to the write stream. A 256-element table is used to reverse the bits within an 8-bit byte, and this table is used twice to reverse the bits in a 16-bit word.
Reverse-t requires a subsidiary function which reverses a byte-vector in place. We first save the non-participating bits from the first and the last bytes, then reverse the entire byte-vector in place. However, the bits in this vector may not be properly aligned, so we then call replace to shift the appropriate bits up or down, if required. Finally, the non-participating bits are restored.
Reverse-2 takes 1.8usec/bit for word-aligned bit-vectors and 2.4usec/bit for non-word-aligned bit-vectors. Reverse-t takes 1.7usec/bit for byte-aligned bit-vectors and 3.1usec/bit for non-byte-aligned bit-vectors.
We utilize two read streams similar to equal or mismatch, with the second stream synchronized to the first. Whenever the logical and of the two bytes read is non-zero, then we immediately stop. In the case where the intersection is likely to occur towards the end of the vectors, there may be a requirement for a "from-end" version.
Our test-bit-vector requires .95usec/bit for aligned bit-vectors, and 1.41usec/bit for unaligned bit-vectors.
Our implementation of count in Coral Common Lisp requires .467-1.53 usec/bit processed.
For these reasons, we do not perform in-place deletions, and "delete" and "delete-duplicates" are just additional names for "remove" and "remove-duplicates", respectively.
Nsubstitute is easier than substitute, as we need only call "fill" on the affected portion of the bit-vector to make the affected portion either all 0's or all 1's.
Merge of two bit-vectors is slightly more efficient, since we need only find the first 1 in each sequence rather than count the 1's before constructing the answer.
CCL CCL Nimble Nimble Speedup Operation simple dsplcd aligned unaligned Factor (X faster) subseq 102 112 --use replace-- 78-144 copy-seq 86 94.2 --use replace-- 66-121 reverse 3.4 84 1.8 2.4 1.4-47 nreverse 2.3 93 1.7 3.1 .74-55 make-seq :initial 45 -- --use fill-- 100 fill 46.8 56 .45 .45 104-124 replace 95.2 117 .78 1.3 73-150 remove nothing 133 142 .933-1.983 67-152 remove everything 94 106 .467-1.533 61-227 remove-duplicates 635 665 .5-.733 866-1330 substitute nothing 135 142 --use replace-- substitute everything 140 147 --use fill-- find1 90.2 100.3 .46 .46 196-218 find0 90.2 100.3 .62 .62 145-162 position 89.5 98.3 .5 .5 179-197 count 90.8 100.3 .467-1.53 .467-1.53 59-215 mismatch 154 175 .917 1.43 107-191 equal 70.3 93 .85 1.43 49-109 sort appr 94*n*log2n .92*n to 2.0*n 103-infinite intersection test 260 -- .95 1.4 186-274 bit-not-t .94 bad ans..89 .89 1.06 bit-not-2 .94 bad ans..99 1.57 .95 bit-and-t .99 bad ans.1.04 1.55 .95 bit-and-3 1.0 bad ans.1.1 2.16 .91 mult-mat-vec 10^3x10^3 approx. 5 min. 1 second worst case nwarshall 100x100 bad ans. 1.6 seconds worst case Timing Comparison of Coral Common Lisp (CCL 1.2) with Nimble bit-vector operations (All times are in usec/bit processed on 4Mbyte Mac+ w/16MHz 68020 accelerator)
This implementation is not the fastest possible for the Macintosh hardware: perhaps an additional factor of 5-8 could be accomplished through a combination of moving from a byte-size of 16 to a byte-size of 32 and recoding in assembly language. On the new pipelined RISC processors, additional factors can be gained through the proper scheduling of operations within the pipeline and defeating the cache (which is no help when accessing FIFO streams). We estimate that the new generation of 25+ mips serial processors now coming on stream should achieve times which are 100-1000 times as fast as these shown for the Macintosh -- i.e., flirting with 1 GBOP (giga-bit-operations/second). Finally, bit-vector operations on true SIMD architectures such as the "DAP" and the "Connection Machine" can gain addition factors of 100 on bit-vector operation speeds.
Aït-Kaci, Hassan; Boyer, Robert; Lincoln, Patrick, and Nasr, Roger. "Efficient Implementation of Lattice Operations". ACM TOPLAS 11,1 (Jan. 1989),115-146.
Baase, Sara. Computer Algorithms: Introduction to Design and Analysis. Addison-Wesley, Reading, Mass., 1978.
Baker, Henry. ML:HGB;BBOOLE >. MIT AI Lab., 1975.
Baker, Henry. "The Efficient Implementation of Common Lisp's SEARCH Function on Bit-vectors". Internal Memorandum, Nimble Computer Corporation, 1989.
[Baker92] Baker, Henry. "A Decision Procedure for Common Lisp's SUBTYPEP Predicate". J. Lisp and Symbolic Computation 5 (1992), 157-190.
Brenner, Norman. "Algorithm 467: Matrix Transposition in Place". CACM 16, 11 (Nov. 1973),692-694.
Eliot, Christopher R. "Manipulating Sets in Common Lisp". Lisp Pointers 2, 3&4 (Jan.-June 1989),5-14.
Hudak, Paul. "A Semantic Model of Reference Counting and its Abstraction". Proc. 1986 ACM Conf. on Lisp and Funct. Prog., August,351-363.
Moon, David. "Garbage collection in a large Lisp system". Proc. 1984 ACM Conf. on Lisp and Funct. Prog.,235-246.
Moon, David. "BIT-ARRAY-FUNCTIONS", proposal for modifying the Common Lisp standard. June, 1989.
[Pratt73] Pratt, Vaughan R. "A Linguistics Oriented Programming Language". Proc. IJCAI 3, (Aug. 1973),372-380.
Schwartz, J.T. "Optimization of very high level languages, Part II: Deducing relationships of inclusion and membership". Computer Languages 1,3 (1975),197-218.
[Steele84] Steele, Jr., Guy L. Common Lisp: The Language. Digital Press, Burlington, Mass., 1984.
White, Jon L. "Reconfigurable, Retargetable Bignums: A Case Study in Efficient, Portable Lisp System Building". Proc. 1986 ACM Conf. on Lisp and Funct. Prog., August,174-191.