Let Over Lambda -- 50 Years of Lisp

by Doug Hoyte

Macro Efficiency Topics

Lisp Is Fast

If you tile a floor with tiles the size of your thumbnail, you don't waste many. —Paul Graham

Some people think lisp is slow. While this may have been true for some early lisp implementations, it has been demonstrably false for many years. In fact, modern lisps like COMMON LISP have been designed to let us use macros to make lisp fast. Really fast. The goal of this chapter might surprise you if you believe the performance myth that low-level languages are inherently more efficient than lisp. This chapter aims to illustrate that lisp can be faster than any other programming language and that low-level languages like C are actually—because they lack macros—at a performance disadvantage to lisp. Lisp allows us to write code that is more efficient than the code we would be obliged to write in any other language. Especially for large and complicated programs, macros allow the creation of code with definite performance advantages over Blub languages. Sometimes our languages are designed to have efficient implementations, but more often they are designed to provide maximum expressive power to the programmer. When we do choose efficiency, lisp is fast. Really fast.

While other languages give you small, square tiles, lisp lets you pick tiles of any size and of any shape. C programmers always work in a language that is directly tied to the capabilities of a fancy fixnum adder. Aside from procedures and structures, little abstraction is possible in C. By contrast, lisp was not designed around the capabilities and limitations of machines at all.

But surely other languages can be written in less efficient, more convenient ways. After all, Perl allows a programmer to perform miracles with a single dense expression but also provides many upgrade paths for faster code. So what does it mean when we say that lisp allows us to control the efficiency of our abstractions like no other language? As you might have come to expect by now, the answer involves the topic of our book: macros.

Instead of inquiring what makes a program fast, it's better to ask what makes a program slow. This is one of the most researched topics in programming. The root causes can be roughly classified into three broad categories: bad algorithms, bad data-structures, and general code.

All language implementations need good algorithms. An algorithm is a presumably well-researched procedural description of how to execute a programming task. Because the investment required to come up with an algorithm is so much larger than that required to implement one, the use of existing algorithms is ubiquitous throughout all of computer science. Somebody has already figured out how, and why, and how quickly, an algorithm works; all you have to do to use an algorithm is translate its pseudo-code into something that your system can understand. Because COMMON LISP implementations have typically been written by smart people and continuously improved upon for decades, they generally employ some of the best and quickest algorithms around for most common tasks. For instance, CMUCL uses a tuned heap sort implementation for sorting lists and vectors, and the Mersenne Twister 19937 algorithm, with its ridiculously large period, for generating random numbers.

Good data structures are also necessary for any decent programming language. Data structures are so important that ignoring them will cause any language implementation to slow to a crawl. Optimising data structures essentially comes down to a concept called locality. Explaining this concept is easy—data that is accessed most frequently should be the fastest to access. Data structures and locality are so important that they can be observed clearly at almost every level of computing where performance gains have been sought: large sets of CPU registers, memory caches, data-bases, and caching network proxies are a few highlights. Lisp offers a huge set of standard data structures and they are generally implemented very well. Hash tables, linked lists (obviously), vectors with fill pointers, packages with internable symbols, and much more is specified, available, and well implemented for COMMON LISP programmers to take advantage of.

If lisp provides such good algorithms and data-structures, how is it even possible that lisp code can be slower than code in other languages? The explanation is based on the most important design decision of lisp: general code, a concept otherwise familiar to us as duality of syntax. When we write lisp code, we use as many dualities as possible. The very structure of the language encourages us to. Part of the reason why lisp programs are usually so much shorter than Blub programs is because any given piece of lisp code can be used for so much more than can a corresponding piece of Blub code so you can re-use it more often. Coming from a Blub language perspective it can feel unusual to have to write more to get less, but this is the important lisp design decision we have been talking about—duality of syntax. The more dualities attached to each expression, the shorter a program seems to be. So does this mean that to achieve or exceed C's performance we need to make our lisp programs as long and dangerous as their corresponding C programs? No. Lisp has macros.

Macros Make Lisp Fast

This section shows examples of using three types of macros to help create efficient programs: regular macros, read macros, and a type of macro designed especially for this task: compiler macros.

Macros can be used to control the algorithms, data-structures, type checks, safety checks, optimisation levels of code or portions of code, and much more. We can have safe and general code co-existing in the same program—or even function—as fast and dangerous code. In short, no other language provides an open interface for controlling the compiler in the way that lisp does, and it's all thanks to (what else?) macros. A cursory reading of the ANSI standard seems to suggest that macros and declarations, the most direct way to communicate with your compiler, don't work well together:

Macro forms cannot expand into declarations; declare expressions must appear as actual subexpressions of the form to which they refer.

What ANSI means is that the following macro will not work as desired:

(defmacro go-fast () ; Broken!
  '(declare (optimize (speed 3) (safety 0))))

We can't put macro invocations in places where declarations are expected. Another way to think about this is that the system's code-walker is not required to expand macros in the bodies of special forms before it checks for declarations. Wanting to go fast is such a common thing to declare that maybe we can do even better than the flawed go-fast macro above. When we want to compress meaning as much as possible, often we want a read macro. Read macros are also suitable for expanding into declarations because they are expanded long before the code-walker attempts to walk the code: they read in as actual subexpressions.

SHARP-F
(set-dispatch-macro-character #\# #\f
   (lambda (stream sub-char numarg)
     (declare (ignore stream sub-char))
     (setq numarg (or numarg 3))
     (unless (<= numarg 3)
       (error "Bad value for #f: ~a" numarg))
     `(declare (optimize (speed ,numarg)
                         (safety ,(- 3 numarg))))))

Sharp-f is a read macro that can be used to control the most important performance trade-off available to COMMON LISP programs: the balance between the speed and the safety declarations. For instance, sharp-f by itself reads in as what we would have liked go-fast to expand into:

* '#f

(DECLARE (OPTIMIZE (SPEED 3) (SAFETY 0)))

However, we can change this and declare that we value safety above speed by providing a number less than 3 as the reader number argument. All dispatch read macros can accept a numeric argument like this. It is passed as the third argument, often called numarg, to the read macro function. Here is an example where we value safety over speed by providing 0:

* '#0f

(DECLARE (OPTIMIZE (SPEED 0) (SAFETY 3)))

Values of 1 and 2 can also be passed, resulting in the following declarations. The effects of these intermediate settings are very compiler dependent, so you will rarely need them:

* '(#1f #2F)

((DECLARE (OPTIMIZE (SPEED 1) (SAFETY 2)))
 (DECLARE (OPTIMIZE (SPEED 2) (SAFETY 1))))

Although macros can't directly expand into declarations, we can still use regular macros to control declarations. Because the code-walker can't walk a macro form to search for declarations until it has expanded that macro, there is no way that it can tell if this declaration was an actual subexpression of a form you wrote or if the macro added the declaration when it was expanded.

FAST-PROGN
(defmacro fast-progn (&rest body)
  `(locally #f ,@body))

SAFE-PROGN
(defmacro safe-progn (&rest body)
  `(locally #0f ,@body))

Fast-progn and safe-progn are simple examples of macros that expand into forms containing declarations. Notice that we use locally's implicit progn instead of progn itself because progn doesn't accept declarations. These two macros use the sharp-f read macro we defined previously. We can use these forms as a version of progn where its enclosed expressions are optimised for speed (but are dangerous) and a version where they are ensured safe (and possibly slow):

* (macroexpand
    '(fast-progn
       (+ 1 2)))

(LOCALLY
  (DECLARE (OPTIMIZE (SPEED 3) (SAFETY 0)))
  (+ 1 2))
T

We can also provide other declarations in the macro arguments because their location is not and cannot be verified until the macro is expanded:

* (macroexpand
    '(fast-progn
       (declare (type fixnum a))
       (the fixnum (+ a 1))))

(LOCALLY
  (DECLARE (OPTIMIZE (SPEED 3) (SAFETY 0)))
  (DECLARE (TYPE FIXNUM A))
  (THE FIXNUM (+ A 1)))
T

When experimenting with macro expansions, sometimes we would like to see what happens when we embed them into various lexical contexts. Combining the read-time evaluation macro described in section 4.1, Run-Time at Read-Time, with the * variables that keep the last three REPL results available lets us see that our form evaluates as expected:

* (let ((a 0))
    #.*)

1

But notice that although the above evaluated correctly, declarations are sometimes only fully considered for compiled code. For example, since our evaluation above interpreted the code, it will probably ignore the safety declaration and go ahead and promote an overflowed result to a bignum. Let's see if this happens here:

* (let ((a most-positive-fixnum))
    #.**)

536870912

It does. CMUCL ignores declarations for interpreted code. We want to continue playing with our expression currently in ***, but since we're not sure if we're going to get it on this next try, let's bring it back to * so we don't lose it:

* ***

(LOCALLY
  (DECLARE (OPTIMIZE (SPEED 3) (SAFETY 0)))
  (DECLARE (TYPE FIXNUM A))
  (THE FIXNUM (+ A 1)))

There it is. So we now have three more chances to make it work. Let's try compiling it to see if we can observe fixnum wrapping:

* (funcall
    (compile nil
      `(lambda ()
         (let ((a most-positive-fixnum))
           ,*))))

; Warning: This is not a (VALUES FIXNUM &REST T):
;   536870912

536870912

Hm, what gives? Didn't we tell lisp not to check for this? Reasoning about declarations is further complicated by compile-time optimisations like constant folding. When lisp compiled the form it was able to perform the addition at compile-time because we are adding constants; since the result will also be constant, there is no need to calculate it at run-time. When lisp did this it saw that our declaration that the result is a fixnum will definitely be wrong. The warning is lisp's way of telling us "You dummy, I'm ignoring your declaration because you can't be trusted." If we shuffle the expression around slightly so that lisp can't fold any constants, we can finally see the fixnum wrapping in effect:

* (funcall
    (compile nil
      `(lambda (a)
         ,**))
    most-positive-fixnum)

-536870912

Another important property of declarations is that they can shadow other declarations in the same sense that lexical variables can shadow other lexical variables. For instance, we might like to write a macro that performs safety checks even when it is embedded in code declared unsafe:

(defmacro error-checker ()
  `(safe-progn
     (declare (type integer var))
     do-whatever-other-error-checking))

Wrapping one more layer, we can use these macros to add error checking code around some code that needs to run quickly instead of safely, by nesting a use of the counterpart macro, fast-progn, inside safe-progn:

(defun wrapped-operation ()
  (safe-progn
    do-whatever-error-checking
    (fast-progn
      but-this-needs-to-go-fast)))

Safely verifying parameters with an error checking region surrounding a fast implementation of some functionality is a common pattern in high performance lisp code. Especially for iterative procedures like array traversal, dramatic improvements to run-time performance can be achieved by doing error checks like type and bounds checking up front at the start of an operation, then omitting them where possible while performing it.
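As a sketch of this pattern (the function name and the specific checks here are illustrative, not from the text), a vector-summing routine might hoist all of its verification out of the traversal loop like this:

(defun sum-fixnum-vector (v)
  (safe-progn  ; verify everything up front, at full safety
    (unless (typep v '(simple-array fixnum (*)))
      (error "Expected a simple fixnum vector: ~a" v))
    (fast-progn  ; then traverse at speed 3, safety 0
      (declare (type (simple-array fixnum (*)) v))
      (let ((acc 0))
        (dotimes (i (length v) acc)
          (incf acc (aref v i)))))))

The type check happens once per call; the unchecked aref happens once per element, which is where the time is actually spent.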

COMMON LISP is first and foremost designed to be powerful to program; efficiency is a distant secondary concern. However, these features, power and efficiency, don't necessarily represent a trade-off. With macros we can apply the power of lisp to solving efficiency problems. In addition to regular macros and read macros—which by themselves already provide considerable power—COMMON LISP also provides compiler macros. Compiler macros are macros in the same sense that the other types of macros are: they are programs that program. Compiler macros aren't well described in most lisp tutorials, which is indicative of just how often performance is a priority to programmers (almost never). However, compiler macros are elegant solutions to certain classes of efficiency problems and deserve to be in every lisp professional's tool-kit.

Compiler macros define transformations that your lisp compiler will apply to (named) function invocations. That means that you can take a function created with defun and tell lisp that instead of compiling calls to this function, it should instead compile some arbitrary code indicated by the compiler macro. Why use a function in combination with a compiler macro instead of just writing a macro with that name in the first place? The first, less important reason is that this allows us more control over when to absorb the overhead of compilation. In particular, COMMON LISP doesn't specify when or how often a macro will be expanded. In interpreted code it is possible that a macro will be expanded every single time it is invoked. When doing compile-time optimisation we want to perform a (presumably lengthy and expensive) calculation before running the function so as to reduce the amount of calculation the function itself has to do. Compiler macros give us a means of performing a lengthy compilation calculation exactly once, when we compile the code—as it should be.

But more importantly than performing the compilation calculation only once and at the right time, compiler macros are useful because they introduce dualities of syntax into the language. Compiler macros let us add a dual meaning to any form representing a (named) function call. In addition to its regular meaning, a compiler macro adds a compiled meaning. You are very strongly advised to ensure that your compiled meaning implements the same task as your regular meaning, but you are free to vary how it performs it (that's the point). The advantage of using dual syntax is that we can make changes to the efficiency of code without having to modify that code at all. We can take an existing code base—one that presumably uses lots of function calls—and change how that code is compiled by introducing dual syntax. All we have to do is find the function invocations that are unacceptably costly and then implement compiler macros to convert them into cheaper expansions.
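Before tackling keyword arguments, here is a minimal, hypothetical example of the mechanism (the names are illustrative): a function paired with a compiler macro that folds calls with constant numeric arguments down to a constant at compile-time, and declines to expand otherwise:

(defun add-squares (a b)
  (+ (* a a) (* b b)))

(define-compiler-macro add-squares (&whole form a b)
  (if (and (numberp a) (numberp b))
    (+ (* a a) (* b b))  ; fold the entire call at compile-time
    form))               ; decline: compile the normal function call

A compiled call like (add-squares 3 4) becomes the constant 25, while (add-squares x 4) is left alone and calls the function at run-time. This same function/compiler-macro pairing is what the rest of this section exploits.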

What sort of function invocations are costly? As a first example, recall from section 4.6, Reader Security, that functions can perform lambda destructuring and that this is a subset of the more general defmacro destructuring. When functions accept keyword arguments we pass them as grouped pairs of keyword symbols and their corresponding values. Keyword arguments are extremely useful, but functions that use them unfortunately suffer more invocation overhead than functions that don't. Destructuring isn't free. The compiler needs to compile code into the function that scans across a necessarily variable length argument list, gets the values into the right order (including inserting default values), and then actually executes the function. Generally, lisp compiles very fast code for destructuring these keyword arguments and so we almost never notice (or care about) this minor inefficiency. However, there are situations when we do care, particularly when we call such a function from within a performance-critical loop.

FAST-KEYWORDS-STRIP
(defun fast-keywords-strip (args)
  (if args
    (cond
      ((eq (car args) '&key)
        (fast-keywords-strip (cdr args)))
      ((consp (car args))
        (cons (caar args)
              #1=(fast-keywords-strip
                   (cdr args))))
      (t
        (cons (car args) #1#)))))

Fast-keywords-strip is a utility that takes a lambda destructuring list consisting of nothing but regular and keyword arguments and returns a list of the symbols used to refer to these arguments. In other words, it returns (a b c) when passed (a b c) or (a &key b (c 0)), but passing it (a &optional b c) is forbidden.
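A quick check at the REPL (the specific argument list here is just an illustration):

* (fast-keywords-strip '(a b &key (c 0) d))

(A B C D)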

DEFUN-WITH-FAST-KEYWORDS
(defmacro! defun-with-fast-keywords
           (name args &rest body)
  `(progn
     (defun ,name ,args ,@body)
     (defun ,g!fast-fun
            ,(fast-keywords-strip args)
            ,@body)
     (compile ',g!fast-fun)
     (define-compiler-macro ,name (&rest ,g!rest)
       (destructuring-bind ,args ,g!rest
         (list ',g!fast-fun
               ,@(fast-keywords-strip args))))))

Defun-with-fast-keywords is used in the same way as defun. Like defun, its first argument is a symbol naming the function, the second an argument list, and the rest forms for the function being defined to execute. Unlike defun, however, defun-with-fast-keywords forms can only be given regular and keyword arguments (no optionals, rests, etc). Exercise: Extend fast-keywords-strip to handle all lambda destructuring lists.

The expansion of defun-with-fast-keywords is complicated. It expands into four forms. The first defines the function just as if we had used regular defun instead. The second defines a function named with an automatic gensym, g!fast-fun. This function is similar to the first, except that it simply takes a non-keyword argument for every argument (keyword or not). The third form compiles this gensym-named function. Finally, a compiler macro is defined to transform calls to our first function into calls to the second function. So instead of having the first function perform the keyword destructuring, we take advantage of the compile-time knowledge of the form used to invoke the function and put the keywords together into the right order with destructuring-bind.
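To make this concrete, a definition like (defun-with-fast-keywords f (a &key b) (+ a b)) expands into approximately the following (the gensym names shown here are only illustrative):

(PROGN
  (DEFUN F (A &KEY B) (+ A B))
  (DEFUN #:FAST-FUN (A B) (+ A B))
  (COMPILE '#:FAST-FUN)
  (DEFINE-COMPILER-MACRO F (&REST #:REST)
    (DESTRUCTURING-BIND (A &KEY B) #:REST
      (LIST '#:FAST-FUN A B))))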

KEYWORDS-TESTS
(defun
  slow-keywords-test (a b &key (c 0) (d 0))
  (+ a b c d))

(compile 'slow-keywords-test)

(defun-with-fast-keywords
  fast-keywords-test (a b &key (c 0) (d 0))
  (+ a b c d))

We now have a (nearly) dual syntax with defun. A regular definition of a function with keyword arguments looks like slow-keywords-test. It is compiled for benchmarking purposes in the listing above. Fast-keywords-test is written identically to slow-keywords-test except it uses defun-with-fast-keywords instead of defun. It turns out we don't need to compile this function because defun-with-fast-keywords expands into a call to compile on the only one of its definitions that needs it—the one named by the automatic gensym g!fast-fun.

KEYWORDS-BENCHMARK
(defun keywords-benchmark (n)
  (format t "Slow keys:~%")
  (time
    (loop for i from 1 to n do
      (slow-keywords-test 1 2 :d 3 :c n)))
  (format t "Fast keys:~%")
  (time
    (loop for i from 1 to n do
      (fast-keywords-test 1 2 :d 3 :c n))))

(compile 'keywords-benchmark)

Keywords-benchmark is a simple function that uses the time macro to tell us how long an equivalent series of calls takes against both the functions. Notice that we compile keywords-benchmark too. More about benchmarking will be said in section 7.7, Writing and Benchmarking Compilers.

* (keywords-benchmark 100000000)
Slow keys:
; Evaluation took:
;   17.68 seconds of real time

Fast keys:
; Evaluation took:
;   10.03 seconds of real time

Calling this function 100 million times is enough for us to see that even when both functions are compiled, the function defined with defun-with-fast-keywords runs about 40% faster thanks to its compiler macro. Notice also that our compiler macro's performance doesn't rely on having the keyword arguments be constant values known at compile-time: we pass n, a varying lisp form, as the argument to the :c keyword. The compiler macro expands calls to the fast version into code identical to the slow version, except without the keyword destructuring overhead.

So why doesn't COMMON LISP do this for every function that accepts keywords and always avoid the overhead? Compiler macros are only applied at compile-time but we want to retain the ability to destructure arguments at run-time. Here is the important point about compiler macros: compiler macros are optimisations to function invocations, not to the functions themselves. In the case of keywords, compiler macros allow us to eliminate the overhead for compiled invocations of a function while still leaving the original function—and its keyword destructuring code—available for use at run-time. Compiler macros give us a dual syntax for two different operations which are distinguishable only by context. For another way to avoid keyword overhead see Norvig's PAIP[PAIP-P323].
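For example, because compiler macros apply only to direct call forms seen by the compiler, an invocation through apply or funcall still reaches the original keyword-destructuring function:

* (apply #'fast-keywords-test 1 2 '(:d 3 :c 4))

10

Both meanings compute the same result; only the route taken differs.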

What other function invocations can benefit from compiler macros? Not only can we reduce destructuring overhead, but often we can reduce the overhead of the function itself by pre-processing constant arguments. A compiler macro can perform some preparation at compile-time so it doesn't have to be done at run-time. One of the most obvious examples of this is the format function. Consider how format (or, in C, printf) works. It is a function that you pass a control string to at run-time. Format then processes the control string and prints the formatted output to a stream (or returns it as a string). In essence, when you use format you make a function call to a format string interpreter with the control string being the program. With compiler macros we can eliminate the function call, pre-process the control string, and change the function invocation to specialised code spliced into the call-site where the compiler can make further optimisations. Sounds difficult, doesn't it? We have to know how to convert format control strings into their equivalent lisp code. Luckily, as with so many other things, COMMON LISP has already thought about this. COMMON LISP does formatting right: the domain-specific language it specifies for creating formatted output can compile itself down to lisp code. This is part of lisp philosophy—everything should compile down to lisp. The macro that compiles control strings to lisp is formatter. When you give a control string to formatter it will expand into a lambda form that performs the desired formatting. For example, here is an expansion for a simple control string:

* (macroexpand '(formatter "Hello ~a~%"))

#'(LAMBDA (STREAM &OPTIONAL
                  (#:FORMAT-ARG-1783
                    (ERROR "Missing arg"))
                  &REST FORMAT::ARGS)
    (BLOCK NIL
      (WRITE-STRING "Hello " STREAM)
      (PRINC #:FORMAT-ARG-1783 STREAM)
      (TERPRI STREAM))
    FORMAT::ARGS)
T

So formatter expands into a lambda form. It has compiled the control string into a lisp form, suitable for evaluation or for macro embedding into other lisp code where it will become a compiled function or will be inlined into compiled code at the call-site. Notice however that the expansion of formatter must be passed a stream and cannot accept nil like format can. This is because the functions that formatter expands into (like write-string and terpri) require streams. Use the with-output-to-string macro to get around this.
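For instance, using the same control string as above, the stream requirement can be satisfied like so:

* (with-output-to-string (s)
    (funcall (formatter "Hello ~a~%") s 'world))

"Hello WORLD
"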

FFORMAT
(defun fformat (&rest all)
  (apply #'format all))

(compile 'fformat)

(define-compiler-macro fformat
                       (&whole form
                        stream fmt &rest args)
  (if (constantp fmt)
    (if stream
      `(funcall (formatter ,fmt)
         ,stream ,@args)
      (let ((g!stream (gensym "stream")))
        `(with-output-to-string (,g!stream)
           (funcall (formatter ,fmt)
             ,g!stream ,@args))))
    form))

Fformat is a perfectly transparent wrapper around format. It exists so we can define a compiler macro for formatting; we need a new function name because defining compiler macros over top of functions specified by COMMON LISP is forbidden[CLTL2-P260]. Our compiler macro takes advantage of a defmacro destructuring feature, &whole. We use this to bind form to the actual list structure of the macro invocation. This is done to take advantage of a feature of compiler macros: compiler macros can choose to not expand at all. If we return form, lisp will see that we are simply returning the form we were passed (by checking it with eq) and will ask the compiler macro for no further expansions of the form—even though we are expanding into a use of a function with a compiler macro. While compiling, we have elected to use the other meaning of the form. This is the fundamental difference between a compiler macro and a regular macro: a compiler macro can share an exact dual syntax with a function but a regular macro cannot. In fformat, the compiler macro elects not to expand into the more efficient meaning when its control string argument is non-constant, because we still want invocations with control strings generated at run-time (like calls to functions that return strings) to work. Such invocations obviously cannot use compile-time optimisations on the control strings.
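We can watch this decision being made by calling the compiler macro function by hand through compiler-macro-function, a standard COMMON LISP accessor. With a constant control string the call expands into the formatter version; with a non-constant one it would return the original form unchanged:

* (funcall (compiler-macro-function 'fformat)
           '(fformat t "Hello ~a~%" 'world)
           nil)

(FUNCALL (FORMATTER "Hello ~a~%") T 'WORLD)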

FFORMAT-BENCHMARK
(defun fformat-benchmark (n)
  (format t "Format:~%")
  (time
    (loop for i from 1 to n do
      (format nil "Hello ~a ~a~%" 'world n)))
  (format t "Fformat:~%")
  (time
    (loop for i from 1 to n do
      (fformat nil "Hello ~a ~a~%" 'world n))))

(compile 'fformat-benchmark)

Fformat-benchmark is nearly identical to the keywords-benchmark function presented earlier. It uses time to compare the time required to perform a large number of format operations using both regular format and the new fformat. Here are the results for a million iterations:

* (fformat-benchmark 1000000)
Format:
; Evaluation took:
;   37.74 seconds of real time
;   [Run times include 4.08 seconds GC run time]
;   1,672,008,896 bytes consed.

Fformat:
; Evaluation took:
;   26.79 seconds of real time
;   [Run times include 3.47 seconds GC run time]
;   1,408,007,552 bytes consed.

About a 30% improvement. Not only does the compiler macro decrease the time required to perform the formatting, but it also conses less (which in turn decreases its garbage collection time). The compiler macro has avoided the interpretation of the format string at run-time, instead performing most of the calculation only once when the function was compiled—as it should be. Unfortunately, benchmarks often obscure or eliminate important details. While pre-compiling format strings with fformat will eliminate the interpretation overhead, it will do so at the cost of a larger compiled program. Even if main memory is plentiful, large code can run more slowly by reducing instruction cache performance.

In this section we have looked at ways to customise the performance of code using regular macros, read macros, and a special type of macro designed just for this task: compiler macros. Hopefully this section and the remainder of this chapter will convince you that if you want to write really efficient code, you want COMMON LISP. And you want COMMON LISP because of macros.

Exercise: Download Edi Weitz's CL-PPCRE (described in section 4.4, CL-PPCRE) and see how api.lisp uses compiler macros. Visit Edi's website and download a few of his lisp packages that seem interesting.

Exercise: When we wrote the compiler macro for fformat we were forced to use gensym explicitly because there is no define-compiler-macro! macro. Fix this. Difficult Exercise: Define define-compiler-macro! so it uses the functionality of defmacro! and doesn't call gensym itself. Hint: Think outside the box.

Getting To Know Your Disassembler

It can be difficult to really get a sense for what is expensive in lisp without inspecting the raw instructions executed by your processor for different lisp forms. Just as it often helps to view expansions when writing macros, looking at the compiled expansions of lisp programs—usually assembly instructions—is similarly useful. Because lisp compilers can be and often are thought of as macro expanders, the machine code that they generate is, in a strange sort of sense, itself lisp code. Because lisp is not so much a language as it is a building material and fabric for creating languages, lisp is used to define and compile a language that just happens to be the same language as the instruction set of your processor.

COMMON LISP provides us a function called disassemble to look at compiled expansions. Disassemble is the analog of the CMUCL macroexpand-all extension described in [USEFUL-LISP-ALGOS2]. By passing disassemble a function, or a symbol for which a symbol-function binding exists, we can look at the raw machine code instructions that will be executed if the function is invoked.
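For instance, we can disassemble an anonymous compiled function directly (the output itself is entirely implementation- and platform-dependent, so none is shown here):

(disassemble
  (compile nil
    (lambda (a) (+ a 1))))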

The problem is that these raw machine code instructions don't look at all like lisp. Instead of lisp's soothing nested parenthesis, these instructions are usually strange, tiny steps for some very arbitrary machine. Looking at the compiled expansion of lisp code is similar to reading a poster with a magnifying glass. You can see any portion you like in great detail, but interpreting the big picture from this alone can be difficult or impossible. Worse, when looking at code in this level of detail, it can sometimes be impossible to look at any one piece of machine code and tell for certain why the compiler put it there.

Unfortunately, nobody really knows how best to implement lisp past the point where compile takes over. There is certainly a lot more macro expansion to be done to the code, some of it so well understood that it could probably be standardised, but the best way to use hardware resources like CPU cycles and memory is still (and might always be) a very hot research topic. Even harder to track than improvements in compiler design are the constant improvements to hardware. Optimisations that made sense initially may become irrelevant or even plain incorrect. We don't need to look far for examples of how a changing world affects assumptions about efficiency.

Scientists used to avoid using floating point computations in code that needed to perform well, instead opting for machine-word based, fixed point computations. This was because computers didn't have dedicated floating point hardware and were forced to use the processor's integer instructions to simulate it. Because the processors weren't really optimised for this, floating point operations were always much slower than fixed point operations. However, over time, hardware started to sprout dedicated floating point co-processors which were designed to perform these floating point operations at lightning speed. Almost over-night, scientists went from being able to assume that a fixed point operation would always be much faster than a floating point operation, to having to investigate and benchmark their hardware before deciding. Hardware developments changed the performance reality of floating point. Soon after, computers started to come with 2, 4, or more floating point co-processors and scientists discovered that if they were able to keep the pipeline of floating point instructions full, floating point operations could often perform even better than fixed point operations. Many programs that chose fixed point for performance reasons went—in the time-frame of a decade or so—from choosing the right implementation to choosing the wrong implementation.

DIS
(defmacro dis (args &rest body)
  `(disassemble
     (compile nil
       (lambda ,(mapcar (lambda (a)
                          (if (consp a)
                            (cadr a)
                            a))
                        args)
         (declare
           ,@(mapcar
               #`(type ,(car a1) ,(cadr a1))
               (remove-if-not #'consp args)))
         ,@body))))

Just as it is useful to look at the macroexpand and macroexpand-all output while developing macros, it is helpful to look at the disassemble output, not only to learn about how your implementation functions, but also to make sure you are giving lisp all the information it needs to generate efficient expansions. Dis is a macro that makes it easy to check the disassemble output for a portion of lisp code. Its first argument is a list of symbols or lists of a type and a symbol. To see how dis works, expand it. Here is what dis expands to for a simple binary addition:

* (macroexpand
    '(dis (a b) (+ a b)))

(DISASSEMBLE
  (COMPILE NIL
    (LAMBDA (A B)
      (DECLARE)
      (+ A B))))
T

Why is the empty declare form there? It is a place-holder where dis can insert type declarations if you specify them in the parameters like so:

* (macroexpand
    '(dis ((fixnum a) (integer b))
       (+ a b)))

(DISASSEMBLE
  (COMPILE NIL
    (LAMBDA (A B)
      (DECLARE (TYPE FIXNUM A)
               (TYPE INTEGER B))
      (+ A B))))
T

Because dis expands into a (wrapped) lambda form, it works a lot like one. You can add extra declarations if you like, and the return value is important (because lambda forms provide an implicit progn). With this book's code loaded, try entering this form into your lisp environment:

(dis (a b)
  (+ a b))

The machine code should be fairly short, but this is because much of the complexity is hidden from view with the invocation of a pre-compiled function—one with enough smarts to provide all the fancy lisp number features like type contagion, rational simplification, etc. This is called indirection and is probably fairly obvious in your disassembler output:

CALL #x1000148 ; GENERIC-+

Try it with three arguments:

(dis (a b c)
  (+ a b c))

Exercise: How many indirections to the generic addition function do you see? How about with N where (<= 0 N) arguments?

Now try locking down a type for one of the variables. Compare this closely with the earlier example where no types were declared:

(dis ((fixnum a) b)
  (+ a b))

Some sort of OBJECT-NOT-FIXNUM-ERROR should now be apparent. Lisp has compiled in some extra code to do this type checking while still indirecting control to the generic addition function, because the type of b is unknown at compile-time and thus might require all of lisp's fancy numerical behaviour like contagion.

This is not how to get fast code. In fact this code might even be slightly less efficient than the previous. For fast code, we need to take advantage of a process called inlining. For some special operations, when enough type information is available, lisp compilers know how to avoid indirection and directly add machine code into the function being compiled to perform the desired operation. There should be no indirection to a generic addition function in the following:

(dis ((fixnum a) (fixnum b))
  (+ a b))

This inlining process probably resulted in more machine code than the version that employed indirection. This is because some (but not all) of the functionality that was implemented in the generic addition function was copied into the function that we compiled. Although it appears longer, in some circumstances this code will perform better because of reduced indirection.

But this mess of machine code is still a lot less efficient than might be possible in a C implementation. There are still all sorts of argument count, type, and overflow checks compiled in—so many that the actual cost of the addition is still low compared to its overhead. If we used this function in a loop, this overhead might be unacceptable.

With languages like C you specify types everywhere and enforce safety nowhere so code is always efficient, never safe, and always annoying to write. With most dynamic Blub languages you specify types nowhere and enforce safety everywhere so code is always safe, never annoying, but also never efficient. With most strong, static Blub languages you specify types everywhere and enforce safety everywhere so code is always efficient and always safe, but always annoying. Lisp gives you choice. Because lisp defaults to safe-and-not-annoying mode, lisp programs often seem slightly slower than their C equivalents, but are almost always safer. Because lisp allows programmers an excellent type declaration system and implementations have such excellent compilers, lisp programs are almost always just as safe as their dynamic Blub equivalents, and usually a whole lot faster. Most importantly, lisp has macros so if something is annoying, well, change it!

Let's go ahead and ask lisp to make our addition fast. Recall that sharp-f is a read macro abbreviation for a high-speed, low-safety declaration.

(dis ((fixnum a) (fixnum b))
  #f
  (+ a b))

This sequence of machine code instructions should be a bit shorter than before. The type checks and argument count checks should be removed. But this still isn't quite the single instruction, bang-off, dangerous fixnum addition we were looking for. To get some insight into what is happening, we should check for compiler notes. A note is an observation made by the compiler that essentially says: "You look like you're trying to do something efficient, and you're almost there, but I need a little clarification on your intent. Here's a tip for making it more clear..."

Compiler notes are invaluable sources of information. When trying to create efficient lisp code, you should read and consider them carefully. Lisp compilers use type inference systems to discover intricate properties of code that even you, the programmer, might not have considered. In the above example, our compiler should somewhere give us a note like:

; Note: Doing signed word to integer coercion
;       (cost 20) to "<return value>".

Lisp doesn't do anything silly like ignore a fixnum overflow unless we explicitly ask it to12. So in order to get lisp to throw caution to the wind and write us a function that really burns, but is possibly unsafe, we need to avoid the signed word (fixnum) to integer (bignum) check and coercion. We need to tell lisp that overflows are acceptable and, yes, we really want to silently return a fixnum:

(dis ((fixnum a) (fixnum b))
  #f
  (the fixnum (+ a b)))

Now we're burning. This is roughly equivalent to a C fixnum addition function: a few machine instructions that add together two registers and then return control to the caller. While your disassembler can offer many insights into all areas of lisp efficiency, there are two main skills that it will teach you. The first skill was mostly covered in this section: how to use declarations to get efficient numerical behaviour, especially inside of loops. The second is how to efficiently use array/vector data-structures. This will be discussed in section 7.4, Pointer Scope.
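Outside of dis, the same declarations can be written directly into an ordinary function. The following is a sketch only: the name fast-add is ours, and the explicit optimize declaration stands in for this book's #f read macro.

```lisp
;; A hypothetical fast-add, equivalent in spirit to the dis example
;; above: declared fixnum arguments, speed favoured over safety, and
;; a fixnum assertion on the result so the compiler can skip the
;; signed word to integer (bignum) coercion.
(defun fast-add (a b)
  (declare (type fixnum a b)
           (optimize (speed 3) (safety 0)))
  (the fixnum (+ a b)))
```

Compiling fast-add and then disassembling it should show roughly the same handful of machine instructions as the dis version.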

Just like technology advancements that changed floating point arithmetic's efficiency reality from something that should be avoided to something that should be exploited, advancements in lisp compiler technology—combined with COMMON LISP's rich type and safety declaration systems—are changing how we think about efficiency. With these tools, and the growing complexity demands of software systems13, the question is changing from how to make lisp as efficient as low-level languages to how to make other languages as efficient as lisp. The answer, of course, is to use macros to implement them on lisp.

Pointer Scope

Does removing pointers from a language reduce the power of the language? In particular, does lisp's lack of explicit pointer scope prevent us from efficiently implementing algorithms specified in terms of pointer arithmetic? It turns out no, lack of direct support for pointers in lisp does not pose a theoretical or practical problem. Any algorithm or data structure that can be implemented with pointers in a language like C can be implemented just as well or better in lisp.

But, really, what is pointer scope and why might we want to use it? Pointer scope involves treating the computer's memory (or virtual memory) as a large, indexable array from which it can load and store fixnum values. Does this sound dangerous? It should, because it is the source of many complex bugs and the direct cause of several of the largest classes of software security problems today.

Pointer scoping is really a way of specifying indirections, that is, accesses across environments, that just happens to be tied to fixnum arithmetic. How do we normally program across environments? We use either the lexical or dynamic scoping provided by COMMON LISP, combinations of the two, or new types of scoping created by macros. The pointer-& macro and the pointer-* function are examples that sketch some of the illusion of pointer scope for us, showing that when you think you need a pointer, you probably really need a closure. The only mention of this analogy between pointers and closures that I have come across was in a post to the comp.lang.scheme newsgroup from Oleg Kiselyov[POINTERS-AS-CLOSURES]. He suggested using closures to emulate pointers and provided an implementation for Scheme14.

POINTER-SCOPING-SKETCH
(defmacro! pointer-& (obj)
  `(lambda (&optional (,g!set ',g!temp))
     (if (eq ,g!set ',g!temp)
       ,obj
       (setf ,obj ,g!set))))

(defun pointer-* (addr)
  (funcall addr))

(defsetf pointer-* (addr) (val)
  `(funcall ,addr ,val))

(defsetf pointer-& (addr) (val)
  `(setf (pointer-* ,addr) ,val))

Pointer-& and pointer-* show a possible way to mimic pointer indirection through closures. When we use the pointer-& macro it expands into a lambda form that has some smarts in it to determine whether you would like to get or set the value, and acts accordingly. Pointer-& uses gensyms to do this. Instead of using them as names for bindings so that unwanted variable capture is avoided at compile-time, pointer-& uses them to ensure that there is no possible run-time capture, where we would be prevented from setting a closure's value to a certain value because it collides with our implementation. For instance, we might have chosen the lisp default of nil for this, which would usually work except if we tried to pass nil as an argument. Gensyms are convenient to use at run-time because we know there will never be another value eq to a gensym. That is their raison d'être.
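As a quick check that the gensym trick works, we can store nil through a pointer, something that would fail if nil itself had been used as the unsupplied-argument marker. This is a sketch assuming the book's pointer-scoping code above is loaded:

```lisp
* (let ((x t))
    ;; Passing nil through the closure must be treated as a set,
    ;; not as the "no argument supplied" case.
    (setf (pointer-* (pointer-& x)) nil)
    x)

NIL
```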

Pointer-* and its defsetf are the framework for accessing these indirected values through generalised variables. The defsetf for pointer-& is there so that expansions of pointer-& will know how to set nested indirections. As a simple example, we can create a closure that mimics the pointer to a pointer pattern common in C by creating a reference to a binding in a let environment:

* (let ((x 0))
    (pointer-& (pointer-& x)))

#<Interpreted Function>

Let's store this closure for later use by transferring it from the * special variable (let's keep all these asterisks straight):

* (defvar temp-pointer *)

#<Interpreted Function>

Now we can dereference this closure:

* (pointer-* temp-pointer)

#<Interpreted Function>

It appears we have another closure. We have only dereferenced one step of the pointer chain. Using the * special variable to refer to the previous result, let's dereference further:

* (pointer-* *)

0

The 0 is the original object that we pointed to. We can also use this dereferencing syntax—which is of course an illusion over closures—to set the value of this object through the pointer chain:

* (setf (pointer-* (pointer-* temp-pointer)) 5)

5

Sure enough, this changes the original let environment being pointed to so that it has a new value, 5:

* (pointer-* (pointer-* temp-pointer))

5

If we like, we can add another layer of indirection:

* (pointer-& temp-pointer)

#<Interpreted Function>

Which now requires three dereferences:

* (pointer-* (pointer-* (pointer-* *)))

5

And itself can be accessed as a generalised variable:

* (setf (pointer-* (pointer-* (pointer-* **))) 9)

9

Even though they may be at different levels of indirection, all of the closures in this dereferencing chain still point back to the original let environment:

* (pointer-* (pointer-* temp-pointer))

9

But this probably isn't what you thought we meant by pointer scope. Because most computer processors consider memory a big array of fixnums, and because C was designed around the capabilities of existing processors, C's pointer scoping is permanently tied to fixnum arithmetic. In C, when you dereference a pointer you always know exactly what is happening: the compiler compiles in code to index into memory with a fixnum and retrieve or set a fixnum value. The largest difference between C's pointer scoping and our above closure dereferencing technique is that while C allows us to change where a pointer is pointing by adding or subtracting fixnums to it, the closures compiled by pointer-& and accessed with pointer-* are fixed. The code to access and set them—whatever that might be—is added to the indirection environment at compile time. Even in our simple example above, we used at least two different types of closures, both of which, thanks to generalised variables, were accessible through a uniform dereferencing syntax. The x we originally referred to was a lexical variable and the temp-pointer tunnel variable that we pointed to was a dynamic variable. As we saw in section 6.7, Pandoric Macros, we can customise closures, and thus indirection, in any way we want.

So closures are actually more flexible and less dangerous than C style pointers. When you think you need a pointer, you probably need a closure. Rather than just being a fixnum that can be used as an address, closures are code that is compiled to retrieve and set any sort of data in any sort of environment. Although for most tasks closures are the best construct to accomplish indirection, sometimes we would like to take advantage of our processor's fixnum addressing functionality to achieve extremely efficient code. C lets us do it; COMMON LISP lets us do it better.

Using C-style pointers in lisp is actually very straightforward and doesn't require a departure from our usual lisp technique. We simply provide a fixnum array and use numeric indices to index into that array—thinking about it exactly like C. We then use declarations to get lisp to drop type and safety checks so it is compiled exactly like C. Finally, we use macros to make the whole process convenient and safe.

In general, indexing into an array is a complex, slow procedure. The compiler needs to check that your index is numeric, that you are actually indexing an array, and that the index is within the bounds of the array. Furthermore, arrays of different types can have different code to access the elements. With this book's code loaded, try evaluating the following form (dis is described in detail in section 7.3, Getting To Know Your Disassembler):

(dis (arr ind)
  (aref arr ind))

Because aref can mean so many possible things when no types are known, your compiler will probably not inline the array access code. In the above disassembly output, you should see a function call to something like CMUCL's data-vector-ref. Exercise: Get the source code for your lisp environment and examine this function. In CMUCL it is in the file array.lisp. Also examine the other functions in that file, including data-vector-set. If your lisp environment doesn't come with complete source code, or you aren't able to do anything you want with the source code you do have, upgrade your COMMON LISP environment as soon as possible.

Just like COMMON LISP can inline the function + when it has enough type information, it can also inline aref. Try the following form:

(dis (((simple-array fixnum) arr)
       (fixnum ind))
  (aref arr ind))

The above should have removed the indirection to the general array reference function. Simple arrays are arrays of one dimension where the elements are adjacent in memory like C-style memory. In the above we specified fixnum as our array element, but your COMMON LISP environment probably also provides types for fixnums of different size, bytes, unsigned bytes, floats, double floats, and much more. Although the above doesn't contain the indirection, it still has lots of code that implements the type and safety checks we usually rely on when programming lisp. However, just as we can use the sharp-f read macro from section 7.2, Macros Make Lisp Fast to tell lisp to make arithmetic fast, the same can be done for array references:

(dis (((simple-array fixnum) arr)
      (fixnum ind))
  #f
  (aref arr ind))

Unlike our earlier arefs, the performance of this bit of code will not be dominated by type and safety checks. This is the code that should be used inside of performance critical loops. Notice that because we have removed almost all of the safety features from this code, it is just as dangerous as its C equivalent. In particular, it could suffer from buffer overflow problems15. With C you program this way everywhere. With lisp you program safely everywhere except where performance matters, tuning the hot-spots of the code to make your whole program run faster. Thanks to macros, these hot-spots can be arbitrarily small. There is no need to compile, for example, entire functions in fast/dangerous mode. Macros allow us to optimise narrow, specific parts of an expression. Fast code can transparently co-exist with safe code and macros let us trade away the least safety necessary in order to achieve the required performance.

If you have made it this far in the book you should already have a good grasp of macro authoring and declarations, so there is not much more that needs to be said regarding pointer scoping. In short, C provides a very specific domain-specific language for controlling the CPU based on fixnum arithmetic, but you can write much better languages using macros. Efficient pointer scoping (which we can now confess really means array access—closure examples notwithstanding) is mostly a matter of knowing how macros work, how declarations work, and how to read your disassembler.

WITH-FAST-STACK
(defmacro! with-fast-stack
           ((sym &key (type 'fixnum) (size 1000)
                      (safe-zone 100))
            &rest body)
  `(let ((,g!index ,safe-zone)
         (,g!mem (make-array ,(+ size (* 2 safe-zone))
                             :element-type ',type)))
     (declare (type (simple-array ,type) ,g!mem)
              (type fixnum ,g!index))
     (macrolet
       ((,(symb 'fast-push- sym) (val)
            `(locally #f
               (setf (aref ,',g!mem ,',g!index) ,val)
               (incf ,',g!index)))
         (,(symb 'fast-pop- sym) ()
            `(locally #f
               (decf ,',g!index)
               (aref ,',g!mem ,',g!index)))
         (,(symb 'check-stack- sym) ()
            `(progn
               (if (<= ,',g!index ,,safe-zone)
                 (error "Stack underflow: ~a"
                        ',',sym))
               (if (<= ,,(- size safe-zone)
                       ,',g!index)
                 (error "Stack overflow: ~a"
                        ',',sym)))))
         ,@body)))

An example macro that efficiently accesses arrays is with-fast-stack. This macro was chosen to give an opportunity to discuss something called amortisation. With-fast-stack implements a stack data-structure named by sym. Unlike COMMON LISP push and pop stacks that use cons cells to store stacks of elements of any type, these stacks use a simple array to store elements of a fixed type which can be specified by the :type keyword. The array is also of a fixed size but this size can be chosen through the :size keyword. The stack is accessed by using some macrolet defined local macros. If your stack's name is input, the macros bound will be fast-push-input, fast-pop-input, and check-stack-input. Examine a compiled expansion with dis:

* (dis ((fixnum a))
    (with-fast-stack (input :size 2000)
      (loop for i from 1 to 1000000 do
        (fast-push-input a))))

The fast-push-input operation compiles into very tight (and very unsafe) machine code:

;;; [8] (FAST-PUSH-INPUT A)
MOV     ECX, [EBP-20]
MOV     EDX, [EBP-16]
MOV     EAX, [EBP-12]
MOV     [ECX+EDX+1], EAX
MOV     EAX, [EBP-16]
ADD     EAX, 4
MOV     [EBP-16], EAX

But the looping was compiled safely as usual, implementing error checking and indirection to arithmetic functions, even though it is inside a with-fast-stack macro.

;;; [7] (LOOP FOR I FROM 1...)
...
CALL    #x100001D0  ; #x100001D0: GENERIC-+
...
CALL    #x10000468  ; #x10000468: GENERIC->

Clearly this loop will not run as fast as it could. Its performance will be dominated by the looping overhead, not the stack operation. If we needed speed, we could declare i to be a fixnum and add speed declarations to the loop as we have seen before. Safe code can co-exist with fast code. The code we just disassembled is, of course, extremely dangerous. It doesn't ever check the height of the stack to see if we overflowed or underflowed past our boundaries. That was something we were trying to avoid for efficiency's sake. The solution offered by with-fast-stack is inspired by the ?stacks word in the forth programming language. By using the check-stack-input local macro, our code can verify that the stack is within bounds and throw an error otherwise. Because forth is designed to perform well on the most limited of hardware platforms, forth amortises the cost of performing the bounds check. Instead of performing it after every operation like lisp will do by default, it only does it every N operations. In forth, this word is often invoked only after evaluating a form in the REPL (we will have more to say about forth in chapter 8, Lisp Moving Forth Moving Lisp). So instead of checking the bounds every operation, we could do it every 10 operations, perhaps reducing the bounds checking cost by 90%16. When we do check the stack we know that, at worst, it will be 10 elements out of bounds. Or maybe there is some convenient, non-performance-critical place in your code that the check macro can go.
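One possible shape for such amortised checking is sketched below. The batch size of 10 and the loop structure are our own choices, and the code assumes the with-fast-stack macro above is loaded:

```lisp
;; Check the stack bounds once per batch of 10 pushes instead of
;; after every push, amortising the cost of the check. At worst
;; the stack is 10 elements out of bounds when the check fires,
;; which is what the safe zones are there to absorb.
(with-fast-stack (input :size 2000)
  (loop for i from 1 to 100 do
    (check-stack-input)
    (loop repeat 10 do
      (fast-push-input i))))
```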

Another feature of with-fast-stack is that the arrays it creates have safe zones. That is, it allocates extra memory on either side of the stack as a run-away lane in case you screw up. It doesn't mean that running into these safe zones is a good idea (especially the underflow one) but it is better than running into memory you didn't allocate.

As mentioned, the code we just compiled is very dangerous and can write fixnums into memory that wasn't allocated to it. Don't ever do this. Exercise: Do this. Here is what happened for me:

* (compile nil
    '(lambda (a)
       (declare (type fixnum a))
       (with-fast-stack (input :size 2000)
         (loop for i from 1 to 1000000 do
           (fast-push-input a)))))

#<Function>
NIL
NIL

The dangerous code compiled just fine. Let's try running it:

* (funcall * 31337)

NIL

Well, it wasn't the disaster we were afraid of. Did anything bad happen?

* (compile nil '(lambda () t))

; Compilation unit aborted.

Hm that doesn't sound good.

* (gc)
Help! 12 nested errors.
KERNEL:*MAXIMUM-ERROR-DEPTH* exceeded.
** Closed the Terminal

NIL

That definitely doesn't sound good. Because lisp is a process running on unix, it is also possible to receive signals indicating you have written outside your allocated virtual memory (called a segfault). CMUCL handles these as recoverable conditions (though you should almost always reload your lisp image):

Error in function UNIX::SIGSEGV-HANDLER:
   Segmentation Violation at #x58AB5061.
   [Condition of type SIMPLE-ERROR]

In these states, the lisp image is said to be hosed. Programs that have the potential to become hosed like this are security disasters waiting to happen. The difference between C and lisp is that while C has this potential almost everywhere, lisp has it almost nowhere. If we ever need to take the risks of array based pointer scoping, lisp macros are the least obtrusive and safest means for doing so. Of course you almost never want to take these risks—stick to closures.

Tlists and Cons Pools

This section is about memory management, but it might not tell you what you want to hear. I was reluctant to even include it because I was afraid of perpetuating an incorrect myth about lisp, the mistaken notion that consing is slow. Sorry, but it just isn't true; consing is actually fast[GC-IS-FAST]. Algorithms that minimise indefinite extent storage are usually ideal, of course, but most algorithms can be more easily and directly written by consing. Do not be afraid to cons when you need memory. In fact, sometimes an excellent optimisation that can be made in lisp is to adapt an algorithm into a form where it can be implemented with cons cells so as to benefit from a tuned lisp garbage collector. Just as writing your own hash-table implementation is probably a bad idea, hacking up your own memory allocation routines is probably just as dumb. That said, this section explains some ways to do it. Surprise, we do it with macros.

Before we come back to memory allocation, we take a quick related detour. Although COMMON LISP is the tool of choice for the professional lisp programmer, many of the best introduction-to-lisp textbooks have been written about Scheme. The generally most revered is Structure and Interpretation of Computer Programs[SICP] by Hal Abelson, Jerry Sussman, and Julie Sussman. SICP17 has been alternatively idolised or endured by freshmen students at MIT, where it was first introduced, for decades. Scheme's appeal to academia has been deep and pervasive. Most macro professionals start their lisp experience with Scheme—only when they are ready to start programming serious macros do they migrate to the macro hacker's language: COMMON LISP.

But when you migrate, you always take things with you. There is no escaping your past—your roots. If your roots lie with Scheme and you've read SICP, you probably remember queues (also see [USEFUL-LISP-ALGOS1-CHAPTER3]). An alternative description of them, the one we use here, is from another good Scheme book, Schematics of Computation[SCHEMATICS] and is called the tlist. A tlist is a data-structure named after its inventor, an Interlisp hacker named Warren Teitelman. Although tlists are provided as Scheme code in Schematics of Computation, we present them here as a port to COMMON LISP.

TLISTS
(declaim (inline make-tlist tlist-left
                 tlist-right tlist-empty-p))

(defun make-tlist () (cons nil nil))
(defun tlist-left (tl) (caar tl))
(defun tlist-right (tl) (cadr tl))
(defun tlist-empty-p (tl) (null (car tl)))

As we can see in the constructor, make-tlist, a tlist is just a cons cell. But instead of using the car as the element and cdr as the next cons like a regular list, a tlist uses the car to point to the first cons in the real list, and the cdr to point to the last. If the car of a tlist is nil, the tlist is considered empty. Unlike regular lists, empty tlists are distinct (not eq). For tlists, the car of the cons cell acting as a tlist points to a cons cell that holds the left element of the tlist. The cdr points to a cons that holds the right.

The functions tlist-left and tlist-right return the left and right elements of the tlist without modifying the tlist. If the tlist is empty, these functions return nil. If you only use these functions you will not be able to store nil in your tlist. Luckily you can check to see if a tlist is empty before you use it with the tlist-empty-p predicate, and can thus store nil.

Because doing it is so easy, we have decided to tell the compiler that all of these functions can be inlined18. This will allow the lisp compiler to generate more efficient expansions for the tlist functions. In some languages that don't provide much compiler control—like C—primitive macro systems are used to ensure that functions like our tlist utility are inlined. In lisp, where we control the compiler completely, there is no need to use macros for this purpose. The macros in this chapter are about much more than inlining.

TLIST-ADD
(declaim (inline tlist-add-left
                 tlist-add-right))

(defun tlist-add-left (tl it)
  (let ((x (cons it (car tl))))
    (if (tlist-empty-p tl)
      (setf (cdr tl) x))
    (setf (car tl) x)))

(defun tlist-add-right (tl it)
  (let ((x (cons it nil)))
    (if (tlist-empty-p tl)
      (setf (car tl) x)
      (setf (cddr tl) x))
    (setf (cdr tl) x)))

We can add elements to the left of a tlist with the tlist-add-left function, and to the right with tlist-add-right. Because a pointer to the end of the list is maintained, adding elements to either end of a tlist is a constant time operation with respect to the length of the tlist. However, in general, addition to a tlist is not quite a constant time operation because of the memory allocation overhead imposed by consing. Its use of cons means tlist addition usually incurs the aggregated overhead of garbage collection.
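Because tlists are transparent cons structure, we can watch the pointers directly at the REPL (assuming this book's tlist code is loaded):

```lisp
* (let ((tl (make-tlist)))
    (tlist-add-right tl 1)
    (tlist-add-right tl 2)
    tl)

((1 2) 2)
```

The car of the tlist is the list (1 2), and the cdr is the last cons of that very same list, which is why it prints as the trailing 2.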

TLIST-REM-LEFT
(declaim (inline tlist-rem-left))

(defun tlist-rem-left (tl)
  (if (tlist-empty-p tl)
    (error "Remove from empty tlist")
    (let ((x (car tl)))
      (setf (car tl) (cdar tl))
      (if (tlist-empty-p tl)
        (setf (cdr tl) nil)) ;; For gc
      (car x))))

Only removing an item from the left of a tlist is supported with the given functions. Because we only keep a pointer to the first and last elements of the tlist, the only way to find the second-to-last element is to traverse the entire list, starting with the left of the tlist.
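A short REPL session (again assuming this book's code is loaded) shows elements coming off the left end in order:

```lisp
* (let ((tl (make-tlist)))
    (tlist-add-left tl 'a)
    (tlist-add-right tl 'b)
    ;; Remove A from the left; B is now the leftmost element.
    (list (tlist-rem-left tl) (tlist-left tl)))

(A B)
```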

A tlist is a queue abstraction built on top of cons cells that is especially useful because it is a transparent data structure. While some data structures that implement tlist functionality—like queues—only provide you with a limited interface to the data structure, tlists are specified directly as cons cells. Instead of inventing some API to hopefully meet everyone's needs, Teitelman decided to tie the specification of the tlist directly to the lisp cons cell. This design decision is what separates the tlist from other queue implementations. When programming with transparent specifications, instead of making special API functions to do things, the code is the API.

TLIST-UPDATE
(declaim (inline tlist-update))

(defun tlist-update (tl)
  (setf (cdr tl) (last (car tl))))

If we decide to access the car of a tlist and modify its contents, we need to make sure that the tlist remains consistent. Assuming after our manipulation the desired list is stored in the tlist's car, we can use tlist-update to set the cdr appropriately19.
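For example, if we splice conses directly into the tlist's car, tlist-update repairs the right pointer (a sketch assuming this book's code is loaded):

```lisp
* (let ((tl (make-tlist)))
    (tlist-add-right tl 1)
    (setf (cdar tl) (list 2 3)) ; manipulate the underlying list directly
    (tlist-update tl)           ; point the cdr back at the last cons
    (tlist-right tl))

3
```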

So the main benefit of a tlist is to emulate regular lisp lists as closely as possible, while enabling an operation that adds elements to the end in constant time. Because tlists use cons just like regular lists, they have exactly the same memory overheads.

COUNTING-CONS
(defvar number-of-conses 0)

(declaim (inline counting-cons))

(defun counting-cons (a b)
  (incf number-of-conses)
  (cons a b))

COMMON LISP doesn't specify much functionality for monitoring or controlling memory allocation. So let's write some. First off, recall from section 3.5, Unwanted Capture that we are not allowed to re-define or re-bind a function specified by COMMON LISP. We can't intercept calls to cons directly so we instead use a wrapper. Counting-cons is identical to cons except that every time it is called it increments the variable number-of-conses.

WITH-CONSES-COUNTED
(defmacro! with-conses-counted (&rest body)
  `(let ((,g!orig number-of-conses))
     ,@body
     (- number-of-conses ,g!orig)))

With-conses-counted is our primary interface for examining the value of number-of-conses. Its expansion records the initial value of number-of-conses, performs the operations provided in the macro body, then returns the number of times that counting-cons was invoked during those operations.

COUNTING-PUSH
(defmacro counting-push (obj stack)
  `(setq ,stack (counting-cons ,obj ,stack)))

An unfortunate consequence of our strategy of wrapping cons as counting-cons is that any routines we want to examine for memory performance need to be re-written to use counting-cons, as in counting-push. Here we can see that each time it is invoked, counting-push calls counting-cons exactly once:

* (let (stack)
    (with-conses-counted
      (loop for i from 1 to 100 do
        (counting-push nil stack)
        (pop stack))))

100

The pop operator above removes elements, and the cons cells used to store them, from the stack. What happens to these cons cells? They become garbage. Normally lisp spews out this garbage everywhere and nobody cares because COMMON LISP environments have excellent recycling programs called garbage collectors that reclaim this storage. However, collecting garbage isn't free—somebody has to pay for the garbage to be picked up, transported to wherever it goes, and processed into something again fit for use. What if we could create mini-recycling programs on-site? For example, the above loop invoked counting-cons 100 times, generating 100 pieces of garbage that need collecting. However, a quick look at the code reveals that stack never has more than one item on it at a time. If we recycled this one cons cell so it is available for counting-push again, we could avoid calling counting-cons to get another cons cell. This concept is known as a cons pool. In addition to reducing the pressure on the garbage collector, cons pools can also help improve the locality of data structures that allocate memory often.

WITH-CONS-POOL
(defmacro with-cons-pool (&rest body)
  `(let ((cons-pool)
         (cons-pool-count 0)
         (cons-pool-limit 100))
     (declare (ignorable cons-pool
                         cons-pool-count
                         cons-pool-limit))
     ,@body))

With-cons-pool is one way we can create cons pools. Notice that this macro expands into a let form, creating bindings for cons-pool, cons-pool-count, and cons-pool-limit. These variables will be used to store cons cells that can be recycled. Because it introduces variables invisibly, with-cons-pool is an anaphoric macro. Notice also that, because COMMON LISP provides a dual syntax for lexical and dynamic variables, the anaphoric bindings this macro's expansion creates may be dynamic or lexical, depending on whether or not the anaphora are declared special at the site of the macro's use.

CONS-POOL-CONS
(defmacro! cons-pool-cons (o!car o!cdr)
  `(if (= cons-pool-count 0)
     (counting-cons ,g!car ,g!cdr)
     (let ((,g!cell cons-pool))
       (decf cons-pool-count)
       (setf cons-pool (cdr cons-pool))
       (setf (car ,g!cell) ,g!car
             (cdr ,g!cell) ,g!cdr)
       ,g!cell)))

Cons-pool-cons expands into code that allocates cons cells from a cons pool. It assumes that it is inside the lexical scope of a with-cons-pool or, if the anaphora are declared special, that dynamic bindings for them currently exist. Cons-pool-cons only invokes counting-cons when its pool is empty; the pool itself never holds more than cons-pool-limit cells.

CONS-POOL-FREE
(defmacro! cons-pool-free (o!cell)
  `(when (<= cons-pool-count
             (- cons-pool-limit 1))
     (incf cons-pool-count)
     ;; relink the freed cell itself into the pool's spine
     ;; (a plain push would cons a fresh spine cell, and
     ;; cons-pool-cons hands out spine cells)
     (setf (car ,g!cell) nil
           (cdr ,g!cell) cons-pool
           cons-pool ,g!cell)))

If we have determined that we no longer need a cons cell, we can move it into the cons pool by freeing it with cons-pool-free. When this is done, the code must promise to never again access the cons cell that it just freed. The code that cons-pool-free expands into will push the freed cons cell onto cons-pool and increment cons-pool-count, unless cons-pool-count has already reached cons-pool-limit, in which case the cell is simply left for the garbage collector. Notice that you are not required to cons-pool-free your cons cells when you have determined you no longer need them because the garbage collector will still be able to determine when they are no longer needed. Freeing them is simply an efficiency optimisation we can make if we know a little bit of extra information that lisp doesn't.
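To see the pool in action directly, we can free each cell back into the pool as soon as we take it. Only the very first cons-pool-cons needs to allocate:

```lisp
;; Repeatedly allocate and free a single cell. Every
;; iteration after the first reuses the pooled cell, so
;; counting-cons runs exactly once.
(with-cons-pool
  (with-conses-counted
    (loop repeat 100 do
      (cons-pool-free
        (cons-pool-cons nil nil)))))
```

This form should return 1.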

MAKE-CONS-POOL-STACK
(defmacro make-cons-pool-stack ()
  `(let (stack)
     (dlambda
       (:push (elem)
         (setf stack
               (cons-pool-cons elem stack)))
       (:pop ()
         (if (null stack)
           (error "Tried to pop an empty stack"))
         (let ((cell stack)
               (elem (car stack)))
           (setf stack (cdr stack))
           (cons-pool-free cell)
           elem)))))

So the design of cons pools consists of two macros: one that creates anaphora, invisibly introducing either lexical or special bindings, and another that invisibly consumes these anaphora. Typically, a third macro is used to combine them. Make-cons-pool-stack is one such example. It creates a data structure similar to the COMMON LISP stack, which, of course, is really just a list updated with the push and pop macros. However, our data structure is different because it isn't transparently specified. The implementation details of these stacks are separate from how they are actually used. This is important because we do not want users of our stacks pushing and popping data with their own methods; instead we want them to use our memory-optimised versions. Make-cons-pool-stack uses dlambda from section 5.7, Dlambda. Here is an example where we create a lexical cons pool enclosing a new stack data structure, and then push and pop an item 100 times:

* (with-cons-pool
    (let ((stack (make-cons-pool-stack)))
      (with-conses-counted
        (loop for i from 1 to 100 do
          (funcall stack :push nil)
          (funcall stack :pop)))))

1

Notice that counting-cons—which is the only memory allocation function used—is only invoked once. The one cons cell that is ever required is recycled instead of being collected. If this loop occurred in compiled code, and the loop iterated enough times, the cons pool version could be expected to execute faster, simply because the garbage collector will not be invoked. Often more importantly, our loop will not have unexpected pauses in execution when the garbage collector runs. Of course, we almost never notice these pauses because lisp is smart enough not to do a full garbage collection all at once, instead amortising the operation with a technique known as incremental collection. Garbage collectors also implement an optimisation called generational collection where recently allocated memory is collected more often than old memory. Surprisingly, this turns out to be a type of reference counting[UNIFIED-THEORY-OF-GC].

But with cons pools we can cons less (or not at all) and thus reduce (or eliminate) garbage collection execution time indeterminism. Most lisp systems also have a way to temporarily disable the garbage collector so you can execute something without pausing and, instead, pause a bit longer at some point in time where you don't care about such pauses. In CMUCL you can use the gc-on and gc-off functions. Also see the code in signal.lisp. Exercise: Disable the garbage collector then cons up a bunch of garbage in a loop. Use the unix top program to monitor your memory usage.

MAKE-SHARED-CONS-POOL-STACK
(with-cons-pool
  (defun make-shared-cons-pool-stack ()
    (make-cons-pool-stack)))

Although the above stack implementation requires that stacks sharing a cons pool be created in the same lexical context, under a single with-cons-pool, thanks to the transparent design of these macros we can combine them with closures to indicate this locality however we like. Make-shared-cons-pool-stack works the same as make-cons-pool-stack except that you don't need to wrap with-cons-pool around its uses; the pool variables have already been captured. So all stacks created with make-shared-cons-pool-stack will share the same cons pool.
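For example, a cell freed by popping one shared stack becomes available for pushes onto another:

```lisp
;; Two stacks drawing on the same captured pool. The cell
;; freed by popping stack a can be reused by stack b.
(let ((a (make-shared-cons-pool-stack))
      (b (make-shared-cons-pool-stack)))
  (funcall a :push 'hello)
  (funcall b :push (funcall a :pop))
  (funcall b :pop))
```

This returns hello, having shuffled the value between two stacks through one recycled cons cell.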

WITH-DYNAMIC-CONS-POOLS
(defmacro with-dynamic-cons-pools (&rest body)
  `(locally (declare (special cons-pool
                              cons-pool-count
                              cons-pool-limit))
     ,@body))

Thanks to the duality of syntax between lexical and special variables, we can choose to use the dynamic environment to store our cons pools. The with-dynamic-cons-pools macro makes any cons pool references in its lexical scope refer to the dynamic bindings of the anaphora. One strategy is to wrap all code that uses cons pools with with-dynamic-cons-pools then, when you actually execute your program, have dynamic bindings created for the cons pool. Because you can shadow dynamic bindings with new dynamic bindings, you can preserve locality to any dynamic granularity. To create dynamic bindings, simply wrap with-dynamic-cons-pools around a with-cons-pool.
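A sketch of this pattern follows (dyn-cell is a hypothetical helper, not part of the book's code): the pool-using code is compiled under with-dynamic-cons-pools, and the bindings are later established by wrapping with-dynamic-cons-pools around a with-cons-pool, as described above:

```lisp
;; Compile pool-using code against the dynamic environment.
;; dyn-cell is a hypothetical example function.
(with-dynamic-cons-pools
  (defun dyn-cell ()
    (cons-pool-cons nil nil)))

;; Later, at run time, establish dynamic bindings for the
;; anaphora and call the code:
(with-dynamic-cons-pools
  (with-cons-pool
    (dyn-cell)))
```

Shadowing these dynamic bindings with a fresh with-cons-pool deeper in the call chain gives the finer-grained pool locality mentioned above.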

FILL-CONS-POOL
(defmacro fill-cons-pool ()
  ;; cons fresh cells directly with counting-cons and free
  ;; each into the pool until it holds cons-pool-limit cells.
  ;; (Consing via cons-pool-cons here would merely recycle
  ;; the cells already in the pool.)
  `(loop while (< cons-pool-count
                  cons-pool-limit)
         do (cons-pool-free
              (counting-cons nil nil))))

Especially when trying to reduce garbage collection execution time indeterminism, it can be necessary to ensure a cons pool has available cells so that the program will not cons at all (assuming we don't exhaust the pool). To do this, simply cons the required cells initially, when it is acceptable to cons, by filling the cons pool up to its cons-pool-limit with fill-cons-pool.
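Here is a sketch of that initialise-then-run pattern: all consing happens up front in fill-cons-pool, so the counted region allocates nothing:

```lisp
;; Pre-fill the pool while pauses are acceptable; the
;; timing-critical pushes below then come entirely from
;; the pool (50 pushes, well under cons-pool-limit).
(with-cons-pool
  (fill-cons-pool)
  (let ((stack (make-cons-pool-stack)))
    (with-conses-counted
      (loop repeat 50 do
        (funcall stack :push nil)))))
```

This form should return 0: counting-cons is never invoked inside the counted region.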

Memory is a very complicated topic and its efficiency implications depend on your hardware, your lisp implementation, and inevitable advances in technology. Unless you really know what you're doing, trying to improve on your system's memory routines will probably be more trouble than it is worth. Systems programmers have been tweaking memory for as long as there have been systems. They will certainly be doing it for a while yet. Memory management is hard—the only thing that is certain is that macros are the best tool for doing it.

Sorting Networks

There is no better tool for experimenting with efficiency, or actually implementing efficient programs, than lisp. Lisp is unique because it not only allows us to concentrate on smart algorithms and designs, but also lets us exploit these algorithms and designs to their maximal efficiency potential using state-of-the-art machine-code compilers. This section describes, from a lisp perspective, a corner of computer science that has been extensively studied but is still nowhere near exhausted: sorting. Most people consider sorting a solved problem, so you may be surprised to learn that there are still many important open questions.

We know of many excellent general-purpose sorting algorithms. Algorithms like quick sort are the most common because they efficiently sort large quantities of data. But if, instead, we wish to sort many small batches of data, general-purpose sorting algorithms like quick sort can be overkill. This section is about a solution to this problem that many people have been obsessed with for decades but which is still very fertile ground for research. Most importantly to us, this solution gives an opportunity to show advanced optimisation techniques that are straightforward in lisp but are such major undertakings in most other languages that they are hardly worth it. In this section and the next, we will re-implement a macro described by Graham in On Lisp called sortf[ON-LISP-P176]. While Graham's sortf was designed to illustrate how to write macros using generalised variables, ours is designed for speed. For certain circumstances, our sortf will achieve order-of-magnitude improvements over our system's well-tuned sort function.

This section is dedicated to my teacher and friend Alan Paeth who taught me, among many other things, that even sorting can be interesting. I also gratefully acknowledge John Gamble and his excellent Perl program, Algorithm-Networksort[ALGORITHM-NETWORKSORT]. This program was used to experiment with the different algorithms and also to generate the ASCII art networks that appear in this section.

A sorting network is an algorithm for obliviously sorting data-sets of a particular fixed size. That is, unlike most algorithms like quick sort, the operation of a sorting network does not depend on the particular data-set it is used to sort. Every step of the sort was decided when the network was designed. A sorting network is a simple list of pairs of indices into a data-set. Each of these pairs corresponds to the indices that should be used in a compare-swap operation. After performing all of these compare-swap operations in sequence, the elements will be in sorted order.

Algorithms like quick sort that are excellent for large data-sets can have unacceptable overheads for certain classes of sorting problems. First, quick sort implementations usually allow you to choose a custom comparison operator so as to make the sorting code more general. This means that every comparison will entail a function call to the comparison function instead of, say, being implemented as inline machine code. Second, because quick sort implementations are so general, they often can't take advantage of optimisations that can be made when we know our data-sets are of particular small, fixed sizes. Third, often we don't want to completely sort a data-set, but instead sort only enough of it to determine a certain element—perhaps the median element. Sorting networks that don't bother to find the complete ordering are sometimes called selection networks.

To clarify the notion of a sorting network, as well as to give an idea of how subtle and counterintuitive the topic can be, we consider some of the simplest possible networks: ones that sort three elements. Most programmers know that sorting three elements can be done easily with three comparisons and often don't bother using quick sort when there are exactly three elements. It is easy to convince yourself that these compare-swap operations can be executed in any order and the result will be the same. But it is not nearly so obvious that some of the orderings are inherently less efficient than others.

BAD-3-SN
(defvar bad-3-sn
  '((0 1) (0 2) (1 2)))

BAD-3-SN-PIC
o--^--^-----o
   |  |
o--v--|--^--o
      |  |
o-----v--v--o

The network bad-3-sn might be the most obvious implementation of a three element network but is—as the name suggests—not the optimal one. The ASCII art picture helps to visualise the algorithm described by the list-based network description in bad-3-sn. The algorithm says to compare the elements at indices 0 and 1 of a data-set and, if they are out of order, swap them into correct order. Perform the same operation for the pair of indices (0 2) and then finally for (1 2). After this process, the elements will be sorted. If we implemented this sorting network as code to sort an array of length three, call it a, then it might look like this20:

(progn
  (if (> (aref a 0) (aref a 1))
    (rotatef (aref a 0) (aref a 1)))
  (if (> (aref a 0) (aref a 2))
    (rotatef (aref a 0) (aref a 2)))
  (if (> (aref a 1) (aref a 2))
    (rotatef (aref a 1) (aref a 2))))

GOOD-3-SN
(defvar good-3-sn
  '((0 2) (0 1) (1 2)))

GOOD-3-SN-PIC
o--^--^-----o
   |  |
o--|--v--^--o
   |     |
o--v-----v--o

Bad-3-sn is correct but inefficient compared to good-3-sn. By exchanging the order of the first two comparison-swap operations, we achieve a more efficient network. On average, this network will perform fewer swap operations than will bad-3-sn. The best way to describe this is with conditional probability but because this is a book about lisp, and not sorting networks, we will shy away from this. Instead, we show that good-3-sn is better than bad-3-sn by enumerating all the permutations and then measuring the number of swaps that take place when we interpret them with the two networks. For now here is an intuitive explanation: if the long link in the network is performed first, then after this first operation at least one of the minimum or maximum elements will be in its correct, final position. So at least one of the subsequent comparison-swap operations will not perform a swap. If, however, a short link is performed first then it is possible that neither of these elements will be in their final positions and will both require future swapping.

INTERPRET-SN
(defvar tracing-interpret-sn nil)

(defun interpret-sn (data sn)
  (let ((step 0) (swaps 0))
    (dolist (i sn)
      (if tracing-interpret-sn
        (format t "Step ~a: ~a~%" step data))
      (if (> #1=(nth (car i) data)
             #2=(nth (cadr i) data))
        (progn
          (rotatef #1# #2#)
          (incf swaps)))
      (incf step))
    (values swaps data)))

To explore this phenomenon, we implement an interpreter for sorting networks, interpret-sn. This interpreter will apply a sorting network sn to a data-set represented by a list. It will return the number of swaps that were performed as the first value and the resulting sorted data-set as the second. Notice the use of the #= and ## self-referential read macros used to avoid re-typing the accessor forms. Notice also the use of a tracing variable that we can bind to a non-null value if we want to see the step-by-step sorting process. To start, consider a data-set that is already sorted. Obviously both bad-3-sn and good-3-sn perform no swapping:

* (let ((tracing-interpret-sn t))
    (interpret-sn '(1 2 3) bad-3-sn))
Step 0: (1 2 3)
Step 1: (1 2 3)
Step 2: (1 2 3)
0
(1 2 3)
* (let ((tracing-interpret-sn t))
    (interpret-sn '(1 2 3) good-3-sn))
Step 0: (1 2 3)
Step 1: (1 2 3)
Step 2: (1 2 3)
0
(1 2 3)

Next, consider a case where every element is out of sequence. Again, both sorting networks perform identically, performing the necessary two swaps:

* (let ((tracing-interpret-sn t))
    (interpret-sn '(3 1 2) bad-3-sn))
Step 0: (3 1 2)
Step 1: (1 3 2)
Step 2: (1 3 2)
2
(1 2 3)
* (let ((tracing-interpret-sn t))
    (interpret-sn '(3 1 2) good-3-sn))
Step 0: (3 1 2)
Step 1: (2 1 3)
Step 2: (1 2 3)
2
(1 2 3)

However, here is a case where bad-3-sn results in a worst-case behaviour of three swaps:

* (let ((tracing-interpret-sn t))
    (interpret-sn '(3 2 1) bad-3-sn))
Step 0: (3 2 1)
Step 1: (2 3 1)
Step 2: (1 3 2)
3
(1 2 3)
* (let ((tracing-interpret-sn t))
    (interpret-sn '(3 2 1) good-3-sn))
Step 0: (3 2 1)
Step 1: (1 2 3)
Step 2: (1 2 3)
1
(1 2 3)

In the above, bad-3-sn performed three swaps where the optimal good-3-sn performed only one. Shouldn't there be a symmetric case where good-3-sn performs poorly? It turns out no, good-3-sn is really just better. If you still don't believe this, investigate the Monty Hall problem to get a feel for just how counterintuitive these sorts of problems can be. So it seems the moral is to always swap elements into their correct positions as soon as possible so that the least amount of swaps will take place.

ALL-SN-PERMS
(defun all-sn-perms (n)
  (let (perms curr)
    (funcall
      (alambda (left)
        (if left
          (loop for i from 0 to (1- (length left)) do
            (push (nth i left) curr)
            (self (append (subseq left 0 i)
                          (subseq left (1+ i))))
            (pop curr)) 
          (push curr perms)))
      (loop for i from 1 to n collect i))
    perms))

To quantify how much better good-3-sn is than bad-3-sn we present a utility all-sn-perms that generates all permutations of the numbers from 1 to n. All-sn-perms embodies many lisp patterns, including recursively consing up a network of connected, temporary lists, and the use of Graham's anaphoric alambda macro. Here, we generate all 6 (factorial of 3) permutations of the numbers 1 to 3:

* (all-sn-perms 3)

((1 2 3) (2 1 3) (1 3 2)
 (3 1 2) (2 3 1) (3 2 1))

Notice that because of how all-sn-perms is written, the above lists share structure with one another so when using them for interpreting sorting networks (a destructive operation) we should always make sure to sort copies of them, as in average-swaps-calc. For problems with results that can be structured this way, sharing structure like this is generally a good programming technique because it can reduce the total memory required for your data-structures21.

AVERAGE-SWAPS-CALC
(defun average-swaps-calc (n sn)
  (/ (loop for i in (all-sn-perms n) sum
       (interpret-sn (copy-list i) sn))
     (fact n)))
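The fact function used above is the standard factorial, defined in this book's supporting code; a minimal version, in case it is not already loaded:

```lisp
;; Minimal recursive factorial, assumed by average-swaps-calc.
(defun fact (n)
  (if (zerop n)
    1
    (* n (fact (1- n)))))
```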

Using our interpret-sn sorting network interpreter, we can tally the actual number of swaps recorded for each possible permutation with average-swaps-calc. This function simply loops across each permutation, applying the interpreter to the given sorting network, summing the swaps that occurred, and then returning this sum divided by the number of possible permutations. If we assume that every permutation is equally likely, this calculation represents the average number of swaps performed per sort. Here we see that bad-3-sn results in, on average, 1.5 swaps per sort:

* (average-swaps-calc 3 bad-3-sn)

3/2

Where good-3-sn results in, on average, only 1.166... swaps:

* (average-swaps-calc 3 good-3-sn)

7/6

Up until now, our sorting networks have been only able to sort data-sets of size three. Are there algorithms for generating sorting networks of arbitrary size? Yes, and these algorithms have been known for some time. In 1968, Ken Batcher described his ingenious algorithm[SN-APPLICATIONS] named by Donald Knuth as merge exchange sort or algorithm 5.2.2M from [TAOCP-VOL3-P111]. Batcher's algorithm is kind of a combination of shell sort and merge sort except that given a known input size, the comparison-swap operations it will make are completely determined independent of the data itself—exactly what we need for sorting networks. So to create a sorting network we run Batcher's algorithm and record what comparison-swap operations were made. Later we can inline these operations into a function for this particular input size. This process is not completely unlike loop unrolling, except that lisp allows us to take it much further.

BUILD-BATCHER-SN
(defun build-batcher-sn (n)
  (let* (network
         (tee (ceiling (log n 2)))
         (p (ash 1 (- tee 1))))
    (loop while (> p 0) do
      (let ((q (ash 1 (- tee 1)))
            (r 0)
            (d p))
        (loop while (> d 0) do
          (loop for i from 0 to (- n d 1) do
            (if (= (logand i p) r)
              (push (list i (+ i d))
                    network)))
          (setf d (- q p)
                q (ash q -1)
                r p)))
      (setf p (ash p -1)))
    (nreverse network)))

Build-batcher-sn is a lisp implementation of Batcher's algorithm directly transcribed from Knuth's description. Thanks to lisp's arbitrary precision support for bit-wise integer operations, this implementation doesn't suffer any artificial size limitations at n of, say, 32 or 64. We can use build-batcher-sn to easily construct efficient sorting networks of any size. Here is its construction for a network of size three—the same as good-3-sn above:

* (build-batcher-sn 3)

((0 2) (0 1) (1 2))

And here is its construction for a network of size seven:

* (build-batcher-sn 7)

((0 4) (1 5) (2 6) (0 2) (1 3) (4 6) (2 4)
 (3 5) (0 1) (2 3) (4 5) (1 4) (3 6) (1 2)
 (3 4) (5 6))

BATCHER-7-SN-PIC
o--^--------^-----^-----------------o
   |        |     |
o--|--^-----|--^--v--------^--^-----o
   |  |     |  |           |  |
o--|--|--^--v--|--^-----^--|--v-----o
   |  |  |     |  |     |  |
o--|--|--|-----v--|--^--v--|--^--^--o
   |  |  |        |  |     |  |  |
o--v--|--|--^-----v--|--^--v--|--v--o
      |  |  |        |  |     |
o-----v--|--|--------v--v-----|--^--o
         |  |                 |  |
o--------v--v-----------------v--v--o

Batcher's networks are good but are known to be slightly sub-optimal for most network sizes. While better networks for many particular sizes have been discovered, how to find these better networks, as well as whether they are optimal or not, is an important unsolved problem. This area of research has had important advances made by evolutionary algorithms that use novel artificial intelligence techniques to effectively search the super-exponential spaces of sorting network problems. For instance, the best network of size thirteen currently known was discovered by the Evolving Non-Determinism algorithm[END].

The ASCII art representations of sorting networks shown here were created by John Gamble's excellent Algorithm-Networksort Perl program. Notice that the picture puts some links that can be performed in parallel together into the same vertical column. This shows that sorting networks are algorithms that can, at least in specialised hardware, benefit from parallelism in the comparison-swap operations. Discovering how to create good parallel sorting networks, along with just how parallel we can make them, also remain important, mostly open problems.

PRUNE-SN-FOR-MEDIAN
(defun prune-sn-for-median (elems network)
  (let ((mid (floor elems 2)))
    (nreverse
      (if (evenp elems)
        (prune-sn-for-median-aux
          (reverse network)
          (list (1- mid) mid))
        (prune-sn-for-median-aux
          (reverse network)
          (list mid))))))

(defun prune-sn-for-median-aux (network contam)
  (if network
    (if (intersection (car network) contam)
      (cons (car network)
            (prune-sn-for-median-aux
              (cdr network)
              (remove-duplicates
                (append (car network) contam))))
      (prune-sn-for-median-aux
        (cdr network) contam))))

Above we mentioned one disadvantage of general-purpose sorting functions is that they are hard-coded into performing the entire sort operation. If we like, we can sort the data-set just enough to know for certain that one element is in its final position. Typically, the element we are interested in is the middle, or median element. The functions prune-sn-for-median and prune-sn-for-median-aux employ a modest, mostly obvious algorithm I have discovered that can eliminate many of the unnecessary comparison-swap operations so as to construct arbitrary selection networks.

The algorithm starts with a Batcher network then works backwards, keeping track of contaminated elements—elements that can't have any of their existing links removed because doing so would change the outcome of the network for that element. Any links that connect uncontaminated elements can be removed without changing the outcome for the contaminated elements. Every link that connects a contaminated element to an uncontaminated element contaminates the uncontaminated element. When we contaminate just the middle element (or the two middle elements in the case of even input sizes) we create a median selection network.

HOYTE-7-MEDIAN-SN-PIC
o--^--------^-----^-----------------o
   |        |     |
o--|--^-----|--^--v--------^--------o
   |  |     |  |           |
o--|--|--^--v--|--^-----^--|--------o
   |  |  |     |  |     |  |
o--|--|--|-----v--|--^--v--|--^--^--o
   |  |  |        |  |     |  |  |
o--v--|--|--^-----v--|--^--v--|--v--o
      |  |  |        |  |     |
o-----v--|--|--------v--v-----|-----o
         |  |                 |
o--------v--v-----------------v-----o

The algorithm's output for size 7, a modified Batcher network with two of its links removed, is shown. After running this network, the median element will be in the correct position, but other elements are not guaranteed to be sorted. As an example, here we only sort the list enough to discover that 4 is the median element:

* (interpret-sn
    '(4 2 3 7 6 1 5)
    (prune-sn-for-median
      7 (build-batcher-sn 7)))

6
(1 3 2 4 5 7 6)

PRUNE-SN-FOR-MEDIAN-CALC
(defun prune-sn-for-median-calc (n)
  (loop for i from 2 to n collect
    (let* ((sn (build-batcher-sn i))
           (snp (prune-sn-for-median i sn)))
      (list i
           (length sn)
           (length snp)))))

For networks of size seven, our modified median Batcher networks perform 14 comparison-swap operations versus the 16 operations performed by the regular Batcher network. Prune-sn-for-median-calc gives us the data on this class of networks for a range of network sizes. It computes the Batcher network for each size up to n and pairs its length with the length of the associated median network created by our algorithm.

PRUNED-MEDIAN-DATA
* (prune-sn-for-median-calc 49)

((2 1 1) (3 3 3) (4 5 5) (5 9 8) (6 12 12)
 (7 16 14) (8 19 17) (9 26 22) (10 31 29)
 (11 37 31) (12 41 35) (13 48 40) (14 53 47)
 (15 59 49) (16 63 53) (17 74 61) (18 82 72)
 (19 91 75) (20 97 81) (21 107 88) (22 114 98)
 (23 122 100) (24 127 105) (25 138 113) (26 146 124)
 (27 155 127) (28 161 133) (29 171 140) (30 178 150)
 (31 186 152) (32 191 157) (33 207 169) (34 219 185)
 (35 232 190) (36 241 199) (37 255 209) (38 265 223)
 (39 276 226) (40 283 233) (41 298 244) (42 309 259)
 (43 321 263) (44 329 271) (45 342 280) (46 351 293)
 (47 361 295) (48 367 301) (49 383 313))

PAETH-9-MEDIAN-SN
(defvar paeth-9-median-sn
  '((0 3) (1 4) (2 5) (0 1) (0 2) (4 5) (3 5) (1 2)
    (3 4) (1 3) (1 6) (4 6) (2 6) (2 3) (4 7) (2 4)
    (3 7) (4 8) (3 8) (3 4)))

PAETH-25-MEDIAN-SN
(defvar paeth-25-median-sn
  '((0 1) (3 4) (2 4) (2 3) (6 7) (5 7) (5 6) (9 10)
    (8 10) (8 9) (12 13) (11 13) (11 12) (15 16)
    (14 16) (14 15) (18 19) (17 19) (17 18) (21 22)
    (20 22) (20 21) (23 24) (2 5) (3 6) (0 6) (0 3)
    (4 7) (1 7) (1 4) (11 14) (8 14) (8 11) (12 15)
    (9 15) (9 12) (13 16) (10 16) (10 13) (20 23)
    (17 23) (17 20) (21 24) (18 24) (18 21) (19 22)
    (8 17) (9 18) (0 18) (0 9) (10 19) (1 19) (1 10)
    (11 20) (2 20) (2 11) (12 21) (3 21) (3 12)
    (13 22) (4 22) (4 13) (14 23) (5 23) (5 14)
    (15 24) (6 24) (6 15) (7 16) (7 19) (13 21)
    (15 23) (7 13) (7 15) (1 9) (3 11) (5 17) (11 17)
    (9 17) (4 10) (6 12) (7 14) (4 6) (4 7) (12 14)
    (10 14) (6 7) (10 12) (6 10) (6 17) (12 17)
    (7 17) (7 10) (12 18) (7 12) (10 18) (12 20)
    (10 20) (10 12)))

Network sizes up to 49 are calculated. Notice that at the smallest sizes, very few, if any, operations are saved. But for slightly larger numbers, we begin to save roughly 20% of the comparison-swap operations from being performed. When we are only concerned with medians, these networks are good choices. However, the construction of optimal median sorting networks is also an open area of research. The modified Batcher networks developed in this chapter are decent but still far from optimal. The best median selection networks for sizes 9 and 25 (3x3 and 5x5 image kernel sizes) currently known were discovered by Paeth[GRAPHICS-GEMS-P171-175] and are presented here and included with this book's code. Here are the lengths for Paeth's median networks:

* (length paeth-9-median-sn)

20
* (length paeth-25-median-sn)

99

For networks of size 9, Batcher's full sorting network performs 26 operations. The best currently known was discovered by Floyd and is 25 operations. Our pruned median version of Batcher's network performs 22, and Paeth's median network 20. For networks of size 25, Batcher: 138, pruned: 113, Paeth: 99. So our median networks seem about 10% away from Paeth's, the currently best known median networks for these sizes. As expected, we can't prune any extra operations off Paeth's networks:

* (length (prune-sn-for-median
            9 paeth-9-median-sn))

20
* (length (prune-sn-for-median
            25 paeth-25-median-sn))

99

In theory, this is all very interesting. But in practice, theory is pretty boring. We developed all these list-based sorting networks, some that perform full sorts using Batcher's algorithm, and some that employ a contamination optimisation to Batcher's algorithm to find medians. Then we developed a toy interpreter for these networks that will no doubt perform horribly when compared to real sorting routines. What does any of this have to do with efficiency? Were our experiments just generating theoretical results instead of useful code22? In most languages the results of these experiments—our sorting networks—would be represented in some high-level data structure and not really be good for much. But in lisp these networks are already extremely efficient sorting programs; we just haven't written a compiler for them yet.
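To preview the idea, here is a hedged sketch of such a compiler (the macro name and calling convention are mine, not what the book develops next): a macro can run build-batcher-sn at macroexpansion time and splice the resulting compare-swap operations straight into the code, exactly like the hand-written progn for bad-3-sn above:

```lisp
;; Sketch of a sorting network compiler. n must be a literal
;; integer so the network can be built at expansion time; v
;; should be a variable holding a vector of at least n elements.
(defmacro inline-sn-sort (v n)
  `(progn
     ,@(mapcar
         (lambda (pair)
           `(if (> (aref ,v ,(car pair))
                   (aref ,v ,(cadr pair)))
              (rotatef (aref ,v ,(car pair))
                       (aref ,v ,(cadr pair)))))
         (build-batcher-sn n))))

;; (let ((a (vector 3 1 2)))
;;   (inline-sn-sort a 3)
;;   a) ; => #(1 2 3)
```

The interpreter's dolist, nth lookups, and trace checks all vanish: the expansion is nothing but inline comparisons and swaps, ready for the compiler to optimise.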

Exercise: Adapt the pruning algorithm (and its contamination approach) so it produces quartile selection networks. These are networks where not only the median is determined but also the median elements of the high and low halves of ordered elements.

Writing and Benchmarking Compilers

A compiler is a scary concept to most programmers because most languages are bad for writing them. Here is an analogy: parsing a complicated log file might be an intimidating, error-prone prospect to a programmer who only knows C or assembly, but thanks to Perl and regular expressions it is a non-issue to us multi-lingual programmers. Similarly, designing a powerful, expressive programming language and then creating a compiler to convert programs in this language into efficient machine code would be an intimidating task if we didn't know lisp. Lisp's advantage when it comes to writing compilers doesn't just make it a little bit better than other languages—it actually opens up a whole new level of expression. In general this advantage is the difference between being able to do something and not being able to do it at all. Lisp programmers use compilers everywhere, and in ways and for tasks that non-lisp programmers sometimes don't even believe possible. How many C programmers have considered the interpretation overhead of the printf function described (and overcome) in section 7.2, Macros Make Lisp Fast? How many would try writing a compiler for printf? In lisp this is par for the course. Everything should compile down to lisp.

What is a compiler? If you're coming from Blub, the answer is probably buried in a large stack of books explaining parsing, syntax directed translation, context free grammars, etc. But don't worry, this is lisp and compilers are easy. So easy that if you have ever done any amount of serious lisp programming you have already written them, perhaps without even realising it. Another name for a compiler is a "macro". A macro compiles programs from one language to another. Lisp is actually all about writing these compilers—everything else is secondary. In lisp the only non-trivial aspect of compiler design is how to preserve the correctness of the target program while at the same time discovering efficient expansions for it—in other words, the essence of your compilation problem. So far, we've seen how to use macros to create custom languages fitted exactly to the task at hand, and also how to make lisp code efficient by using declarations to remove dualities and safety checks. Effective compiler writing is merely combining these two skills.
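As a toy illustration of a macro acting as a compiler, consider compiling a tiny reverse-polish language into lisp. This example is hypothetical, not part of the chapter's code:

```lisp
;; A macro is a compiler: rpn translates a reverse-polish token
;; stream into an ordinary lisp form at macroexpansion-time.
(defmacro rpn (&rest tokens)
  (let (stack)
    (dolist (tok tokens (car stack))
      (if (numberp tok)
        (push tok stack)                   ; operand: push it
        (let ((b (pop stack))              ; operator: pop two
              (a (pop stack)))
          (push (list tok a b) stack)))))) ; emit ordinary lisp code

;; (macroexpand '(rpn 3 4 + 5 *)) => (* (+ 3 4) 5)
;; (rpn 3 4 + 5 *)                => 35
```

No run-time interpretation overhead remains: a compiled caller of rpn sees only the form (* (+ 3 4) 5).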

SN-TO-LAMBDA-FORM-1
(defun sn-to-lambda-form% (sn)
  `(lambda (arr)
     #f
     (declare (type (simple-array fixnum) arr))
     ,@(mapcar
         #`(if (> #1=(aref arr ,(car a1))
                  #2=(aref arr ,(cadr a1)))
             (rotatef #1# #2#))
         sn)
     arr))

When we were creating a compiler macro to handle format in section 7.2, Macros Make Lisp Fast, what did the formatter compiler expand into? It was a lambda form23. Compiling to lambda forms sometimes makes sense because we can then use the compile function to convert them to machine code directly. Returning to our sorting networks from the previous chapter, sn-to-lambda-form% is a function that returns a lambda form. This lambda form will have an instruction for every comparison-swap operation in a list-based sorting network. Each instruction will (unsafely) index into a fixnum array, comparing and possibly using rotatef to swap elements. The fixnum array will be passed as an argument (arr) to the function created by this lambda form. That's about all there is to a decent machine code compiler. As with all lambda forms, thanks to the lambda macro we are able to evaluate them to get functions:

* (eval
    (sn-to-lambda-form%
      (build-batcher-sn 3)))

#<Interpreted Function>

Which can become compiled functions simply by calling compile on them:

* (compile nil *)

#<Function>

Let's look at the disassemble output (the compiled expansion):

* (disassemble *)
...
;;; (> (AREF ARR 0) (AREF ARR 2))
      9E:       MOV     EAX, [EDX+1]
      A1:       MOV     ECX, [EDX+9]
      A4:       CMP     EAX, ECX
      A6:       JLE     L0
;;; (ROTATEF (AREF ARR 0) (AREF ARR 2))
      A8:       MOV     EAX, [EDX+9]
      AB:       MOV     ECX, [EDX+1]
      AE:       MOV     [EDX+1], EAX
      B1:       MOV     [EDX+9], ECX
;;; (> (AREF ARR 0) (AREF ARR 1))
      B4: L0:   MOV     EAX, [EDX+1]
      B7:       MOV     ECX, [EDX+5]
      BA:       CMP     EAX, ECX
      BC:       JLE     L1
;;; (ROTATEF (AREF ARR 0) (AREF ARR 1))
      BE:       MOV     EAX, [EDX+5]
      C1:       MOV     ECX, [EDX+1]
      C4:       MOV     [EDX+1], EAX
      C7:       MOV     [EDX+5], ECX
;;; (> (AREF ARR 1) (AREF ARR 2))
      CA: L1:   MOV     EAX, [EDX+5]
      CD:       MOV     ECX, [EDX+9]
      D0:       CMP     EAX, ECX
      D2:       JLE     L2
;;; (ROTATEF (AREF ARR 1) (AREF ARR 2))
      D4:       MOV     EAX, [EDX+9]
      D7:       MOV     ECX, [EDX+5]
      DA:       MOV     [EDX+5], EAX
      DD:       MOV     [EDX+9], ECX
      E0: L2:   ...

The above machine code is fast, but it could be faster. Lisp compilers are smart—some of the smartest around—but they could always be smarter. In the rare instances where we care about performance, examining the compiled expansions is critical because it's hard to know how smart your lisp implementation is. In the above assembly, if we look carefully, we will see that it is actually performing an unnecessary read operation every time it performs a swap. The problem is that rotatef expands into a redundant access. A sufficiently smart compiler could have figured out that we already have this value in a register and could have avoided the array access. But mine didn't, so I re-structured the code so that it resulted in a more efficient expansion.

SN-TO-LAMBDA-FORM
(defun sn-to-lambda-form (sn)
  `(lambda (arr)
     #f
     (declare (type (simple-array fixnum) arr))
     ,@(mapcar
         #`(let ((a #1=(aref arr ,(car a1)))
                 (b #2=(aref arr ,(cadr a1))))
             (if (> a b)
               (setf #1# b
                     #2# a)))
         sn)
     arr))

Sn-to-lambda-form is the improved version of sn-to-lambda-form%. It creates temporary bindings for the values read from the array so that the array read instructions aren't re-performed during the swap operation. Here is the superior compiled expansion:

* (disassemble
    (compile nil
      (sn-to-lambda-form
        (build-batcher-sn 3))))
...
;;; (LET ((A (AREF ARR 0)) (B (AREF ARR 2))) ...)
      2E:       MOV     EAX, [EDX+1]
      31:       MOV     ECX, [EDX+9]
      34:       CMP     EAX, ECX
      36:       JLE     L0
;;; (SETF (AREF ARR 0) B (AREF ARR 2) A)
      38:       MOV     [EDX+1], ECX
      3B:       MOV     [EDX+9], EAX
;;; (LET ((A (AREF ARR 0)) (B (AREF ARR 1))) ...)
      3E: L0:   MOV     EAX, [EDX+1]
      41:       MOV     ECX, [EDX+5]
      44:       CMP     EAX, ECX
      46:       JLE     L1
;;; (SETF (AREF ARR 0) B (AREF ARR 1) A)
      48:       MOV     [EDX+1], ECX
      4B:       MOV     [EDX+5], EAX
;;; (LET ((A (AREF ARR 1)) (B (AREF ARR 2))) ...)
      4E: L1:   MOV     EAX, [EDX+5]
      51:       MOV     ECX, [EDX+9]
      54:       CMP     EAX, ECX
      56:       JLE     L2
;;; (SETF (AREF ARR 1) B (AREF ARR 2) A)
      58:       MOV     [EDX+5], ECX
      5B:       MOV     [EDX+9], EAX
      5E: L2:   ...

Becoming familiar with your lisp compiler, so that you know how efficient your macro expansions are, is very important for writing efficient lisp. Disassemble, the source code to your lisp system, benchmarking tools like the time macro, and lots of patience are, unfortunately, the only way to really gain an intuition for how to write fast lisp code.

Expanding into a lambda form like our sn-to-lambda-form function does might be the most obvious way to implement a compiler if you come from Blub languages. The source code to lambda form to compile to disassemble cycle feels very much like the edit, compile, disassemble cycle of Blub. You put source code in and get machine code out. However, there is a more lispy approach. In lisp we generally create our compilers to be invisible—incorporated directly into other lisp programs. Ideally the compile function is never invoked until we want stuff to run fast. Macros shouldn't bother compiling all the way down to machine code, instead only enough to create a good expansion so that the compiler, whenever it is run, will have enough information available to make the complete program efficient.

We especially don't want to call compile at run-time. Compile is an expensive operation because of the many levels of macros that need to be expanded to compile something. Instead of calling compile at run-time, remember that lisp will have already compiled all the lambda forms inside a compiled function. Because of this ability to construct closures of already compiled code at run-time, it is easy to ensure that as much computation at compile-time is done as possible while still creating arbitrary functions (closures) at run-time.
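A hypothetical sketch of this point: when the following defun is compiled, the inner lambda is compiled along with it, so make-scaler can manufacture new functions at run-time without ever invoking compile:

```lisp
(defun make-scaler (k)
  ;; This lambda was compiled when make-scaler was compiled; only
  ;; its environment (the binding of k) is created at run-time.
  (lambda (x) (* k x)))

;; (funcall (make-scaler 3) 7) => 21
```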

SORTF
(defmacro! sortf (comparator &rest places)
  (if places
    `(tagbody
       ,@(mapcar
           #`(let ((,g!a #1=,(nth (car a1) places))
                   (,g!b #2=,(nth (cadr a1) places)))
               (if (,comparator ,g!b ,g!a)
                 (setf #1# ,g!b
                       #2# ,g!a)))
           (build-batcher-sn (length places))))))

Sortf is my favourite macro in this book. Not only is it concise, elegant, and an excellent demonstration of many of the macro techniques so far described, but it is also a useful piece of production code capable of performing extremely fast sorting operations. Best yet, this macro blends so nicely with lisp programs that it is effortless to use. We don't have to go far out of our way to benefit from this advanced lisp optimisation. Any time we need small, fixed-size data-sets sorted, this macro is trivial to incorporate, sometimes even easier than the sort function. Instead of expanding into a lambda form, sortf expands into a tagbody form because tagbody is the canonical progn that returns nil. Here is an expansion of sortf:

* (macroexpand
    '(sortf < a b c))

(LET ()
  (TAGBODY
    (LET ((#:A1824 A) (#:B1823 C))
      (IF (< #:B1823 #:A1824)
        (SETF A #:B1823 C #:A1824)))
    (LET ((#:A1824 A) (#:B1823 B))
      (IF (< #:B1823 #:A1824)
        (SETF A #:B1823 B #:A1824)))
    (LET ((#:A1824 B) (#:B1823 C))
      (IF (< #:B1823 #:A1824)
        (SETF B #:B1823 C #:A1824)))))
T

The interface design of sortf is from On Lisp, but it is so natural that almost every lisp programmer would implement it this way. The first argument is the comparison operator, usually a symbol such as <. This symbol usually names a function but, as Graham points out, could also name a macro or special form since it is spliced directly into the function position of a list. We can even pass a lambda form since those are also allowed in the function position of a list24:

* (let ((a -3) (b 2))
    (sortf (lambda (a b) (< (abs a) (abs b)))
      a b)
    (list a b))

(2 -3)

Also like Graham's macro, the arguments to be sorted are generalised variables. That means that we can use sortf to sort any sort of variables, not only those represented by symbols, but anything that can be setf. Here is an example:

* (let ((a 2) (b '(4)) (c #(3 1)))
    (sortf < a (car b) (aref c 0) (aref c 1))
    (format t "a=~a b=~a c=~a~%" a b c))

a=1 b=(2) c=#(3 4)
NIL

While Graham's sortf and our sortf compile the same source language, their expansions couldn't be more different. Graham's macro is arguably more correct than ours because it will only execute the code to access the places once. With Graham's sortf we can pass in variables with side-effects and have them evaluated only once as might be expected. For example, Graham's sortf will only increment i once when given the place (aref arr (incf i)). Graham's sortf works by copying every variable to be sorted into a temporary binding, using a bubble sort to sort these temporary bindings, then using setf expansions25 to write the temporary variables back to the original places, now in sorted order. Instead, our sortf will evaluate each place form many times throughout the sort, so you are advised not to use places that have side-effects. Another consequence of this design is that if you are after efficiency, make sure your accessors are efficient. In particular, do not use long list accessors like caddr because they will end up traversing the list many times. With our implementation, we sort the arguments in-place, that is, without any temporary bindings. Instead of bubble sort, which has a Big-O complexity of (O (expt N 2))26, we use Batcher's far better merge exchange sort with its (O (* N (expt (log N) 2))). There are methods for constructing sorting networks of (O (* N (log N)))—the same as quick sort—but most use more operations for small network sizes than do Batcher's.
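To see the multiple-evaluation difference concretely, here is a hypothetical (and deliberately broken) use of a side-effecting place with our sortf:

```lisp
;; Don't do this: because our sortf splices each place form into
;; its expansion verbatim, (incf i) is re-evaluated at every
;; comparison-swap, indexing a different element each time.
;; Graham's sortf would evaluate each place exactly once.
(let ((arr (vector 3 1 2))
      (i -1))
  (sortf < (aref arr (incf i))
           (aref arr (incf i))
           (aref arr (incf i))))
```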

You will probably want to add a sharp-f fast declaration around the code where you call sortf because sortf will not add one itself. If you want really fast sorts, ensure that the compiler knows the types of all your generalised variables being sorted. If you do specify types, always ensure that all the generalised variables are declared to be the same type. This is necessary because any element could potentially end up at any place.

But how do we know if this macro actually gives us any performance advantage over the sort function? We need to benchmark it. Benchmarking is discussed a lot because, especially to programmers, the timeless hobby of bullshitting is so enjoyable. Unfortunately, almost all benchmark results are useless. You are even advised to take the benchmark results from this book with a heavy grain of salt. That said, carefully crafted, controlled experiments that meter very slightly different versions of code running on the same lisp image on the same machine can sometimes be invaluable for understanding and fixing performance bottlenecks. Such metering is useful because not only can we tell which techniques are more efficient, but we can also tell just how much more efficient they are. Because they write code for us, macros are the best tool for setting up these experiments.

SORT-BENCHMARK-TIME
(defmacro sort-benchmark-time ()
  `(progn
     (setq sorter (compile nil sorter))
     (let ((arr (make-array
                n :element-type 'fixnum)))
       (time
         (loop for i from 1 to iters do
           (loop for j from 0 to (1- n) do
             (setf (aref arr j) (random n)))
           (funcall sorter arr))))))

The sort-benchmark-time macro is a component in our experiment. It expands into code that assumes either a lambda form or a function is bound to sorter and that this function will sort a fixnum array of size n. It then compiles this into a function and uses it to sort a randomly generated array for iters iterations. The time macro is used to collect statistics on how long the sorting procedure takes.

DO-SORT-BENCHMARK
(defun do-sort-benchmark (n iters)
  (let ((rs (make-random-state *random-state*)))
    (format t "CL sort:~%")
    (let ((sorter
            '(lambda (arr)
                #f
                (declare (type (simple-array fixnum)
                               arr))
                (sort arr #'<))))
      (sort-benchmark-time))

    (setf *random-state* rs)
    (format t "sortf:~%")
    (let ((sorter
            `(lambda (arr)
                #f
                (declare (type (simple-array fixnum)
                               arr))
                (sortf <
                  ,@(loop for i from 0 to (1- n)
                          collect `(aref arr ,i)))
                arr)))
      (sort-benchmark-time))))

(compile 'do-sort-benchmark)

Do-sort-benchmark is our actual interface for performing the benchmarking. Given a data-set size n and an iteration count iters, it will meter both the COMMON LISP sort function and our sortf macro. It preserves the state of the random number generator and resets it after performing the sort measurement but before running the sortf one so that the random arrays to be sorted are identical. It is important that do-sort-benchmark is compiled when run so that there is the least noise possible in our measurements.
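The random-state technique is worth remembering on its own. In isolation, it looks like this sketch (run-experiment-a and run-experiment-b are hypothetical names):

```lisp
;; Snapshot the random generator, run one experiment, rewind,
;; then run the other: both see the identical random sequence.
(let ((rs (make-random-state *random-state*)))
  (run-experiment-a)
  (setf *random-state* rs)
  (run-experiment-b))
```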

When run, do-sort-benchmark tells us not only that sortf is efficient, but also that general-purpose sorting algorithms are not even in the same ball-park as sorting networks with respect to performance for small, fixed-size data-sets. We also notice that sortf doesn't cons, which, in turn, results in less garbage collection run-time, contributing even greater gains in performance. Here are the results for data-sets of sizes 2, 3, 6, 9, 25, and 49:

* (do-sort-benchmark 2 1000000)

CL sort:
; Evaluation took:
;   1.65 seconds of real time
;   8,000,064 bytes consed.

sortf:
; Evaluation took:
;   0.36 seconds of real time
;   0 bytes consed.

* (do-sort-benchmark 3 1000000)

CL sort:
; Evaluation took:
;   3.65 seconds of real time
;   24,000,128 bytes consed.

sortf:
; Evaluation took:
;   0.46 seconds of real time
;   0 bytes consed.

* (do-sort-benchmark 6 1000000)

CL sort:
; Evaluation took:
;   10.37 seconds of real time
;   124,186,832 bytes consed.

sortf:
; Evaluation took:
;   0.8 seconds of real time
;   0 bytes consed.

* (do-sort-benchmark 9 1000000)

CL sort:
; Evaluation took:
;   19.45 seconds of real time
;   265,748,544 bytes consed.

sortf:
; Evaluation took:
;   1.17 seconds of real time
;   0 bytes consed.

* (do-sort-benchmark 25 1000000)

CL sort:
; Evaluation took:
;   79.53 seconds of real time
;   1,279,755,832 bytes consed.

sortf:
; Evaluation took:
;   3.41 seconds of real time
;   0 bytes consed.

* (do-sort-benchmark 49 1000000)

CL sort:
; Evaluation took:
;   183.16 seconds of real time
;   3,245,024,984 bytes consed.

sortf:
; Evaluation took:
;   8.11 seconds of real time
;   0 bytes consed.

So for certain tasks, order-of-magnitude or better improvements to our system sorting routines are possible with sorting networks. These measurements are not intended to make our sort implementation look bad (it is actually an excellent sorting routine), but rather to show a realistic example of how smart programming with macros can result in substantial efficiency gains. Lisp macros let us easily and portably program smart. Blub languages take so much effort to program smart that Blub programmers almost always settle for programming dumb. In lisp, everything compiles down to lisp, so there are never any barriers to what you can optimise. If anything is ever unacceptably slow, change it and make it faster. We almost never need things to run fast, but when we do, lisp is the solution.

MEDIANF
(defun medianf-get-best-sn (n)
  (case n
    ((0)  (error "Need more places for medianf"))
    ((9)  paeth-9-median-sn)
    ((25) paeth-25-median-sn)
    (t    (prune-sn-for-median n
            (build-batcher-sn n)))))

(defmacro! medianf (&rest places)
  `(progn
     ,@(mapcar
         #`(let ((,g!a #1=,(nth (car a1) places))
                 (,g!b #2=,(nth (cadr a1) places)))
             (if (< ,g!b ,g!a)
               (setf #1# ,g!b
                     #2# ,g!a)))
         (medianf-get-best-sn (length places)))
     ,(nth (floor (1- (length places)) 2) ; lower
           places)))

Another macro similar to sortf is medianf, which uses our pruned median selection networks or Paeth's hand-crafted median networks to sort the places just enough to ensure that the median element is in its correct position. In the case of even network sizes, both the lower and upper median will be in the correct place. Unlike sortf, which always returns nil, medianf will return the value of the lower median (which is the same as the upper median for odd network sizes).
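A usage sketch, assuming this chapter's code is loaded. For three places the pruned network performs a full sort, so the median lands in the middle place and is returned:

```lisp
* (let ((a 3) (b 1) (c 2))
    (medianf a b c))

2
```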

As we said earlier, sortf and medianf sort any kind of place you can setf. In the case of variables stored in registers, this gives lisp the opportunity to produce sorting code which doesn't even access memory. For example, here is the compiled expansion for medianf on three fixnum places:

* (dis ((fixnum a) (fixnum b) (fixnum c))
    #f
    (medianf a b c))
...
;;; (MEDIANF A B C)
      34:       MOV     EBX, EAX
      36:       CMP     EDX, EAX
      38:       JL      L4
      3A: L0:   MOV     EBX, EAX
      3C:       CMP     ECX, EAX
      3E:       JL      L3
      40: L1:   MOV     EAX, ECX
      42:       CMP     EDX, ECX
      44:       JNL     L2
      46:       MOV     ECX, EDX
      48:       MOV     EDX, EAX
      4A: L2:
...
      5B: L3:   MOV     EAX, ECX
      5D:       MOV     ECX, EBX
      5F:       JMP     L1
      61: L4:   MOV     EAX, EDX
      63:       MOV     EDX, EBX
      65:       JMP     L0

Lisp has more potential for efficient code than any other language and it's all thanks to macros. Because they are so good at creating controlled metering experiments, macros are also the solution to determining which techniques produce more efficient outcomes. Compilers are programs that write programs and macros are the best way to do that.

All material is (C) Doug Hoyte unless otherwise noted or implied. All rights reserved.