19 Performance
Alan Perlis famously quipped “Lisp programmers know the value of everything and the cost of nothing.” A Racket programmer knows, for example, that a lambda anywhere in a program produces a value that is closed over its lexical environment, but how much does allocating that value cost?
In this chapter, we narrow the gap by explaining details of the Racket compiler and runtime system and how they affect the runtime and memory performance of Racket code.
19.1 Performance in DrRacket
By default, DrRacket instruments programs for debugging, and debugging instrumentation (provided by the Errortrace: Debugging and Profiling library) can significantly degrade performance for some programs. Even when debugging is disabled through the Choose Language... dialog’s Show Details panel, the Preserve stacktrace checkbox is checked by default, which also affects performance. Disabling debugging and stacktrace preservation provides performance results that are more consistent with running in plain racket.
Even so, DrRacket and programs developed within DrRacket use the same Racket virtual machine, so garbage collection times (see Memory Management) may be longer in DrRacket than when a program is run by itself, and DrRacket threads may impede execution of program threads. For the most reliable timing results for a program, run in plain racket instead of in the DrRacket development environment. Non-interactive mode should be used instead of the REPL to benefit from the module system. See Modules and Performance for details.
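As a minimal sketch of timing outside DrRacket, the module below could be saved to a file and run with plain racket. The workload sum-to is a hypothetical stand-in for real code; time-apply comes from racket/base and reports CPU, real, and GC times in milliseconds.

```racket
#lang racket/base

;; Hypothetical workload: sum the integers from 0 up to (but excluding) n.
(define (sum-to n)
  (for/fold ([acc 0]) ([i (in-range n)])
    (+ acc i)))

;; time-apply returns four values: a list of the procedure's results,
;; then CPU time, real time, and GC time, all in milliseconds.
(define-values (results cpu-ms real-ms gc-ms)
  (time-apply sum-to (list 1000000)))

(printf "result: ~a cpu: ~a ms real: ~a ms gc: ~a ms\n"
        (car results) cpu-ms real-ms gc-ms)
```

Comparing the reported GC time between DrRacket and plain racket gives a rough sense of how much the shared virtual machine affects a measurement.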
19.2 Racket Virtual Machine Implementations
Racket is available in two implementations, CS and BC:
CS is the current default implementation. It is a newer implementation that builds on Chez Scheme as its core virtual machine. This implementation performs better than the BC implementation for most programs.
For this implementation, (system-type 'vm) reports 'chez-scheme and (system-type 'gc) reports 'cs.
BC is an older implementation, and was the default until version 8.0. The implementation features a compiler and runtime written in C, with a precise garbage collector and a just-in-time compiler (JIT) on most platforms.
For this implementation, (system-type 'vm) reports 'racket.
The BC implementation itself has two variants, 3m and CGC:
3m is the normal BC variant with a precise garbage collector.
For this variant, (system-type 'gc) reports '3m.
CGC is the oldest variant. It’s the same basic implementation as 3m (i.e., the same virtual machine), but compiled to rely on a “conservative” garbage collector, which affects the way that Racket interacts with C code. See CGC versus 3m in Inside: Racket C API for more information.
For this variant, (system-type 'gc) reports 'cgc.
In general, Racket programs should run the same in all variants. Furthermore, the performance characteristics of Racket programs should be similar in the CS and BC implementations. The cases where a program may depend on the implementation will typically involve interactions with foreign libraries; in particular, the Racket C API described in Inside: Racket C API is different for the CS implementation versus the BC implementation.
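The system-type results listed above can be combined to report which implementation and variant is running. This is a small sketch; the name implementation and the label strings are illustrative only.

```racket
#lang racket/base

(define vm (system-type 'vm))
(define gc (system-type 'gc))

;; Map the reported symbols to a human-readable label.
(define implementation
  (case vm
    [(chez-scheme) "CS"]
    [(racket) (if (eq? gc 'cgc) "BC (CGC)" "BC (3m)")]
    [else "unknown"]))

(printf "vm: ~a, gc: ~a, implementation: ~a\n" vm gc implementation)
```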
19.3 Bytecode, Machine Code, and Just-in-Time (JIT) Compilers
Every definition or expression to be evaluated by Racket is compiled to an internal bytecode format, although “bytecode” may actually be native machine code. In interactive mode, this compilation occurs automatically and on-the-fly. Tools like raco make and raco setup marshal compiled bytecode to a file, so that you do not have to compile from source every time that you run a program. See Compilation and Configuration: raco for more information on generating bytecode files.
The bytecode compiler applies all standard optimizations, such as constant propagation, constant folding, inlining, and dead-code elimination. For example, in an environment where + has its usual binding, the expression (let ([x 1] [y (lambda () 4)]) (+ 1 (y))) is compiled the same as the constant 5.
For the CS implementation of Racket, the main bytecode format is non-portable machine code. For the BC implementation of Racket, bytecode is portable in the sense that it is machine-independent. Setting current-compile-target-machine to #f selects a separate machine-independent and variant-independent format on all Racket implementations, but running code in that format requires an additional internal conversion step to the implementation’s main bytecode format.
Machine-independent bytecode for the BC implementation is further compiled to native code via a just-in-time or JIT compiler. The JIT compiler substantially speeds programs that execute tight loops, arithmetic on small integers, and arithmetic on inexact real numbers. Currently, JIT compilation is supported for x86, x86_64 (a.k.a. AMD64), 32-bit ARM, and 32-bit PowerPC processors. The JIT compiler can be disabled via the eval-jit-enabled parameter or the --no-jit/-j command-line flag for racket. Setting eval-jit-enabled to #f has no effect on the CS implementation of Racket.
The JIT compiler works incrementally as functions are applied, but the JIT compiler makes only limited use of run-time information when compiling procedures, since the code for a given module body or lambda abstraction is compiled only once. The JIT’s granularity of compilation is a single procedure body, not counting the bodies of any lexically nested procedures. The overhead for JIT compilation is normally so small that it is difficult to detect.
For information about viewing intermediate Racket code representations, especially for the CS implementation, see Inspecting Compiler Passes.
19.4 Modules and Performance
The module system aids optimization by helping to ensure that identifiers have the usual bindings. That is, the + provided by racket/base can be recognized by the compiler and inlined. In contrast, in a traditional interactive Scheme system, the top-level + binding might be redefined, so the compiler cannot assume a fixed + binding (unless special flags or declarations are used to compensate for the lack of a module system).
Even in the top-level environment, importing with require enables some inlining optimizations. Although a + definition at the top level might shadow an imported +, the shadowing definition applies only to expressions evaluated later.
Within a module, inlining and constant-propagation optimizations take additional advantage of the fact that definitions within a module cannot be mutated when no set! is visible at compile time. Such optimizations are unavailable in the top-level environment. Although this optimization within modules is important for performance, it hinders some forms of interactive development and exploration. The compile-enforce-module-constants parameter disables the compiler’s assumptions about module definitions when interactive exploration is more important. See Assignment and Redefinition for more information.
The compiler may inline functions or propagate constants across module boundaries. To avoid generating too much code in the case of function inlining, the compiler is conservative when choosing candidates for cross-module inlining; see Function-Call Optimizations for information on providing inlining hints to the compiler.
The later section letrec Performance provides some additional caveats concerning inlining of module bindings.
19.5 Function-Call Optimizations
When the compiler detects a function call to an immediately visible function, it generates more efficient code than for a generic call, especially for tail calls. For example, given the program
    (letrec ([odd (lambda (x)
                    (if (zero? x)
                        #f
                        (even (sub1 x))))]
             [even (lambda (x)
                     (if (zero? x)
                         #t
                         (odd (sub1 x))))])
      (odd 40000000))
the compiler can detect the odd–even loop and produce code that runs much faster via loop unrolling and related optimizations.
Within a module form, defined variables are lexically scoped like letrec bindings, and definitions within a module therefore permit call optimizations, so
    (define (odd x) ....)
    (define (even x) ....)
within a module would perform the same as the letrec version.
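Spelled out as a complete module, the definition-based version might look like the following sketch, with the bodies copied from the letrec example.

```racket
#lang racket/base

;; Module-level definitions are mutually visible, like letrec bindings,
;; so the calls between odd and even are to immediately visible functions.
(define (odd x)
  (if (zero? x)
      #f
      (even (sub1 x))))

(define (even x)
  (if (zero? x)
      #t
      (odd (sub1 x))))

(odd 40000000)
```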
For direct calls to functions with keyword arguments, the compiler can typically check keyword arguments statically and generate a direct call to a non-keyword variant of the function, which reduces the run-time overhead of keyword checking. This optimization applies only for keyword-accepting procedures that are bound with define.
For immediate calls to functions that are small enough, the compiler may inline the function call by replacing the call with the body of the function. In addition to the size of the target function’s body, the compiler’s heuristics take into account the amount of inlining already performed at the call site and whether the called function itself calls functions other than simple primitive operations. When a module is compiled, some functions defined at the module level are determined to be candidates for inlining into other modules; normally, only trivial functions are considered candidates for cross-module inlining, but a programmer can wrap a function definition with begin-encourage-inline to encourage inlining of the function.
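A minimal sketch of the inlining hint, assuming the begin-encourage-inline form exported by racket/performance-hint; the function square is a hypothetical example. The hint only encourages inlining, so whether a particular call site is actually inlined remains up to the compiler.

```racket
#lang racket/base
(require racket/performance-hint)

(provide square)

;; Wrapping the definition marks square as a candidate for
;; cross-module inlining in modules that import it.
(begin-encourage-inline
  (define (square x)
    (* x x)))
```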
Primitive operations like pair?, car, and cdr are inlined at the machine-code level by the bytecode or JIT compiler. See also the later section Fixnum and Flonum Optimizations for information about inlined arithmetic operations.
19.6 Mutation and Performance
Using set! to mutate a variable can lead to bad performance. For example, the microbenchmark
    #lang racket/base

    (define (subtract-one x)
      (set! x (sub1 x))
      x)

    (time
      (let loop ([n 4000000])
        (if (zero? n)
            'done
            (loop (subtract-one n)))))
runs much more slowly than the equivalent
    #lang racket/base

    (define (subtract-one x)
      (sub1 x))

    (time
      (let loop ([n 4000000])
        (if (zero? n)
            'done
            (loop (subtract-one n)))))
In the first variant, a new location is allocated for x on every iteration, leading to poor performance. A more clever compiler could unravel the use of set! in the first example, but since mutation is discouraged (see Guidelines for Using Assignment), the compiler’s effort is spent elsewhere.
More significantly, mutation can obscure bindings where inlining and constant-propagation might otherwise apply. For example, in
    (let ([minus1 #f])
      (set! minus1 sub1)
      (let loop ([n 4000000])
        (if (zero? n)
            'done
            (loop (minus1 n)))))
the set! obscures the fact that minus1 is just another name for the built-in sub1.
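For comparison, a sketch of the same loop with minus1 bound directly rather than assigned. Without the set!, the binding is visibly an alias for sub1, which leaves the compiler free to apply inlining and constant propagation. The name result is illustrative only.

```racket
#lang racket/base

(define result
  ;; minus1 is never mutated, so it is plainly just another name for sub1.
  (let ([minus1 sub1])
    (let loop ([n 4000000])
      (if (zero? n)
          'done
          (loop (minus1 n))))))
```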
19.7 letrec Performance
When letrec is used to bind only procedures and literals, then the compiler can treat the bindings in an optimal manner, compiling uses of the bindings efficiently. When other kinds of bindings are mixed with procedures, the compiler may be less able to determine the control flow.
For example,
    (letrec ([loop (lambda (x)
                     (if (zero? x)
                         'done
                         (loop (next x))))]
             [junk (display loop)]
             [next (lambda (x) (sub1 x))])
      (loop 40000000))
likely compiles to less efficient code than
    (letrec ([loop (lambda (x)
                     (if (zero? x)
                         'done
                         (loop (next x))))]
             [next (lambda (x) (sub1 x))])
      (loop 40000000))
In the first case, the compiler likely does not know that display does not call loop. If it did, then loop might refer to next before the binding is available.
This caveat about letrec also applies to definitions of functions and constants as internal definitions or in modules. A definition sequence in a module body is analogous to a sequence of letrec bindings, and non-constant expressions in a module body can interfere with the optimization of references to later bindings.
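In a module body, one way to act on this caveat is to keep non-constant expressions after the function definitions they could otherwise interfere with. The sketch below reuses loop and next from the example above; junk and result are illustrative names, and how much the ordering matters depends on the compiler.

```racket
#lang racket/base

;; Function definitions come first, so nothing non-constant sits
;; between loop and the later binding of next that it refers to.
(define (loop x)
  (if (zero? x)
      'done
      (loop (next x))))

(define (next x)
  (sub1 x))

;; The non-constant expression is placed after both definitions.
(define junk (format "~a" loop))

(define result (loop 1000))
```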
19.8 Fixnum and Flonum Optimizations
A fixnum is a small exact integer. In this case, “small” depends on the platform. For a 32-bit machine, numbers that can be expressed in 29-30 bits plus a sign bit are represented as fixnums. On a 64-bit machine, 60-62 bits plus a sign bit are available.
A flonum is used to represent any inexact real number. Flonums correspond to 64-bit IEEE floating-point numbers on all platforms.
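The predicates fixnum? and flonum? from racket/base show which representation a number uses. In this small sketch the sample values are chosen to behave the same on 32-bit and 64-bit platforms, and the name checks is illustrative.

```racket
#lang racket/base

(define checks
  (list (fixnum? 42)            ; small exact integer: a fixnum
        (fixnum? (expt 2 100))  ; too large for a fixnum on any platform
        (flonum? 1.5)           ; inexact real: a flonum
        (flonum? 1/2)           ; exact rational: not a flonum
        (flonum? 3)))           ; exact integer: not a flonum

(printf "~s\n" checks)
```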
Inlined fixnum and flonum arithmetic operations are among the most important advantages of the compiler. For example, when + is applied to two arguments, the generated machine code tests whether the two arguments are fixnums, and if so, it uses the machine’s instruction to add the numbers (and check for overflow). If the two numbers are not fixnums, then it checks whether both are flonums; in that case, the machine’s floating-point operations are used directly. For functions that take any number of arguments, such as +, inlining works for two or more arguments (except for -, whose one-argument case is also inlined) when the arguments are either all fixnums or all flonums.
Flonums are typically boxed, which means that memory is allocated to hold every result of a flonum computation. Fortunately, the generational garbage collector (described later in Memory Management) makes allocation for short-lived results reasonably cheap. Fixnums, in contrast, are never boxed, so they are typically cheap to use.
See Parallelism with Futures for an example use of flonum-specific operations.
The racket/flonum library provides flonum-specific operations, and combinations of flonum operations allow the compiler to generate code that avoids boxing and unboxing intermediate results. Besides results within immediate combinations, flonum-specific results that are bound with let and consumed by a later flonum-specific operation are unboxed within temporary storage. Unboxing applies most reliably to uses of a flonum-specific operation with two arguments. Finally, the compiler can detect some flonum-valued loop accumulators and avoid boxing of the accumulator. Unboxing of local bindings and accumulators is not supported by the BC implementation’s JIT for PowerPC.
For some loop patterns, the compiler may need hints to enable unboxing. For example:
    (define (flvector-sum vec init)
      (let loop ([i 0] [sum init])
        (if (fx= i (flvector-length vec))
            sum
            (loop (fx+ i 1) (fl+ sum (flvector-ref vec i))))))
The compiler may not be able to unbox sum in this example for two reasons: it cannot determine locally that its initial value from init will be a flonum, and it cannot tell locally that the eq? identity of the result sum is irrelevant. Changing the reference init to (fl+ init) and changing the result sum to (fl+ sum) gives the compiler hints and license to unbox sum.
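With both changes applied, the function might look like the sketch below. It assumes a Racket version in which fl+ accepts a single argument; the requires are added so that the module stands alone.

```racket
#lang racket/base
(require racket/flonum racket/fixnum)

(define (flvector-sum vec init)
  ;; (fl+ init) tells the compiler that the accumulator starts as a flonum.
  (let loop ([i 0] [sum (fl+ init)])
    (if (fx= i (flvector-length vec))
        ;; (fl+ sum) makes the eq? identity of the result irrelevant.
        (fl+ sum)
        (loop (fx+ i 1) (fl+ sum (flvector-ref vec i))))))

(flvector-sum (flvector 1.0 2.0 3.5) 0.0)
```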
The bytecode decompiler (see raco decompile: Decompiling Bytecode) for the BC implementation annotates combinations where the JIT can avoid boxes with #%flonum, #%as-flonum, and #%from-flonum. For the CS implementation, the “bytecode” decompiler shows machine code; installing the "disassemble" package may make that machine code viewable as machine-specific assembly code. See also Inspecting Compiler Passes.
The racket/unsafe/ops library provides unchecked fixnum- and flonum-specific operations. Unchecked flonum-specific operations allow unboxing, and sometimes they allow the compiler to reorder expressions to improve performance. See also Unchecked, Unsafe Operations, especially the warnings about unsafety.
19.9 Unchecked, Unsafe Operations
The racket/unsafe/ops library provides functions that are like other functions in racket/base, but they assume (instead of checking) that provided arguments are of the right type. For example, unsafe-vector-ref accesses an element from a vector without checking that its first argument is actually a vector and without checking that the given index is in bounds. For tight loops that use these functions, avoiding checks can sometimes speed the computation, though the benefits vary for different unchecked functions and different contexts.
Beware that, as “unsafe” in the library and function names suggest, misusing the exports of racket/unsafe/ops can lead to crashes or memory corruption.
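One way to limit the risk is to check once at the boundary and use unchecked operations only inside the loop. The following is a sketch under that assumption; vector-sum is a hypothetical helper, and any speedup depends on the context.

```racket
#lang racket/base
(require racket/unsafe/ops)

(define (vector-sum vec)
  ;; A single checked test up front establishes that vec is a vector.
  (unless (vector? vec)
    (raise-argument-error 'vector-sum "vector?" vec))
  (define len (vector-length vec))
  (let loop ([i 0] [sum 0])
    (if (= i len)
        sum
        ;; i always stays within [0, len), so the unchecked access is in bounds.
        (loop (add1 i) (+ sum (unsafe-vector-ref vec i))))))

(vector-sum (vector 1 2 3 4))
```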
19.10 Foreign Pointers
The ffi/unsafe library provides functions for unsafely reading and writing arbitrary pointer values. The compiler recognizes uses of ptr-ref and ptr-set! where the second argument is a direct reference to one of the following built-in C types: _int8, _int16, _int32, _int64, _double, _float, and _pointer. If the first argument to ptr-ref or ptr-set! is then a C pointer (not a byte string), the pointer read or write is performed inline in the generated code.
The bytecode compiler will optimize references to integer abbreviations like _int to C types like _int32.
Pointer reads and writes using _float or _double are not currently subject to unboxing optimizations.
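A minimal sketch of the recognized pattern: ptr-set! and ptr-ref applied to a C pointer, with _int32 named directly as the second argument. The helper roundtrip-int32 is hypothetical. Memory allocated in 'raw mode is not managed by the garbage collector and must be released with free.

```racket
#lang racket/base
(require ffi/unsafe)

(define (roundtrip-int32 n)
  ;; Allocate space for one _int32 outside the garbage-collected heap.
  (define p (malloc _int32 'raw))
  ;; The second argument is a direct reference to a built-in C type.
  (ptr-set! p _int32 n)
  (define v (ptr-ref p _int32))
  (free p)
  v)

(roundtrip-int32 42)
```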
19.11 Regular Expression Performance
When a string or byte string is provided to a function like regexp-match, then the string is internally compiled into a regexp value. Instead of supplying a string or byte string multiple times as a pattern for matching, compile the pattern once to a regexp value using regexp, byte-regexp, pregexp, or byte-pregexp. In place of a constant string or byte string, write a constant regexp using an #rx or #px prefix.
    (define (slow-matcher str)
      (regexp-match? "[0-9]+" str))

    (define (fast-matcher str)
      (regexp-match? #rx"[0-9]+" str))

    (define (make-slow-matcher pattern-str)
      (lambda (str)
        (regexp-match? pattern-str str)))

    (define (make-fast-matcher pattern-str)
      (define pattern-rx (regexp pattern-str))
      (lambda (str)
        (regexp-match? pattern-rx str)))
19.12 Memory Management
The CS (default) and BC Racket virtual machines each use a modern, generational garbage collector that makes allocation relatively cheap for short-lived objects. The CGC variant of BC uses a conservative garbage collector, which facilitates interaction with C code at the expense of both precision and speed for Racket memory management.
Although memory allocation is reasonably cheap, avoiding allocation altogether is often faster. One particular place where allocation can be avoided sometimes is in closures, which are the run-time representation of functions that contain free variables. For example,
    (let loop ([n 40000000] [prev-thunk (lambda () #f)])
      (if (zero? n)
          (prev-thunk)
          (loop (sub1 n)
                (lambda () n))))
allocates a closure on every iteration, since (lambda () n) effectively saves n.
The compiler can eliminate many closures automatically. For example, in
    (let loop ([n 40000000] [prev-val #f])
      (let ([prev-thunk (lambda () n)])
        (if (zero? n)
            prev-val
            (loop (sub1 n) (prev-thunk)))))
no closure is ever allocated for prev-thunk, because its only application is visible, and so it is inlined. Similarly, in
    (let n-loop ([n 400000])
      (if (zero? n)
          'done
          (let m-loop ([m 100])
            (if (zero? m)
                (n-loop (sub1 n))
                (m-loop (sub1 m))))))

the expansion of the let form to implement m-loop involves a closure over n, but the compiler automatically converts the closure to pass itself n as an argument instead.
19.13 Reachability and Garbage Collection
In general, Racket re-uses the storage for a value when the garbage collector can prove that the object is unreachable from any other (reachable) value. Reachability is a low-level, abstraction-breaking concept, and thus it requires detailed knowledge of the runtime system to predict exactly when values are reachable from each other. But generally one value is reachable from a second one when there is some operation to recover the original value from the second one.
To help programmers understand when an object is no longer reachable and its storage can be reused, Racket provides make-weak-box and weak-box-value, the creator and accessor for a one-record struct that the garbage collector treats specially. An object inside a weak box does not count as reachable, and so weak-box-value might return the object inside the box, but it might also return #f to indicate that the object was otherwise unreachable and garbage collected. Note that unless a garbage collection actually occurs, the value will remain inside the weak box, even if it is unreachable.
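For example, in the following program the fish remains reachable through f, so both printf calls should show the fish:

    #lang racket
    (struct fish (weight color) #:transparent)
    (define f (fish 7 'blue))
    (define b (make-weak-box f))
    (printf "b has ~s\n" (weak-box-value b))
    (collect-garbage)
    (printf "b has ~s\n" (weak-box-value b))

In this variant, the set! removes the only other reference to the fish before the collection, so the second printf should show #f:

    #lang racket
    (struct fish (weight color) #:transparent)
    (define f (fish 7 'blue))
    (define b (make-weak-box f))
    (printf "b has ~s\n" (weak-box-value b))
    (set! f #f)
    (collect-garbage)
    (printf "b has ~s\n" (weak-box-value b))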
Small integers (recognizable with fixnum?) are always available without explicit allocation. From the perspective of the garbage collector and weak boxes, their storage is never reclaimed. (Due to clever representation techniques, however, their storage does not count towards the space that Racket uses. That is, they are effectively free.)
Procedures where the compiler can see all of their call sites may never be allocated at all (as discussed above). Similar optimizations may also eliminate the allocation for other kinds of values.
Interned symbols are allocated only once (per place). A table inside Racket tracks this allocation so a symbol may not become garbage because that table holds onto it.
Reachability is only approximate with the CGC collector (i.e., a value may appear reachable to that collector when there is, in fact, no way to reach it anymore).
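The point about interned symbols can be observed with eq?. In this sketch, string->symbol returns the single interned symbol for a name, while string->uninterned-symbol allocates a fresh symbol each time. The names interned-same? and uninterned-same? are illustrative.

```racket
#lang racket/base

;; The interned symbol for "fish" is allocated once, so both
;; expressions below refer to the same object.
(define interned-same?
  (eq? (string->symbol "fish") 'fish))

;; An uninterned symbol is a fresh allocation that is never eq?
;; to the interned symbol with the same name.
(define uninterned-same?
  (eq? (string->uninterned-symbol "fish") 'fish))

(printf "interned: ~s, uninterned: ~s\n" interned-same? uninterned-same?)
```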
19.14 Weak Boxes and Testing
One important use of weak boxes is in testing that some abstraction properly releases storage for data it no longer needs, but there is a gotcha that can easily cause such test cases to pass improperly.
Imagine you’re designing a data structure that needs to hold onto some value temporarily but then should clear a field or somehow break a link to avoid referencing that value so it can be collected. Weak boxes are a good way to test that your data structure properly clears the value. That is, you might write a test case that builds a value, extracts some other value from it (that you hope becomes unreachable), puts the extracted value into a weak-box, and then checks to see if the value disappears from the box.
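The following attempt at that pattern shows the gotcha:

    #lang racket
    ;; fish struct as defined in the previous section
    (struct fish (weight color) #:transparent)
    (let* ([fishes (list (fish 8 'red) (fish 7 'blue))]
           [wb (make-weak-box (list-ref fishes 0))])
      (collect-garbage)
      (printf "still there? ~s\n" (weak-box-value wb)))

This program can print #f, but not because anything released the red fish: fishes is not used after the weak box is created, so the list itself may already be unreachable when the collection runs. Adding a later reference to fishes keeps the list, and therefore the red fish, reachable during the collection:

    #lang racket
    ;; fish struct as defined in the previous section
    (struct fish (weight color) #:transparent)
    (let* ([fishes (list (fish 8 'red) (fish 7 'blue))]
           [wb (make-weak-box (list-ref fishes 0))])
      (collect-garbage)
      (printf "still there? ~s\n" (weak-box-value wb))
      (printf "fishes is ~s\n" fishes))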
19.15 Reducing Garbage Collection Pauses
By default, Racket’s generational garbage collector creates brief pauses for frequent minor collections, which inspect only the most recently allocated objects, and long pauses for infrequent major collections, which re-inspect all memory.
For some applications, such as animations and games, long pauses due to a major collection can interfere unacceptably with a program’s operation. To reduce major-collection pauses, the 3m garbage collector supports incremental garbage-collection mode, and the CS garbage collector supports a useful approximation:
In 3m’s incremental mode, minor collections create longer (but still relatively short) pauses by performing extra work toward the next major collection. If all goes well, most of a major collection’s work has been performed by minor collections by the time that a major collection is needed, so the major collection’s pause is as short as a minor collection’s pause. Incremental mode tends to run more slowly overall, but it can provide much more consistent real-time behavior.
In CS’s incremental mode, objects are never promoted out of the category of “recently allocated,” although there are degrees of “recently” so that most minor collections can still skip recent-but-not-too-recent objects. In the common case that most of the memory use for an animation or game is allocated on startup (including its code and the code of the Racket runtime system), a major collection may never become necessary.
If the PLT_INCREMENTAL_GC environment variable is set to a value that starts with 0, n, or N when Racket starts, incremental mode is permanently disabled. For 3m, if the PLT_INCREMENTAL_GC environment variable is set to a value that starts with 1, y, or Y when Racket starts, incremental mode is permanently enabled. Since incremental mode is only useful for certain parts of some programs, however, and since the need for incremental mode is a property of a program rather than its environment, the preferred way to enable incremental mode is with (collect-garbage 'incremental).
Calling (collect-garbage 'incremental) does not perform an immediate garbage collection, but instead requests that each minor collection perform incremental work up to the next major collection (unless incremental mode is permanently disabled). The request expires with the next major collection. Make a call to (collect-garbage 'incremental) in any repeating task within an application that needs to be responsive in real time. Force a full collection with (collect-garbage) just before an initial (collect-garbage 'incremental) to initiate incremental mode from an optimal state.
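The recommended call pattern can be sketched as follows. Both step! and run-frames are hypothetical stand-ins for an application's per-frame work and main loop.

```racket
#lang racket/base

;; Hypothetical per-frame work that allocates some short-lived data.
(define (step! i)
  (make-vector 100 i))

(define (run-frames n)
  ;; Start from an optimal state: a full collection, followed by the
  ;; initial request for incremental mode.
  (collect-garbage)
  (collect-garbage 'incremental)
  (for ([i (in-range n)])
    ;; Renew the request in the repeating task, since it expires
    ;; with the next major collection.
    (collect-garbage 'incremental)
    (step! i))
  n)

(run-frames 100)
```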
To check whether incremental mode is in use and how it affects pause times, enable debug-level logging output for the GC topic. For example,
racket -W "debug@GC error" main.rkt
runs "main.rkt" with garbage-collection logging to stderr (while preserving error-level logging for all topics). Minor collections are reported by min lines, incremental-mode minor collections on 3m are reported with mIn lines, and major collections are reported with MAJ lines.