Extracting binary data from bytestrings using match

This module introduces a new match pattern for matching and destructuring binary data encoded in a bytestring.

The API is at an early alpha stage and may change in incompatible ways.

Some similar packages include xenomorph and the "#lang" based binfmt.

1 The binary match pattern

syntax
(binary byte-pattern ...+ maybe-rest)

byte-pattern = (bytes pat length)
             | (zero-padded pat length)
             | (until-byte pat byte)
             | (until-byte* pat byte)
             | (length-prefixed pat)
             | (length-prefixed pat prefix-length endianness)
             | (number-type pat)
             | (number-type pat endianness)
             | control-pattern

maybe-rest =
           | (rest* pat)

control-pattern = (get-offset pat)
                | (set-offset! offset)

number-type = s8
            | u8
            | s16
            | u16
            | s32
            | u32
            | u64
            | s64
            | f32
            | f64

prefix-length = u8
              | u16
              | u32
              | u64

endianness = big-endian
           | little-endian
           | native-endian
           | host-order
           | network-order

   byte : byte?
   length : (and/c fixnum? positive?)
   offset : (and/c fixnum? (>=/c 0))

A match expander that, when matched against a bytestring, tries to destructure it according to the given spec and matches the extracted values against the given match patterns.

An example:

(match #"\17\240bc"
  ((binary (s16 num big-endian) (bytes rest 2))
   (list num rest))) ; => '(4000 #"bc")
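For comparison, the same extraction can be spelled out by hand with standard racket/base functions. This is only a sketch of what the pattern computes, not a description of how the module is implemented; the name tail is chosen here for illustration.

```racket
#lang racket/base

;; Hand-written equivalent of the example above, using only racket/base.
(define bs #"\17\240bc")

;; (s16 num big-endian): bytes 0-1 read as a signed, big-endian 16-bit integer.
;; #o17 = 15 and #o240 = 160, so the value is 15 * 256 + 160.
(define num (integer-bytes->integer bs #t #t 0 2))

;; (bytes rest 2): the next two bytes, taken verbatim.
(define tail (subbytes bs 2 4))

(list num tail) ; => '(4000 #"bc")
```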

bytes extracts a fixed-width field. zero-padded extracts a fixed-width field and strips trailing 0 bytes. until-byte extracts bytes until the given delimiter byte is encountered; if the delimiter is never found, the match fails. until-byte* is the same, except that failing to find the delimiter is not a match failure. length-prefixed reads a length header and then that many bytes. If the header format is not given explicitly, it defaults to the one used by the 9P protocol: a 2-byte little-endian length.
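To make two of these field types concrete, here is a minimal sketch in plain racket/base of what zero-padded and the default length-prefixed are described as extracting. The helper names read-zero-padded and read-length-prefixed are hypothetical and not part of this module, and the sketch assumes the length header is unsigned.

```racket
#lang racket/base

;; Hypothetical helpers for illustration; not part of this module.

;; Like (zero-padded pat length): take a fixed-width field starting at
;; `start` and drop any trailing 0 bytes.
(define (read-zero-padded bs start width)
  (define field (subbytes bs start (+ start width)))
  (let loop ([end width])
    (if (and (> end 0) (zero? (bytes-ref field (sub1 end))))
        (loop (sub1 end))
        (subbytes field 0 end))))

;; Like (length-prefixed pat) with its default header: a 2-byte
;; little-endian length (assumed unsigned) followed by that many bytes.
(define (read-length-prefixed bs start)
  (define len (integer-bytes->integer bs #f #f start (+ start 2)))
  (subbytes bs (+ start 2) (+ start 2 len)))

(read-zero-padded #"abc\0\0\0" 0 6)    ; => #"abc"
(read-length-prefixed #"\5\0hello!" 0) ; => #"hello"
```

In the second call the trailing ! is left unread, which is the kind of leftover data a rest* pattern would pick up.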

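Each number pattern reads a signed (s) or unsigned (u) integer of the given bit width, or a 32- or 64-bit floating-point number (f32, f64), using the given endianness.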

rest* takes any bytes remaining at the end of the bytestring after everything else has been matched; if there are none, its pattern is matched against an empty bytestring.

Normally, matching starts at the first byte of the bytestring. (set-offset! offset) changes the current position, which helps when matching bytestrings that contain multiple records, and get-offset matches its pattern against the current index at that point in the match.

A more complex example that matches an IPv4 header:

(parameterize ([binary-match-default-endianness 'network-order])
  (match header
    ((binary
      (u8 (app byte->nybbles version header-length)) (u8 service-type) (u16 total-length)
      (u16 identification) (u16 flags+fragment)
      (u8 ttl) (u8 protocol) (u16 checksum)
      (bytes (app make-ip-address source-address) 4)
      (u32 (app (lambda (n) (make-ip-address n 4)) dest-address))
      (rest* options))
     (list version header-length service-type total-length ttl protocol
           (ip-address->string source-address) (ip-address->string dest-address)
           options))))

2 Additional functions

procedure
(byte->nybbles b) → byte? byte?
  b : byte?

Splits a single byte into two 4-bit nybbles. The upper 4 bits are the first value; the lower 4 bits are the second.
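The split amounts to a shift and a mask. The following sketch uses a hypothetical stand-in named split-nybbles rather than the module's own function, and applies it to #x45, a common first byte of an IPv4 header.

```racket
#lang racket/base

;; Hypothetical stand-in for byte->nybbles, for illustration only.
(define (split-nybbles b)
  (values (arithmetic-shift b -4)  ; upper 4 bits
          (bitwise-and b #x0F)))   ; lower 4 bits

;; #x45 as the first byte of an IPv4 header:
;; version 4, header length 5 (counted in 32-bit words).
(define-values (version header-length) (split-nybbles #x45))

(list version header-length) ; => '(4 5)
```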

parameter
(binary-match-default-endianness)
→ (or/c 'big-endian 'little-endian 'native-endian)
(binary-match-default-endianness endianness) → void?
endianness : (or/c 'big-endian 'little-endian 'native-endian 'network-order 'host-order)
= 'native-endian

A parameter that controls the endianness used by numeric patterns when one isn’t explicitly given.
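Because the default is 'native-endian, the same pattern can yield different numbers on different machines unless an endianness is given. The sketch below shows the difference using only racket/base; it assumes the usual meanings of the names, with 'network-order being big-endian and 'host-order being the machine's native order.

```racket
#lang racket/base

;; The same two bytes read as an unsigned 16-bit integer in each byte order.
(define bs #"\1\0")

;; 'big-endian (and, by assumption, 'network-order)
(define as-big (integer-bytes->integer bs #f #t))

;; 'little-endian
(define as-little (integer-bytes->integer bs #f #f))

;; 'native-endian (and, by assumption, 'host-order) follows the
;; machine running the program.
(define as-native (integer-bytes->integer bs #f (system-big-endian?)))

(list as-big as-little) ; => '(256 1)
```

For wire formats and file formats, setting the parameter once, as the IPv4 example does with 'network-order, avoids depending on the host.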