binfmt: binary format parser generator
#lang binfmt    package: binfmt-lib
This package provides a #lang for building binary format parsers with support for limited context-sensitivity.
1 Example
Here is a parser definition for the ID3v1 format:
#lang binfmt
id3 = magic title artist album year comment genre;
magic = 'T' 'A' 'G';
title = u8{30};
artist = u8{30};
album = u8{30};
year = u8{4};
comment = u8{30};
genre = u8;
Assuming this is saved in a file called "id3v1.b", you can import it from Racket and apply any of the definitions to an input port in order to parse its contents:
> (require "id3v1.b")
You can parse the magic header by itself:
> (magic (open-input-bytes #"TAG"))
'((char_1 . #\T) (char_2 . #\A) (char_3 . #\G))
Or a full tag:
> (define data
    (bytes-append
     #"TAGCreative Commons Song         Improbulus                    N"
     #"/A                           2005Take on O Mio Babbino Caro!   g"))
> (define tree (id3 (open-input-bytes data)))
And inspect the resulting parse tree:
> (map car tree)
'(magic_1 title_1 artist_1 album_1 year_1 comment_1 genre_1)
> (define ref (compose1 cdr assq))
> (take (ref 'title_1 tree) 8)
'(67 114 101 97 116 105 118 101)
> (apply bytes (ref 'title_1 tree))
#"Creative Commons Song         "
Finally, parsing invalid data results in a syntax error:
> (id3 (open-input-bytes #"TAG..."))
parse failed
expected 'u8' but found EOF
in: string
position: 7
Every definition automatically creates an un-parser. Un-parsers are functions that take a parse tree as input and serialize the data to an output port. They are named by prepending un- to the name of a definition.
> (define bs (call-with-output-bytes (lambda (out) (un-id3 tree out))))
> (for ([n (in-range 0 (bytes-length bs) 64)]) (println (subbytes bs n (+ n 64))))
#"TAGCreative Commons Song         Improbulus                    N"
#"/A                           2005Take on O Mio Babbino Caro!   g"
2 Grammar and Operation
The grammar for binfmt is as follows:
‹def›     ::= ‹id› = ‹alt› {| ‹alt›}* ;
‹alt›     ::= ‹expr›+
‹expr›    ::= ‹term› | ‹star› | ‹plus› | ‹repeat›
‹star›    ::= ‹term› *
‹plus›    ::= ‹term› +
‹repeat›  ::= ‹term› { ‹id› | ‹natural› }
‹term›    ::= ‹byte›
            | ‹char›
            | ‹id›
‹byte›    ::= an integer between 0x00 and 0xFF
‹char›    ::= ' ascii character '
‹id›      ::= any identifier
‹natural› ::= any natural number
Within an ‹alt›, each ‹expr› is assigned a unique name based on its ‹id›: the first time an ‹id› appears in an ‹alt›, _1 is appended to its name, the second time _2, and so on.
Alternatives containing two or more ‹expr›s parse to an association list mapping ‹expr› names (as defined above) to parse results. Alternatives containing a single ‹expr› collapse to the result of the ‹expr›.
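The numbering rule can be sketched in plain Racket. The helper assign-names below is hypothetical and exists only to illustrate the scheme; it is not part of binfmt and says nothing about how the library implements naming internally.

```racket
#lang racket/base

;; Hypothetical helper, not part of binfmt: given the ids that appear in
;; one alternative (in order), return the names the rule above assigns.
(define (assign-names ids)
  (define counts (make-hasheq))
  (for/list ([id (in-list ids)])
    ;; Bump the per-id counter, then append it as a suffix.
    (define n (add1 (hash-ref counts id 0)))
    (hash-set! counts id n)
    (string->symbol (format "~a_~a" id n))))

(assign-names '(magic title artist))  ; '(magic_1 title_1 artist_1)
(assign-names '(u8 u8 strlen u8))     ; '(u8_1 u8_2 strlen_1 u8_3)
```

Counters are kept per ‹id›, so a repeated ‹id› gets increasing suffixes while every other ‹id› starts again at _1.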
The ‹repeat› syntax can either repeat a parser an exact number of times or repeat it based on the result of a previous parser within the same ‹alt›. For example, the following parser parses an i8 to determine the length of a string, then parses that number of u8s following it.
#lang binfmt
string = strlen u8{strlen_1};
strlen = i8;
Negative length values are allowed, in which case they’re treated the same as 0. The parser above would parse #"\377" to an empty string.
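To make the length-prefix behavior concrete, here is a hand-written sketch of roughly what the string parser does, in plain Racket. The function parse-string is hypothetical, and the result shape (an association list with strlen_1 and u8_1 keys) is an assumption based on the naming rules described earlier; the parser that binfmt actually generates may differ in detail.

```racket
#lang racket/base

;; Hand-written stand-in for `string = strlen u8{strlen_1};`, for
;; illustration only. binfmt generates its own parser.
(define (parse-string in)
  (define b (read-byte in))
  (when (eof-object? b)
    (error 'parse-string "expected 'i8' but found EOF"))
  ;; Reinterpret the unsigned byte as a signed 8-bit integer (i8).
  (define len (if (> b 127) (- b 256) b))
  ;; Negative lengths behave like 0: no payload bytes are read.
  (define payload
    (for/list ([i (in-range (max 0 len))])
      (define c (read-byte in))
      (when (eof-object? c)
        (error 'parse-string "expected 'u8' but found EOF"))
      c))
  ;; Assumed result shape: two exprs, so an association list.
  (list (cons 'strlen_1 len) (cons 'u8_1 payload)))

(parse-string (open-input-bytes #"\3abc"))  ; '((strlen_1 . 3) (u8_1 97 98 99))
(parse-string (open-input-bytes #"\377"))   ; '((strlen_1 . -1) (u8_1))
```

The second call shows the negative-length case: #"\377" decodes to -1 as an i8, so the payload is empty.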
The following parsers are built-in:
TODO
u8, u16, u32, u64, u16le, u32le, u64le, u16be, u32be, u64be
i8, i16, i32, i64, i16le, i32le, i64le, i16be, i32be, i64be
f32, f64, f32le, f64le, f32be, f64be
uvarint32, uvarint64
varint32, varint64
nul, eof
Parsers for ‹alt›s may backtrack, but backtracking is only supported on file and string input ports. All other types of ports (e.g., pipes and custom ports that don’t support setting a file position) cause backtracking to fail with a parsing error.
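One possible workaround for ports that cannot be repositioned is to read the whole input into memory first and parse from a byte-string port instead. The helper buffered-input below is hypothetical (it is not provided by binfmt), and the approach only makes sense when the input is small enough to hold in memory.

```racket
#lang racket/base
(require racket/port)

;; Hypothetical helper, not part of binfmt: drain any input port into
;; memory and return a byte-string port, which supports repositioning.
(define (buffered-input in)
  (open-input-bytes (port->bytes in)))

;; Demo with a pipe, one of the port kinds that cannot be repositioned.
(define-values (pipe-in pipe-out) (make-pipe))
(void (write-bytes #"TAG" pipe-out))
(close-output-port pipe-out)

(define in (buffered-input pipe-in))
(read-bytes 3 in)     ; consume the input: #"TAG"
(file-position in 0)  ; rewind, as a backtracking parser would need to
(read-bytes 3 in)     ; the same bytes again: #"TAG"
```

The port returned by buffered-input can then be passed to a generated parser in place of the original port.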
When parsing or un-parsing fails, an exception satisfying exn:fail:binfmt? is raised.
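A failure can be caught with with-handlers, as in the following sketch. It assumes the "id3v1.b" module from the Example section is in the same directory, that exn:fail:binfmt? is imported from binfmt/runtime, and that the exception carries the usual message field readable with exn-message; try-parse-id3 is a hypothetical wrapper name.

```racket
#lang racket/base
(require binfmt/runtime
         "id3v1.b")  ; the ID3v1 parser from the Example section

;; Hypothetical wrapper: return the parse tree, or #f when the input
;; is not a valid ID3v1 tag.
(define (try-parse-id3 in)
  (with-handlers ([exn:fail:binfmt?
                   (lambda (e)
                     ;; Assumes the standard exn message field.
                     (eprintf "~a\n" (exn-message e))
                     #f)])
    (id3 in)))

(try-parse-id3 (open-input-bytes #"TAG..."))
```

Matching on exn:fail:binfmt? rather than on exn:fail? keeps unrelated errors, such as I/O failures, from being silently swallowed.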
3 Reference
(require binfmt/runtime)    package: binfmt-lib
procedure
(exn:fail:binfmt? v) → boolean?
v : any/c
procedure
(exn:fail:binfmt-id e) → symbol?
e : exn:fail:binfmt?