Lexer and yacc examples
by Petter Olav Pripp <petter.pripp@yahoo.com>
Copyright (C) 2024 - Petter Olav Pripp
The source code is distributed under the GNU General Public License.
The source code is at
https://github.com/petterpripp/lex-yacc-example
Any suggestion or corrections are welcome.
1 Description
The examples show use of lexer and yacc-style parser. Useful for making syntax in a domain-specific language.
lex-yacc-example/rpcalc
lex-yacc-example/calc
lex-yacc-example/mfcalc
2 Reverse Polish Notation Calculator
Based on GNU Bison RPN Calc example.
https://www.gnu.org/software/bison/manual/bison.html#RPN-Calc
The example is that of a simple double-precision Reverse Polish Notation calculator (a calculator using postfix operators). This example provides a good starting point, since operator precedence is not an issue.
2.1 BNF
| ‹exp› | ::= | ‹number› |
|
| | | ‹exp› ‹exp› + |
|
| | | ‹exp› ‹exp› - |
|
| | | ‹exp› ‹exp› * |
|
| | | ‹exp› ‹exp› / |
|
| | | ‹exp› ‹exp› ^ |
|
| | | ‹exp› - |
2.2 Lexer
The lexer translate code to tokens. This will be input to the parser. Below is the full code for the lexer. In the next sections we will look into the code.
#lang racket (require parser-tools/lex (prefix-in : parser-tools/lex-sre)) (define-tokens value-tokens (NUMBER)) (define-empty-tokens op-tokens (EOF ADD SUBTRACT PRODUCT DIVISION POWER NEG)) (define next-token (lexer-src-pos [(eof) (token-EOF)] [(:+ whitespace) (return-without-pos (next-token input-port))] [#\+ (token-ADD)] [#\- (token-SUBTRACT)] [#\* (token-PRODUCT)] [#\/ (token-DIVISION)] [#\^ (token-POWER)] [#\n (token-NEG)] [(:: (:+ numeric) (:* (:: #\. (:+ numeric) ))) (token-NUMBER (string->number lexeme))])) (provide value-tokens op-tokens next-token)
2.2.1 Two types of tokens
(define-tokens value-tokens (NUMBER)) (define-empty-tokens op-tokens (EOF ADD SUBTRACT PRODUCT DIVISION POWER NEG))
Value tokens combines the token-id and the value.
Empty tokens is only token-id.
2.2.2 lexer-src-pos
The lexer uses regular expressions from (require parser-tools/lex (prefix-in : parser-tools/lex-sre)). When multiple patterns match, a lexer will choose the longest match, breaking ties in favor of the rule appearing first. The lexer will return tokens with source information. Below is explanation of some of the rules.
[#\+ (token-ADD)]
When lexer finds ’+’ it will return token ADD.
[(:+ whitespace) (return-without-pos (next-token input-port))]
Recursively call the lexer on the remaining input after a tab or space. Returning the result of that operation. This effectively skips all whitespace.
[(:: (:+ numeric) (:* (:: #\. (:+ numeric)))) (token-NUMBER (string->number lexeme))]
The lexer return both token-id NUMBER and the number combined to one value-token.
2.2.3 Testing the lexer
#lang racket (require "lexer.rkt" parser-tools/lex) (define (lex-test ip) (port-count-lines! ip) (letrec ([one-line (lambda () (let ([result (next-token ip)]) (unless (equal? (position-token-token result) 'EOF) (printf "~a\n" result) (one-line) )))]) (one-line))) (define (my-lex-test str) (lex-test (open-input-string str))) (provide my-lex-test)
> (my-lex-test "1 + * -")
#(struct:position-token #(struct:token NUMBER 1) #(struct:position 1 1 0) #(struct:position 2 1 1))
#(struct:position-token ADD #(struct:position 3 1 2) #(struct:position 4 1 3))
#(struct:position-token PRODUCT #(struct:position 5 1 4) #(struct:position 6 1 5))
#(struct:position-token SUBTRACT #(struct:position 7 1 6) #(struct:position 8 1 7))
> (my-lex-test "3.14 +\n ^ # -")
#(struct:position-token #(struct:token NUMBER 3.14) #(struct:position 1 1 0) #(struct:position 5 1 4))
#(struct:position-token ADD #(struct:position 6 1 5) #(struct:position 7 1 6))
#(struct:position-token POWER #(struct:position 9 2 1) #(struct:position 10 2 2))
lexer: No match found in input starting with: #
2.3 Parser
This is the full code for the parser. In the next sections we will look into the code.
#lang racket (require parser-tools/yacc "lexer.rkt") (define myparser (parser (start exp) (end EOF) (tokens value-tokens op-tokens ) (src-pos) (error (lambda (a b c d e) (begin (printf "a = ~a\nb = ~a\nc = ~a\nd = ~a\ne = ~a\n" a b c d e) (void)))) (grammar (exp [(NUMBER) $1] [(exp exp ADD) (+ $1 $2)] [(exp exp SUBTRACT) (- $1 $2)] [(exp exp PRODUCT) (* $1 $2)] [(exp exp DIVISION) (/ $1 $2)] [(exp exp POWER) (expt $1 $2)] [(exp NEG) (- $1)])))) (define (parse ip) (port-count-lines! ip) (myparser (lambda () (next-token ip)))) (provide parse )
2.3.1 Explanation of exp grammar
Grammar can have many grouping. In this example it has only exp. The exp grouping has several rules, one for each kind of expression. The first rule handles the simplest expressions: those that are just numbers. The second handles an addition-expression, which looks like two expressions followed by a plus-sign. The third handles subtraction, and so on.
(grammar (exp [(NUMBER) $1] [(exp exp ADD) (+ $1 $2)] [(exp exp SUBTRACT) (- $1 $2)] [(exp exp PRODUCT) (* $1 $2)] [(exp exp DIVISION) (/ $1 $2)] [(exp exp POWER) (expt $1 $2)] [(exp NEG) (- $1)]))
The rules have actions that compute the value of the expression in terms of the value of its parts. For example, in the rule for addition, $1 refers to the first component exp and $2 refers to the second one.
2.3.2 The other components of parser
(start exp)
(end EOF)
(tokens value-tokens op-tokens)
(src-pos)
(error (lambda (a b c d e) (begin (printf "a = ~a\nb = ~a\nc = ~a\nd = ~a\ne = ~a\n" a b c d e) (void))))
2.3.3 parse function
(define (parse ip) (port-count-lines! ip) (myparser (lambda () (next-token ip))))
Wrapper around the parser. It handles the call to the lexer.
(port-count-lines! ip)
2.3.4 Testing the parser
> (parse (open-input-string "20 3 5 * 7 + + ")) 42
> (parse (open-input-string "2 4 ^ n ")) -16
> (parse (open-input-string "1 2 3 / + * "))
a = #t
b = PRODUCT
c = #f
d = #(struct:position 11 1 10)
e = #(struct:position 12 1 11)
parser: Cannot continue after error
2.4 Language
We wrap it up with making a lex-yacc-example/rpcalc language.
2.4.1 Reader
#lang racket (require "parser.rkt" ) (provide (rename-out [my-read read] [my-read-syntax read-syntax])) (define (my-read in) (syntax->datum (my-read-syntax #f in))) (define (my-read-syntax path port) (datum->syntax #f `(module rpcalc-mod racket ,(parse port))))
2.4.2 lex-yacc-example/rpcalc
Project directory should be lex-yacc-example/
Run "raco pkg install" in this directory, that will enable lex-yacc-example language’s used in examples below.
The file "rpcalc.rkt" in the top-level directory enables #lang lex-yacc-example/rpcalc
#lang racket (module reader racket (require "rpcalc/reader.rkt") (provide read read-syntax))
2.4.3 Main
Optionally, going down to directory rpcalc, and running "raco pkg install" will enable #lang rpcalc
The file "main.rkt":
#lang racket (module reader racket (require "reader.rkt") (provide read read-syntax))
2.4.4 Running rpcalc
File "rpcalc-test.rkt":
#lang lex-yacc-example/rpcalc 2 3 4 5 + + ^ n
> (require lex-yacc-example/rpcalc/rpcalc-test) -4096
3 Infix Notation Calculator
Based on GNU Bison Infix Calc example.
https://www.gnu.org/software/bison/manual/bison.html#Infix-Calc
We now modify rpcalc to handle infix operators instead of postfix. Infix notation involves the concept of operator precedence and the need for parentheses nested to arbitrary depth.
3.1 BNF
| ‹input› | ::= | ‹input› ‹line› |
| ‹line› | ::= | \n |
|
| | | ‹exp› \n |
| ‹exp› | ::= | ‹number› |
|
| | | ‹exp› + ‹exp› |
|
| | | ‹exp› - ‹exp› |
|
| | | ‹exp› * ‹exp› |
|
| | | ‹exp› / ‹exp› |
|
| | | ‹exp› ^ ‹exp› |
|
| | | - ‹exp› |
|
| | | ( ‹exp› ) |
3.2 Lexer
Below is the full code for the lexer.
#lang racket (require parser-tools/lex (prefix-in : parser-tools/lex-sre)) (define-tokens value-tokens (NUMBER)) (define-empty-tokens op-tokens (EOF ADD SUBTRACT PRODUCT DIVISION POWER NEG OP CP NEWLINE)) (define next-token (lexer-src-pos [(eof) (token-EOF)] [(:+ (:& (:~ #\newline) whitespace)) (return-without-pos (next-token input-port))] [#\+ (token-ADD)] [#\- (token-SUBTRACT)] [#\* (token-PRODUCT)] [#\/ (token-DIVISION)] [#\^ (token-POWER)] ;[#\n (token-NEG)] [#\( (token-OP)] [#\) (token-CP)] [#\newline (token-NEWLINE)] [(:: (:+ numeric) (:* (:: #\. (:+ numeric) ))) (token-NUMBER (string->number lexeme))])) (provide value-tokens op-tokens next-token)
Changes from rpcalc: Newline is a token, whitespace without newline, and token for ’(’ and ’)’. Neg will not be used in lexer, but is defined because of use in parser later on.
3.3 Parser
Below is the full code for the parser.
#lang racket (require parser-tools/yacc "lexer.rkt") (define myparser (parser (start input) (end EOF) (tokens value-tokens op-tokens ) (src-pos) (error (lambda (a b c d e) (begin (printf "a = ~a\nb = ~a\nc = ~a\nd = ~a\ne = ~a\n" a b c d e) (void)))) (precs (left ADD SUBTRACT) (left PRODUCT DIVISION) (nonassoc NEG) (right POWER)) (grammar (input [() '()] [(input line) (append $1 $2)]) (line [(NEWLINE) '()] [(exp NEWLINE) (list $1)]) (exp [(NUMBER) $1] [(exp ADD exp) (+ $1 $3)] [(exp SUBTRACT exp) (- $1 $3)] [(exp PRODUCT exp) (* $1 $3)] [(exp DIVISION exp) (/ $1 $3)] [(SUBTRACT exp) (prec NEG) (- $2)] [(exp POWER exp) (expt $1 $3)] [(OP exp CP) $2])))) (define (parse ip) (port-count-lines! ip) (myparser (lambda () (next-token ip)))) (provide parse )
There are two important new features shown in this code.
In the precs section, left declares token kinds and says they are left-associative operators. And right (right associativity).
Operator precedence is determined by the line ordering of the declarations. The higher the line number of the declaration (lower on the page or screen), the higher the precedence. Hence, exponentiation has the highest precedence, unary minus (NEG) is next, followed by ‘*’ and ‘/’, and so on. Unary minus is not associative, only precedence matters.
The other important new feature is the prec in the grammar section for the unary minus operator. The prec simply instructs Yacc that the rule (SUBTRACT exp) has the same precedence as NEG. In this case the next-to-highest.
3.3.1 Testing the parser
> (parse (open-input-string "\n\n1 + 4*8 \n 6/10\n\n\n 5 + 6 +7 \n\n\n")) '(33 3/5 18)
> (parse (open-input-string "\n\n(1 + 4)*8 \n 6/10\n\n\n 5 + 6 +7 \n\n\n")) '(40 3/5 18)
> (parse (open-input-string "1 + 2^4 \n")) '(17)
3.4 Language
We wrap it up with making a calc language.
3.4.1 Reader
#lang racket (require "parser.rkt" ) (provide (rename-out [my-read read] [my-read-syntax read-syntax])) (define (my-read in) (syntax->datum (my-read-syntax #f in))) (define (my-read-syntax path port) (datum->syntax #f `(module rpcalc-mod racket ',(parse port))))
Note the quote at: ',(parse port)
3.4.2 Main
#lang racket (module reader racket (require "reader.rkt") (provide read read-syntax))
3.4.3 Running calc
#lang lex-yacc-example/calc 1 + 4 * 8 (1 + 4) * 8 6/10 2 ^ 4 + 100
The result should be '(33 40 3/5 116).
4 Multi-Function Calculator
Based on GNU Bison Multi-Function Calc example.
https://www.gnu.org/software/bison/manual/bison.html#Multi_002dfunction-Calc
Now that the basics of lexer and yacc have been discussed, it is time to move on to a more advanced problem. The above calculators provided only five functions, ‘+’, ‘-’, ‘*’, ‘/’ and ‘^’. It would be nice to have a calculator that provides other mathematical functions such as sin, cos, etc.
function_name (argument)
4.1 BNF
| ‹input› | ::= | ‹input› ‹line› |
| ‹line› | ::= | \n |
|
| | | ‹exp› \n |
|
| | | ‹var› = ‹exp› \n |
| ‹exp› | ::= | ‹number› |
|
| | | ‹var› |
|
| | | ‹fun› ( ‹exp› ) |
|
| | | ‹exp› + ‹exp› |
|
| | | ‹exp› - ‹exp› |
|
| | | ‹exp› * ‹exp› |
|
| | | ‹exp› / ‹exp› |
|
| | | ‹exp› ^ ‹exp› |
|
| | | - ‹exp› |
|
| | | ( ‹exp› ) |
4.2 Lexer
Below is the full code for the lexer.
#lang racket (require parser-tools/lex (prefix-in : parser-tools/lex-sre) "funs.rkt") (define-tokens value-tokens (NUMBER VAR FUN)) (define-empty-tokens op-tokens (EOF ADD SUBTRACT PRODUCT DIVISION POWER OP CP EQ NEG NEWLINE )) (define next-token (lexer-src-pos [(eof) (token-EOF)] [(:+ (:& (:~ #\newline) whitespace)) (return-without-pos (next-token input-port))] [#\+ (token-ADD)] [#\- (token-SUBTRACT)] [#\* (token-PRODUCT)] [#\/ (token-DIVISION)] [#\^ (token-POWER)] [#\( (token-OP)] [#\) (token-CP)] [#\= (token-EQ)] [#\newline (token-NEWLINE)] [(:: (:+ numeric) (:* (:: #\. (:+ numeric) ))) (token-NUMBER (string->number lexeme))] [(:: alphabetic (:* (:or alphabetic numeric))) (let ([sym (string->symbol lexeme)]) (if (fun? sym) (token-FUN sym) (token-VAR sym)))])) (provide value-tokens op-tokens next-token)
The lexer has to decide between FUN or VAR token. This is done by query if function is defined. The function fun? gives the answer. More about function and variable in next section.
4.3 funs.rkt
The new file where function and variable is handled.
#lang racket (provide get-fun fun? get-var set!-var ) (define funs (hasheq 'atan atan 'cos cos 'exp expt 'ln log 'sin sin 'sqrt sqrt)) (define vars (make-hash)) (define (fun? key) (hash-has-key? funs key)) (define (var? key) (hash-has-key? vars key)) (define (get-fun key) (if (fun? key) (hash-ref funs key) (error "fun: no such function. " key))) (define (get-var key) (if (var? key) (hash-ref vars key) (error "var: no such variable. " key))) (define (set!-var key val) (hash-set! vars key val))
Function is stored in immutable hash-table.
Variable is stored in mutable hash-table, the memory where variable’s is defined and changed.
4.4 Parser
Below is the full code for the parser.
#lang racket (require parser-tools/yacc "lexer.rkt") (define myparser (parser (start input) (end EOF) (tokens value-tokens op-tokens ) (src-pos) (error (lambda (a b c d e) (begin (printf "a = ~a\nb = ~a\nc = ~a\nd = ~a\ne = ~a\n" a b c d e) (void)))) (precs (left ADD SUBTRACT) (left PRODUCT DIVISION) (nonassoc NEG) (left POWER)) (grammar (input [() '()] [(input line) (append $1 $2)]) (line [(NEWLINE) '()] [(exp NEWLINE) (list $1)] [(VAR EQ exp NEWLINE) `((assign ',$1 ,$3))]) (exp [(NUMBER) $1] [(VAR) `(var ',$1)] [(FUN OP exp CP) `(fun ',$1 ,$3)] [(exp ADD exp) `(add ,$1 ,$3)] [(exp SUBTRACT exp) `(subtract ,$1 ,$3)] [(exp PRODUCT exp) `(product ,$1 ,$3)] [(exp DIVISION exp) `(division ,$1 ,$3)] [(SUBTRACT exp) (prec NEG) `(neg ,$2)] [(exp POWER exp) `(power ,$1 ,$3)] [(OP exp CP) $2])))) (define (parse ip) (port-count-lines! ip) (myparser (lambda () (next-token ip)))) (provide parse )
The parser generate s-expression’s for input to the expander. This is an important step forward. When program’s becomes more complicated it is better to handle this separate from the parser. In Racket it is common that parsing is in the reader and runnable code is in the expander.
4.5 Language
We will making a mfcalc language, using reader and expander.
4.5.1 Expander
#lang racket (require (for-syntax syntax/parse) "funs.rkt") (provide (rename-out [module-begin #%module-begin]) #%top-interaction #%app #%datum quote add subtract product division power neg fun assign var) (define-syntax (module-begin stx) (syntax-parse stx [(module-begin expr ...) #'(#%module-begin expr ...)])) (define (add x y) (+ x y)) (define (subtract x y) (- x y)) (define (product x y) (* x y)) (define (division x y) (/ x y)) (define (power x y) (expt x y)) (define (neg x) (- x)) (define (fun name x) ((get-fun name) x)) (define (assign varname value) (begin (set!-var varname value) value)) (define (var name) (get-var name))
4.5.2 Testing s-exp
The s-exp version of mfcalc can be tested by using the #lang s-exp declaration
#lang s-exp "expander.rkt" (add 1 2) (subtract (add 3 4) 5) (product (power 2 4) 10) (division (product (power 2 4) 10) 5) (product 100 (fun 'sqrt 25)) (assign 'x 50) (division (var 'x) (add 3 7)) (assign 'y 20) (add (var 'x) (var 'y)) (assign 'x 60) (add (var 'x) (var 'y))
> (require lex-yacc-example/mfcalc/s-exp-test)
3
2
160
32
500
50
5
20
70
60
80
4.5.3 Reader
The reader module are using "expander.rkt".
#lang racket (require "parser.rkt" ) (provide (rename-out [my-read read] [my-read-syntax read-syntax])) (define (my-read in) (syntax->datum (my-read-syntax #f in))) (define (my-read-syntax path port) (datum->syntax #f `(module rpcalc-mod "expander.rkt" ,@(parse port))))
4.5.4 Main
#lang racket (module reader racket (require "reader.rkt") (provide read read-syntax ))
4.5.5 Running mfcalc
File "mfcalc-test.rkt":
#lang lex-yacc-example/mfcalc 1 + 2 3 * 4 + 10 3 * (4 + 10) 100 + 2 * 5 -5 + 3.5 2^4 / 2 sqrt(25) sin( 0.5) sqrt(16) + sqrt(25) x = 10 + 5 y = 2.5 x * 1000 x - y ln(6)
> (require lex-yacc-example/mfcalc/mfcalc-test)
3
22
42
110
-1.5
8
5
0.479425538604203
9
15
2.5
15000
12.5
1.791759469228055
5 Conclusion
Congratulation! You are now a lexer and yacc ninja!
Some other sources:
The GNU Bison manual covers many topics of yacc/bison parser. Useful even if you can not program in C.
https://www.gnu.org/software/bison/manual/bison.html
The racket parser have two examples: calc.rkt and read.rkt
https://github.com/racket/parser-tools/tree/master/parser-tools-lib/parser-tools/examples