amb-parser: Parser generator for ambiguous grammars
#lang amb-parser | package: amb-parser |
1 Introduction
This is a very simple parser generator that mimics brag (and actually its built on top of it). It is designed for parsing English text.
Ambiguous grammar support. Parser generates all possible derivation trees.
Support for tokens that have multiple types attached. For example, the word cook in English may be either a verb or a noun.
I don’t know whether the implementation works correctly.
2 Parser language
Semicolon is required at the end of a production.
Terminals are written in lowercase, non-terminals are in uppercase.
Note: the parser does not support left-recursion. (Yet Another Parser that Doesn’t Support Left Recursion, yapdslr).
#lang amb-parser S : NP VP; NP : determiner? adjective* noun; VP : verb NP;
- Rule combinations:
S : NP VP;
- Optional rule:
NP : determiner? noun;
- Kleene star rule:
NP : adjective* noun;
- Alteration on the top level of an expansion:
A : b | c;
Note: the parser does not support grouping.
Note: you are allowed to combine question and star operators only in the order: rule*?.
2.1 Formal parser language definition
#lang brag parser-syntax : rule+ rule : NON-TERMINAL COLON alteration SEMICOLON alteration : expansion (ALTERATION expansion)* expansion : (non-terminal | terminal)* non-terminal : NON-TERMINAL [STAR] [QUESTION] terminal : TERMINAL [STAR] [QUESTION]
3 Parser interface
Firstly, import amb-parser in your Racket module.
- Then create a file with
#lang amb-parser
line and with your grammar and require it in your Racket module.
You will be provided with the parse function. You can use this function to parse your text. But before, your text should be separated into tokens.
For more details, refer to the parse documentation.
4 Usage example
#lang amb-parser S : NP VP; NP : determiner? adjective* noun; VP : verb NP;
(list (token "the" '(determiner)) (token "big" '(adjective)) (token "cat" '(noun)) (token "catched" '(verb)) (token "a" '(determiner)) (token "small" '(adjective)) (token "grey" '(adjective)) (token "mouse" '(noun)))
(list (parser-result '(S (NP (determiner "the") (adjective* (adjective "big") (adjective*)) (noun "cat")) (VP (verb "catched") (NP (determiner "a") (adjective* (adjective "small") (adjective* (adjective "grey") (adjective*))) (noun "mouse")))) '()))
Note how the star rule works.
5 Parse tree structure
In this section we will describe the structure of the parse tree without specifying that the parser returns multiple results and every parser-result contains the rest of tokens.
Every rule name : expansion; will be parsed as (name expansion).
Every expansion rule1 rule2 will be parsed as an association list, where keys and values are organized as list with two elements, keys are rule names and values are the result of parsing the rules.
Every terminal terminal will be parsed as (list ’terminal token-str) where token-str is the string of the matched token.
If some parsing is failed then the parse returns empty (meaning there is no derivation trees and no parser results).
6 Types and functions
6.1 Tokens
pos is a list of all possible parts of speech for this token.
When parser tries to parse a terminal rule it checks whether the terminal is a member of pos list.
Note: in fact, the implementation does not really utilizes the info that the str is a string?.
6.2 Parser result
struct
(struct parser-result (data rest) #:transparent) data : any/c rest : (listof token?)
procedure
(parser-result? x) → boolean?
x : any/c
procedure
(parser-result-data res) → any/c
res : parser-result?
procedure
(parser-result-rest res) → (listof token?)
res : parser-result?
6.3 Parsing
procedure
(parse non-terminal tokens) → (listof parser-result?)
non-terminal : symbol? tokens : (listof token?)
Parser returns a list of all possible derivation trees. The results are organized into parser-result structures.
WARNING: There is a bad design desicion. The parse function requires a list of tokens, instead of a lambda function that generates tokens on demand.