3 Token Parsing
The tokens used for grouping and indentation are distinct from other categories:
( ) [ ] { } ' ; , : | « » \
Other tokens are described by the grammar below, where a star (★) in the left column indicates the productions that correspond to terms or comments.
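Numbers are supported directly in simple forms: decimal integers, decimal floating-point numbers, hexadecimal, octal, and binary integers, and fractions, as given by the ‹number› productions in the grammar below.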
Boolean literals are #true and #false. The void value is #void.
Identifiers are formed from Unicode alphanumeric characters plus _ and emoji sequences, where the initial character must not be a numeric character (unless that numeric character starts an emoji sequence, as in 1 followed by U+FE0F and U+20E3). An identifier can also be prefixed with #%; such identifiers are intended for “internal” names that are not normally visible. An identifier prefixed with ~ (and without #%) forms a keyword, analogous to prefixing an identifier with #: in Racket.
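The identifier and keyword rules can be approximated with a regular expression. The following is a minimal sketch, not the reader's actual implementation: the name `classify_name` is hypothetical, emoji sequences are not handled, and Python's `\w` and `\d` classes only approximate the Unicode alphabetic and numeric categories that the grammar refers to.

```python
import re

# Simplified ‹plainident›: a non-digit word character followed by word characters.
# Assumption: emoji sequences are ignored in this sketch.
_PLAIN = r"[^\W\d]\w*"

# A keyword is ~ plus a plain identifier (no #%); an identifier may carry a #% prefix.
_NAME = re.compile(rf"(?P<keyword>~{_PLAIN})|(?P<identifier>(?:#%)?{_PLAIN})")


def classify_name(text):
    """Return "keyword", "identifier", or None for a complete token string."""
    match = _NAME.fullmatch(text)
    return match.lastgroup if match else None
```

Under these assumptions, `classify_name("#%call")` reports an identifier, while `classify_name("~#%x")` reports no match, because a keyword cannot include the #% prefix.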
Operators are formed from Unicode symbolic and punctuation characters other than the ones listed above as distinct tokens (plus a few more, like ", ', and single-character emoji sequences), but | or : is also allowed in an operator name as long as it is not by itself, and some # combinations like #' and #, are also operators. A multi-character operator cannot end in :, since that creates an ambiguity with an operator just before a block, except that a sequence containing only : is allowed. A multi-character operator can end with / only when followed by a character other than / or *, and an operator cannot contain // or /*; those constraints avoid ambiguities with comments.
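The operator constraints above can be collected into a single check on a candidate operator name. This is a hedged sketch: `is_opchar` and `is_operator` are hypothetical names, single-character emoji sequences are not excluded, and the context-dependent rule about a trailing / (which depends on the character that follows the operator) is not modeled.

```python
import unicodedata

# Characters that are distinct tokens or otherwise excluded from operators.
SPECIAL = set("()[]{}'«»\";,#\\_@")
# The # combinations that count as operators.
HASH_OPS = {"#'", "#,", "#;", "#:", "#|"}


def is_opchar(ch):
    """True for : and |, and for symbolic or punctuation characters not in SPECIAL."""
    if ch in ":|":
        return True
    if ch in SPECIAL:
        return False
    return unicodedata.category(ch)[0] in ("S", "P")


def is_operator(text):
    """Check a complete candidate operator name against the rules in the text."""
    if text in HASH_OPS:
        return True
    if not text or text in ("|", ":", "~"):
        return False  # these are not operators by themselves
    if not all(is_opchar(ch) for ch in text):
        return False
    if "//" in text or "/*" in text:
        return False  # would be ambiguous with comments
    if text.endswith(":"):
        return set(text) == {":"}  # only an all-colon sequence may end in :
    return True
```

For instance, `is_operator("::")` holds because the name contains only colons, whereas `is_operator("+:")` does not, since a trailing : would be ambiguous with an operator just before a block.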
Implicit in the grammar is the usual convention of choosing the largest possible match at the start of a stream. Not reflected in the grammar is a set of delimiter requirements: numbers, #true, and #false must be followed by a delimiter. For example, 1x is a lexical error, because the x after 1 is not a delimiter. Non-alphanumeric characters other than _ are delimiters.
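The delimiter requirement can be expressed as a check on the character that follows a candidate token. The sketch below uses hypothetical helper names and assumes that the end of input also counts as a delimiter, which the text does not state explicitly.

```python
def is_delimiter(ch):
    """Non-alphanumeric characters other than _ are delimiters."""
    return not (ch.isalnum() or ch == "_")


def followed_by_delimiter(source, end):
    """Check the character just after a token that ends at index `end` (exclusive).

    Assumption: reaching the end of the input is treated as a delimiter.
    """
    return end >= len(source) or is_delimiter(source[end])
```

With this check, the `1` in `1x` fails the requirement because `x` is alphanumeric, while the `1` in `1+2` passes because `+` is a delimiter.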
Certain ambiguities related to number and operator parsing are resolved by special rules. A number ends with a trailing . only if the . cannot be treated as the start of a multi-character operator; also, a . that is not part of a multi-character operator cannot appear after a number. The + and - characters, which can be either a number prefix or an operator, are similarly treated as part of a multi-character operator when possible, and they are subject to one additional rule: they are parsed as a single-character operator when immediately preceded by an alphanumeric character, _, ., ), ], or } with no whitespace in between. For example, 1+2 is 1 plus 2, but 1 +2 is 1 followed by the number +2.
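The additional rule for + and - depends only on the character immediately before the sign. The following sketch covers that one rule; the function name is hypothetical, and the preference for multi-character operators is deliberately left out.

```python
def sign_is_operator(source, index):
    """Return True when the + or - at `index` must be a single-character operator.

    This applies when the sign is immediately preceded, with no whitespace,
    by an alphanumeric character, _, ., ), ], or }.
    """
    if index == 0:
        return False  # nothing precedes the sign
    prev = source[index - 1]
    return prev.isalnum() or prev in "_.)]}"
```

Applied to the examples in the text, the `+` in `1+2` is forced to be an operator, while the `+` in `1 +2` is preceded by a space and so remains available as a number prefix.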
When a #{…} escape describes an identifier S-expression, it is an identifier in the same sense as a shrubbery-notation identifier. The same holds for numbers, booleans, strings, byte strings, and keywords. A #{…} escape must not describe a pair, because pairs are used to represent a parsed shrubbery, and allowing pairs would create ambiguous or ill-formed representations.
Lines and indentation-influencing whitespace are not represented as tokens. Instead, each token conceptually has a line and column derived from its position in the input sequence of characters. The line for an input sequence increments at a linefeed character (code point 0x0A), a two-character sequence of return (code point 0x0D) and linefeed, or a return character that is not followed by a linefeed character. The column of an input sequence for measuring indentation increments once per Unicode grapheme cluster, except that tabs are treated specially. Note that the use of grapheme clusters is a different counting of columns than built into a Racket or Rhombus input port, which counts by Unicode code points. More generally, a column corresponds to a sequence of spaces and tabs, where all non-tab grapheme clusters are treated like a space. A column is more indented than another only if it extends the other column’s sequence. When neither of two columns is a prefix of the other, then the columns are incomparable; if parsing depends on an order between incomparable columns, then it fails with a “mix tabs” error.
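The line-counting and column-comparison rules can be sketched as follows. All function names here are hypothetical. Because the Python standard library does not segment grapheme clusters, `column_shape` assumes the caller supplies the clusters that precede a token on its line; passing a plain string falls back to code points, which is only an approximation.

```python
def count_line_breaks(text):
    """Count line increments: LF, CR LF (counted once), or a lone CR."""
    count = 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\n":
            count += 1
        elif ch == "\r":
            count += 1
            if i + 1 < len(text) and text[i + 1] == "\n":
                i += 1  # CR LF is a single line break
        i += 1
    return count


def column_shape(clusters):
    """Map the clusters before a token to a sequence of spaces and tabs."""
    return "".join("\t" if c == "\t" else " " for c in clusters)


def compare_columns(a, b):
    """Return 1 if shape a is more indented than b, -1 if less, 0 if equal.

    Incomparable columns (neither is a prefix of the other) raise an error.
    """
    if a == b:
        return 0
    if a.startswith(b):
        return 1
    if b.startswith(a):
        return -1
    raise ValueError("mix tabs")
```

Representing a column as its space-and-tab sequence, rather than as a number, is what makes a tab-indented line and a space-indented line incomparable instead of silently ordered.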
For more details on @ parsing, see At-Notation Parsing, but the table below describes the shape of @ forms.
★ | ‹identifier› | ::= | ‹plainident› | |
| | #% ‹plainident› | |||
| ||||
★ | ‹plainident› | ::= | ‹alpha› ‹alphanum›* | |
| ||||
‹alpha› | ::= | alphabetic Unicode character or _ | ||
| | Unicode emoji sequence | |||
| ||||
‹alphanum› | ::= | ‹alpha› | ||
| | numeric Unicode character | |||
| ||||
★ | ‹keyword› | ::= | ~ ‹plainident› | |
| ||||
★ | ‹operator› | ::= | ‹opchar›* ‹tailopchar› | not |, :, ~, ... |
| | : :+ | ... or containing // ... | ||
| | # ‹hashopchar› | ... or containing /* | ||
| ||||
‹opchar› | ::= | symbolic Unicode character not in ‹special› | ||
| | punctuation Unicode character not in ‹special› | |||
| | one of : | | |||
| ||||
‹tailopchar› | ::= | anything in ‹opchar› except : | not / followed by / or * | |
| ||||
‹hashopchar› | ::= | one of ', ,, ;, :, | | ||
| ||||
‹special› | ::= | one of (, ), [, ], {, }, ', «, » | ||
| | one of ", ;, ,, #, \, _, @ | |||
| | single-character Unicode emoji sequence | |||
| ||||
★ | ‹number› | ::= | ‹integer› | |
| | ‹float› | |||
| | ‹hexinteger› | |||
| | ‹octalinteger› | |||
| | ‹binaryinteger› | |||
| | ‹fraction› | |||
| ||||
‹integer› | ::= | ‹sign›? ‹nonneg› | ||
| ||||
‹sign› | ::= | one of + or - | ||
| ||||
‹nonneg› | ::= | ‹decimal› ‹usdecimal›* | ||
| ||||
‹decimal› | ::= | one of 0 through 9 | ||
| ||||
‹usdecimal› | ::= | ‹decimal› | ||
| | _ ‹decimal› | |||
| ||||
‹float› | ::= | ‹sign›? ‹nonneg› . ‹nonneg›? ‹exp›? | ||
| | ‹sign›? . ‹nonneg› ‹exp›? | |||
| | ‹sign›? ‹nonneg› ‹exp› | |||
| | #inf | |||
| | #neginf | |||
| | #nan | |||
| ||||
‹exp› | ::= | e ‹sign›? ‹nonneg› | ||
| | E ‹sign›? ‹nonneg› | |||
| ||||
‹hexinteger› | ::= | ‹sign›? 0x ‹hex› ‹ushex›* | ||
| ||||
‹hex› | ::= | one of 0 through 9 | ||
| | one of a through f | |||
| | one of A through F | |||
| ||||
‹ushex› | ::= | ‹hex› | ||
| | _ ‹hex› | |||
| ||||
‹octalinteger› | ::= | ‹sign›? 0o ‹octal› ‹usoctal›* | ||
| ||||
‹octal› | ::= | one of 0 through 7 | ||
| ||||
‹usoctal› | ::= | ‹octal› | ||
| | _ ‹octal› | |||
| ||||
‹binaryinteger› | ::= | ‹sign›? 0b ‹bit› ‹usbit›* | ||
| ||||
‹bit› | ::= | one of 0 or 1 | ||
| ||||
‹usbit› | ::= | ‹bit› | ||
| | _ ‹bit› | |||
| ||||
‹fraction› | ::= | ‹integer› / ‹nonneg› | ‹nonneg› not 0 | |
| ||||
★ | ‹boolean› | ::= | #true | |
| | #false | |||
| ||||
★ | ‹void› | ::= | #void | |
| ||||
★ | ‹string› | ::= | " ‹strelem›* " | |
| ||||
‹strelem› | ::= | like Racket, but no literal newline | \U ≤ 6 digits | |
| ||||
★ | ‹bytestring› | ::= | #" ‹bytestrelem›* " | |
| ||||
‹bytestrelem› | ::= | like Racket, but no literal newline | ||
| ||||
★ | ‹sexpression› | ::= | #{ ‹racket› } | |
| ||||
‹racket› | ::= | any non-pair Racket S-expression | ||
| ||||
★ | ‹comment› | ::= | // ‹nonnlchar›* | |
| | /* ‹anychar›* */ | nesting allowed | ||
| | @// ‹nonnlchar›* | only within ‹text› | ||
| | @// ‹atopen› ‹anychar›* ‹atclose› | only within ‹text› | ||
| | #! ‹nonnlchar›* ‹continue›* | |||
| ||||
‹nonnlchar› | ::= | any character other than newline | ||
| ||||
‹continue› | ::= | \ ‹nonnlchar›* | ||
| ||||
★ | ‹atexpression› | ::= | @ ‹command› ‹arguments›? ‹text›* | no space between parts |
| | @ ‹text›* | no space between parts | ||
| | @ ‹splice› | no space between parts | ||
| ||||
‹command› | ::= | ‹prefix›* ‹identifier› | no space between parts | |
| | ‹keyword› | |||
| | ‹operator› | |||
| | ‹number› | |||
| | ‹boolean› | |||
| | ‹string› | |||
| | ‹bytestring› | |||
| | ‹racket› | |||
| | ( ‹group›* ) | usual ,-separated | ||
| | [ ‹group›* ] | usual ,-separated | ||
| | « ‹group› » | |||
| ||||
‹splice› | ::= | (« ‹group› ») | ||
| ||||
‹prefix› | ::= | ‹identifier› ‹operator› | no space between parts | |
| ||||
‹arguments› | ::= | ( ‹group›* ) | optional ,-separated | |
| ||||
‹text› | ::= | ‹atopen› ‹text› ‹atclose› | escapes in ‹text› | |
| ||||
‹atopen› | ::= | { | ||
| | | ‹asciisym›* { | |||
| ||||
‹atclose› | ::= | } | ||
| | } ‹asciisym›* | | flips opener chars |
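As an illustration of how the ‹number› productions in the grammar fit together, they can be approximated by one regular expression. This is a sketch under stated assumptions, not the reader's implementation: `is_number` is a hypothetical name, the delimiter requirement and the special rules for a trailing . and for sign prefixes are not modeled, and the "‹nonneg› not 0" side condition on fractions is interpreted as requiring a nonzero denominator value.

```python
import re

# A decimal digit followed by further digits, each optionally preceded by one _.
_NONNEG = r"[0-9](?:_?[0-9])*"
_EXP = rf"[eE][+-]?{_NONNEG}"

_NUMBER = re.compile(
    rf"""
      [+-]?0x[0-9a-fA-F](?:_?[0-9a-fA-F])*          # hexinteger
    | [+-]?0o[0-7](?:_?[0-7])*                      # octalinteger
    | [+-]?0b[01](?:_?[01])*                        # binaryinteger
    | [+-]?{_NONNEG}/{_NONNEG}                      # fraction
    | [+-]?{_NONNEG}\.(?:{_NONNEG})?(?:{_EXP})?     # float with leading digits
    | [+-]?\.{_NONNEG}(?:{_EXP})?                   # float starting with .
    | [+-]?{_NONNEG}(?:{_EXP})?                     # integer, or float with exponent
    | \#inf | \#neginf | \#nan                      # special floats
    """,
    re.VERBOSE,
)


def is_number(text):
    """Check whether a complete token string matches the number grammar."""
    if not _NUMBER.fullmatch(text):
        return False
    if "/" in text:
        # Assumption: "not 0" means the denominator's value is nonzero.
        denominator = text.split("/")[1].replace("_", "")
        return int(denominator) != 0
    return True
```

Under these assumptions, `is_number("1_000")` and `is_number("0xFF_ff")` hold, while `is_number("1__0")` does not, because each _ must be followed directly by a digit.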