2.2 Token Parsing

8.18.0.10

The tokens used for grouping and indentation are distinct from other categories:

( ) [ ] { } ' ; , : | « » \

Other tokens are described by the grammar below, where a star (★) in the left column indicates the productions that correspond to terms or comments.

Numbers are supported directly in simple forms—decimal integers, decimal floating point, hexadecimal/octal/binary integers, and fractions—in all cases allowing _s between digits. A #{…} escape provides access to the full Racket S-expression number grammar. Special floating-point values use a # notation: #inf, #neginf, and #nan.
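As a rough illustration of the simple number forms, the following Python sketch builds a regular expression that mirrors the number productions in the grammar later in this section. The names `NUMBER_RE` and `is_simple_number` are hypothetical, and the sketch deliberately ignores `#{…}` escapes, the delimiter requirement, and the `.` and sign disambiguation rules described below.

```python
import re

# Digit runs that allow a single _ between digits, as in 1_000.
DEC = r"[0-9](?:_?[0-9])*"
HEX = r"[0-9a-fA-F](?:_?[0-9a-fA-F])*"
OCT = r"[0-7](?:_?[0-7])*"
BIN = r"[01](?:_?[01])*"
SIGN = r"[+-]?"
EXP = rf"[eE]{SIGN}{DEC}"

# One alternative per simple number form (special floats, radix integers,
# fractions, floats, and plain integers).
NUMBER_RE = re.compile(
    rf"""
    (?: \#inf | \#neginf | \#nan
      | {SIGN}0x{HEX}
      | {SIGN}0o{OCT}
      | {SIGN}0b{BIN}
      | {SIGN}{DEC}/{DEC}
      | {SIGN}{DEC}\.(?:{DEC})?(?:{EXP})?
      | {SIGN}\.{DEC}(?:{EXP})?
      | {SIGN}{DEC}{EXP}
      | {SIGN}{DEC}
    )
    """,
    re.VERBOSE,
)


def is_simple_number(text: str) -> bool:
    """Check whether text, taken in isolation, has a simple number shape."""
    if NUMBER_RE.fullmatch(text) is None:
        return False
    if "/" in text:
        # The denominator of a fraction must not be zero.
        denominator = text.split("/", 1)[1].replace("_", "")
        return int(denominator) != 0
    return True
```

Under these assumptions, `1_000`, `-0b1010`, `1/2`, and `#inf` are accepted, while `1__0` and `1/0` are rejected.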

Boolean literals are #true and #false. The void value is #void.

Identifiers are formed from Unicode alphanumeric characters plus _ and emoji sequences, where the initial character must not be a numeric character (unless that numeric character starts an emoji sequence, as in 1 followed by U+FE0F and U+20E3). An identifier can also be prefixed with #%; such identifiers are intended for “internal” names that are not normally visible. An identifier prefixed with ~ (and without #%) forms a keyword, analogous to prefixing an identifier with #: in Racket.
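The identifier and keyword shapes can be sketched in Python as follows. This is only an approximation: it uses Python's `str.isalpha` and `str.isnumeric` as stand-ins for the Unicode alphabetic and numeric classes, it does not handle emoji sequences at all, and the function names are hypothetical.

```python
def is_alpha(ch: str) -> bool:
    # Approximation of "alphabetic Unicode character or _".
    # Emoji sequences are NOT handled in this sketch.
    return ch == "_" or ch.isalpha()


def is_alphanum(ch: str) -> bool:
    return is_alpha(ch) or ch.isnumeric()


def is_plain_ident(text: str) -> bool:
    # The first character must not be numeric; the rest may be.
    return (
        bool(text)
        and is_alpha(text[0])
        and all(is_alphanum(ch) for ch in text[1:])
    )


def classify_name(text: str) -> str:
    """Return "identifier", "keyword", or "other" for a candidate token."""
    if text.startswith("#%") and is_plain_ident(text[2:]):
        return "identifier"  # internal-style name
    if text.startswith("~") and is_plain_ident(text[1:]):
        return "keyword"  # ~ applies only to names without #%
    if is_plain_ident(text):
        return "identifier"
    return "other"
```

For example, `classify_name("#%call")` gives `"identifier"`, `classify_name("~key")` gives `"keyword"`, and `classify_name("1x")` gives `"other"`.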

Operators are formed from Unicode symbolic and punctuation characters other than the ones listed above as distinct tokens (plus a few more, like ", ', and single-character emoji sequences), but | or : is also allowed in an operator name as long as it is not by itself, and some # combinations like #' and #, are also operators. A multi-character operator cannot end in :, since that creates an ambiguity with an operator just before a block, except that a sequence containing only : is allowed. A multi-character operator can end with / only when followed by a character other than / or *, and an operator cannot contain // or /*; those constraints avoid ambiguities with comments.
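The naming constraints in this paragraph can be collected into a small Python predicate. The sketch assumes the caller has already gathered a run of operator characters and knows the character that follows it; it does not check the Unicode character classes themselves. The name `operator_name_ok` and its parameters are hypothetical.

```python
def operator_name_ok(name: str, next_char: str = "") -> bool:
    """Check the shape constraints on a candidate operator name.

    name is a run of operator characters; next_char is the character
    that follows it in the input ("" at end of input).
    """
    if not name:
        return False
    # |, :, and ~ by themselves are not operators (see the grammar's
    # side condition on the operator production).
    if name in ("|", ":", "~"):
        return False
    # Avoid ambiguity with comments.
    if "//" in name or "/*" in name:
        return False
    # A sequence containing only : is allowed.
    if set(name) == {":"}:
        return True
    # Otherwise a multi-character operator cannot end in :.
    if len(name) > 1 and name.endswith(":"):
        return False
    # A multi-character operator can end with / only when the next
    # character is not / or *.
    if len(name) > 1 and name.endswith("/") and next_char in ("/", "*"):
        return False
    return True
```

With this sketch, `::` and `|>` pass, while `+:` and a lone `|` do not.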

Implicit in the grammar is the usual convention of choosing the largest possible match at the start of a stream. Not reflected in the grammar is a set of delimiter requirements: numbers, #true, and #false must be followed by a delimiter. For example, 1x is a lexical error, because the x after 1 is not a delimiter. Non-alphanumeric characters other than _ are delimiters.
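The delimiter requirement can be expressed as a check on the character that follows a matched token. The Python sketch below uses hypothetical names and treats end of input as a delimiter, which is an assumption not stated in the text; `str.isalnum` stands in for the Unicode alphanumeric class.

```python
def is_delimiter(ch: str) -> bool:
    # Non-alphanumeric characters other than _ are delimiters.
    return not (ch.isalnum() or ch == "_")


def followed_by_delimiter(source: str, end: int) -> bool:
    """True if the token ending just before index end is properly delimited."""
    # Assumption: running out of input counts as a delimiter.
    return end >= len(source) or is_delimiter(source[end])
```

For the input `1x`, the number `1` ends at index 1 and `followed_by_delimiter("1x", 1)` is false, which corresponds to the lexical error described above.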

Certain ambiguities related to number and operator parsing are resolved by special rules. A number ends with a trailing . only if the . cannot be treated as the start of a multi-character operator; also, a . that is not part of a multi-character operator cannot appear after a number. The + and - characters as a number prefix versus an operator are similarly treated as part of a multi-character operator when possible, and they are subject to one additional rule: they are parsed as a single-character operator when immediately preceded by an alphanumeric character, _, ., ), ], or } with no whitespace in between. For example, 1+2 is 1 plus 2, but 1 +2 is 1 followed by the number +2.
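The additional rule for + and - depends only on the character immediately before the sign. A minimal Python sketch, with a hypothetical function name and with `str.isalnum` standing in for the Unicode alphanumeric class, might look like this; it does not cover the preference for multi-character operators.

```python
def sign_is_operator(prev_char: str) -> bool:
    """Decide how a + or - is read, given the character just before it.

    Returns True when the sign is a single-character operator, and False
    when it may instead be read as a number prefix. Pass "" when the sign
    is at the start of the input.
    """
    if prev_char == "":
        return False
    return prev_char.isalnum() or prev_char in "_.)]}"
```

In `1+2` the + is preceded by `1`, so it is an operator; in `1 +2` it is preceded by a space, so `+2` can be read as a number.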

When a #{…} escape describes an identifier S-expression, it is an identifier in the same sense as a shrubbery-notation identifier. The same holds for numbers, booleans, strings, byte strings, and keywords. A #{…} escape must not describe a pair, because pairs are used to represent a parsed shrubbery, and allowing pairs would create ambiguous or ill-formed representations. The ~#{…} shorthand always produces a keyword, where the content of ~#{…} must be an S-expression identifier that is converted to a keyword.

Lines and indentation-influencing whitespace are not represented as tokens. Instead, each token conceptually has a line and column derived from its position in the input sequence of characters. The line for an input sequence increments at a linefeed character (code point 0x0A), a two-character sequence of return (code point 0x0D) and linefeed, or a return character that is not followed by a linefeed character. The column of an input sequence for measuring indentation increments once per Unicode grapheme cluster, except that tabs are treated specially. Note that counting by grapheme clusters differs from the column counting built into a Racket or Rhombus input port, which counts by Unicode code points. More generally, a column corresponds to a sequence of spaces and tabs, where all non-tab grapheme clusters are treated like a space. A column is more indented than another only if it extends the other column’s sequence. When neither of two columns is a prefix of the other, then the columns are incomparable; if parsing depends on an order between incomparable columns, then it fails with a “mix tabs” error.
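The line and column rules can be sketched in Python as shown below. The function names are hypothetical, and `column_of` assumes one character per grapheme cluster, which is a simplification; real input needs proper grapheme segmentation.

```python
def count_line_breaks(text: str) -> int:
    """Count line increments: LF, CR LF as a pair, or a lone CR."""
    count = 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\r":
            # A return followed by a linefeed counts once, as a pair.
            if i + 1 < len(text) and text[i + 1] == "\n":
                i += 1
            count += 1
        elif ch == "\n":
            count += 1
        i += 1
    return count


def column_of(prefix: str) -> str:
    """Map the text before a token on its line to a space/tab sequence."""
    # Simplification: each character is treated as one grapheme cluster.
    return "".join("\t" if ch == "\t" else " " for ch in prefix)


def compare_columns(a: str, b: str) -> str:
    """Compare two columns given as space/tab sequences."""
    if a == b:
        return "same"
    if a.startswith(b):
        return "more"  # a extends b, so a is more indented
    if b.startswith(a):
        return "less"
    return "incomparable"  # a parser that needs an order reports "mix tabs"
```

A column made of one space and a column made of one tab are incomparable under this model, since neither sequence is a prefix of the other.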

The table below describes the shape of @ forms; for more details on @ parsing, see At-Notation Parsing.

★	‹identifier›	::=	‹plainident›
		\|	#% ‹plainident›

★	‹plainident›	::=	‹alpha› ‹alphanum›*

	‹alpha›	::=	alphabetic Unicode character or _
		\|	Unicode emoji sequence

	‹alphanum›	::=	‹alpha›
		\|	numeric Unicode character

★	‹keyword›	::=	~ ‹plainident›

★	‹operator›	::=	‹opchar›* ‹tailopchar›	not \|, :, ~, ...
		\|	: :+	... or containing // ...
		\|	# ‹hashopchar›	... or containing /*

	‹opchar›	::=	symbolic Unicode character not in ‹special›
		\|	punctuation Unicode character not in ‹special›
		\|	one of : \|

	‹tailopchar›	::=	anything in ‹opchar› except :	not / followed by / or *

	‹hashopchar›	::=	one of ', ,, ;, :, \|

	‹special›	::=	one of (, ), [, ], {, }, ', «, »
		\|	one of ", ;, ,, #, \, _, @
		\|	single-character Unicode emoji sequence

★	‹number›	::=	‹integer›
		\|	‹float›
		\|	‹hexinteger›
		\|	‹octalinteger›
		\|	‹binaryinteger›
		\|	‹fraction›

	‹integer›	::=	‹sign›? ‹nonneg›

	‹sign›	::=	one of + or -

	‹nonneg›	::=	‹decimal› ‹usdecimal›*

	‹decimal›	::=	one of 0 through 9

	‹usdecimal›	::=	‹decimal›
		\|	_ ‹decimal›

	‹float›	::=	‹sign›? ‹nonneg› . ‹nonneg›? ‹exp›?
		\|	‹sign›? . ‹nonneg› ‹exp›?
		\|	‹sign›? ‹nonneg› ‹exp›
		\|	#inf
		\|	#neginf
		\|	#nan

	‹exp›	::=	e ‹sign›? ‹nonneg›
		\|	E ‹sign›? ‹nonneg›

	‹hexinteger›	::=	‹sign›? 0x ‹hex› ‹ushex›*

	‹hex›	::=	one of 0 through 9
		\|	one of a through f
		\|	one of A through F

	‹ushex›	::=	‹hex›
		\|	_ ‹hex›

	‹octalinteger›	::=	‹sign›? 0o ‹octal› ‹usoctal›*

	‹octal›	::=	one of 0 through 7

	‹usoctal›	::=	‹octal›
		\|	_ ‹octal›

	‹binaryinteger›	::=	‹sign›? 0b ‹bit› ‹usbit›*

	‹bit›	::=	one of 0 or 1

	‹usbit›	::=	‹bit›
		\|	_ ‹bit›

	‹fraction›	::=	‹integer› / ‹nonneg›	‹nonneg› not 0

★	‹boolean›	::=	#true
		\|	#false

★	‹void›	::=	#void

★	‹string›	::=	" ‹strelem›* "

	‹strelem›	::=	like Racket, but no literal newline	\U ≤ 6 digits

★	‹bytestring›	::=	#" ‹bytestrelem›* "

	‹bytestrelem›	::=	like Racket, but no literal newline

★	‹sexpression›	::=	#{ ‹racket› }
		\|	~#{ ‹racket-identifier› }

	‹racket›	::=	any non-pair Racket S-expression

★	‹comment›	::=	// ‹nonnlchar›*
		\|	/* ‹anychar›* */	nesting allowed
		\|	@// ‹nonnlchar›*	only within ‹text›
		\|	@// ‹atopen› ‹anychar›* ‹atclose›	only within ‹text›
		\|	#! ‹nonnlchar›* ‹continue›*

	‹nonnlchar›	::=	any character other than newline

	‹continue›	::=	\ ‹nonnlchar›*

★	‹atexpression›	::=	@ ‹command› ‹arguments›? ‹text›*	no space between parts
		\|	@ ‹text›*	no space between parts
		\|	@ ‹splice›	no space between parts

	‹command›	::=	‹prefix›* ‹identifier›	no space between parts
		\|	‹keyword›
		\|	‹operator›
		\|	‹number›
		\|	‹boolean›
		\|	‹string›
		\|	‹bytestring›
		\|	‹racket›
		\|	( ‹group›* )	usual ,-separated
		\|	[ ‹group›* ]	usual ,-separated
		\|	« ‹group› »

	‹splice›	::=	(« ‹group› »)

	‹prefix›	::=	‹identifier› ‹operator›	no space between parts

	‹arguments›	::=	( ‹group›* )	optional ,-separated

	‹text›	::=	‹atopen› ‹text› ‹atclose›	escapes in ‹text›

	‹atopen›	::=	{
		\|	\| ‹asciisym›* {

	‹atclose›	::=	}
		\|	} ‹asciisym›* \|	flips opener chars
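The grammar notes that block comments may nest. One way to find the end of a nested /* … */ comment is to track nesting depth, as in the Python sketch below. The function name is hypothetical, and the sketch assumes that every /* and */ inside the comment counts toward nesting, without any special treatment of strings or other tokens.

```python
def skip_block_comment(source: str, start: int) -> int:
    """Return the index just past the block comment that opens at start."""
    if not source.startswith("/*", start):
        raise ValueError("no block comment at this position")
    depth = 0
    i = start
    while i < len(source):
        if source.startswith("/*", i):
            depth += 1  # entering a (possibly nested) comment
            i += 2
        elif source.startswith("*/", i):
            depth -= 1  # leaving one nesting level
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated block comment")
```

For the input `/* a /* b */ c */ x`, the returned index points at the text ` x` that follows the outer comment.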
