12.2.1 Regexp Patterns

8.18.0.9

See also Regexp Quick Reference.

The portion of a rx or rx_in form within '…' is a pattern that is written with regexp pattern operators. Some pattern operators overlap with expression operators, but they have different meanings and precedence in a pattern. For example, the pattern operator * creates a repetition pattern, instead of multiplying like the expression * operator.

space

The space for pattern operators that can be used within rx and rx_in forms.

regexp operator

#%literal string

regexp operator

#%literal bytes

~stronger_than: ~other

A literal string or byte string can be used as a pattern. It matches the string’s characters or bytes literally. See also case_insensitive.

> rx'"hello"'.match("hello")
RXMatch("hello", [], {})
> rx'"hello"'.match("olleh")
#false
> rx'#"a"'.match(#"a")
RXMatch(Bytes.copy(#"a"), [], {})

regexp operator

pat #%juxtapose pat

regexp operator

pat ++ pat

regexp operator

pat #%call (pat)

~order: rx_concatenation

Patterns that are adjacent in a larger pattern match in sequence. The ++ operator can be used to make sequencing explicit. An implicit #%call form is treated like #%juxtapose, consistent with implicit uses of parentheses for grouping as handled by #%parens.

> rx'"hello" " " "world"'.match("hello world")
RXMatch("hello world", [], {})
> rx'"hello" ++ " " ++ "world"'.match("hello world")
RXMatch("hello world", [], {})
> rx'"hello"
       ++ " "
       ++ "world"'.match("hello world")
RXMatch("hello world", [], {})
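
As a small sketch of the implicit #%call case described above, a parenthesized pattern written directly after another pattern should simply sequence with it, with or without a space. The inputs are made up for illustration, and the outputs assume the same printing conventions as the surrounding examples.

```rhombus
> rx'"a"("b" || "c")'.match("ac")
RXMatch("ac", [], {})
> rx'"a" ("b" || "c")'.match("ab")
RXMatch("ab", [], {})
```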

regexp operator

pat || pat

~order: rx_disjunction

Matches as either the first pat or second pat. The first pat is tried first.

> rx'"a" || "b"'.match("a")
RXMatch("a", [], {})
> rx'"a" || "b"'.match("b")
RXMatch("b", [], {})
> rx'"a" || "b"'.match("c")
#false
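
To make the "tried first" ordering visible, the following sketch wraps the alternation in a named capture group (see $) so that the chosen branch shows up in the result. The group name m is only illustrative.

```rhombus
> rx'($m: "a" || "ab") any*'.match("ab")
RXMatch("ab", ["a"], {#'m: 1})
> rx'($m: "ab" || "a") any*'.match("ab")
RXMatch("ab", ["ab"], {#'m: 1})
```

In both cases the overall match succeeds. Only the portion attributed to the alternation differs, because the first branch that allows the rest of the pattern to match is kept.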

regexp operator

#%parens (pat)

~order: rx_concatenation

A parenthesized pattern is equivalent to the pat inside the parentheses. That is, parentheses are just for grouping and resolving precedence mismatches. See $ for information about capture groups, which are not implicitly created by parentheses (as they are in some traditional regexp languages).

> rx'"a" || "b" ++ "c"'.match("ac")
#false
> rx'("a" || "b") ++ "c"'.match("ac")
RXMatch("ac", [], {})

regexp operator

#%brackets [charset]

regexp operator

pat #%index [charset]

~order: rx_concatenation

A […] pattern, which is an implicit use of #%brackets, matches a single character or byte, where charset determines the matching characters or bytes. An implicit #%index form (see Implicit Forms) is treated as a sequence of a pat and #%brackets.

See Regexp Character Sets for character set forms that can be used in charset.

> rx'["a"-"z"]'.match("m")
RXMatch("m", [], {})
> rx'["a"-"z"]'.match("0")
#false
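
The implicit #%index case can be sketched the same way: a […] written directly after a pattern is expected to behave as that pattern followed by the bracketed character set. The literal "x" and the digit range are illustrative choices.

```rhombus
> rx'"x"["0"-"9"]'.match("x7")
RXMatch("x7", [], {})
> rx'"x"["0"-"9"]'.match("xy")
#false
```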

regexp operator

pat *

regexp operator

pat * mode

~order: rx_repetition

mode

~greedy

~nongreedy

~possessive

Matches a sequence of 0 or more matches to pat.

> rx'any*'.match("abc")
RXMatch("abc", [], {})
> rx'any*'.match("")
RXMatch("", [], {})

By default, the match uses ~greedy mode, where a larger number of matches is tried first—but subsequent patterns may cause backtracking to a shorter match. In ~nongreedy mode, shorter matches are tried first. The ~possessive mode is like ~greedy, but without backtracking (i.e., the longest match must succeed overall for the enclosing pattern); see also cut.

> rx'($head: any*) ($tail: any*)'.match("abc")
RXMatch("abc", ["abc", ""], {#'head: 1, #'tail: 2})
> rx'($head: any* ~nongreedy) ($tail: any*)'.match("abc")
RXMatch("abc", ["", "abc"], {#'head: 1, #'tail: 2})

> rx'any* ~greedy "z"'.match("abcz")
RXMatch("abcz", [], {})
> rx'any* ~possessive "z"'.match("abcz")
#false

regexp operator

pat +

regexp operator

pat + mode

~order: rx_repetition

Like *, but matches 1 or more instances of pat.

> rx'any+'.match("abc")
RXMatch("abc", [], {})
> rx'any+'.match("")
#false
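
Since + accepts the same mode options as *, the mode examples for * carry over with the minimum raised to one match. A brief sketch, assuming the modes behave as described for *:

```rhombus
> rx'($head: any+ ~nongreedy) ($tail: any*)'.match("abc")
RXMatch("abc", ["a", "bc"], {#'head: 1, #'tail: 2})
> rx'any+ ~possessive "z"'.match("abcz")
#false
```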

regexp operator

pat ?

regexp operator

pat ? mode

~order: rx_repetition

Similar to *, but matches 0 or 1 instances of pat.

> rx'any?'.match("a")
RXMatch("a", [], {})
> rx'any?'.match("")
RXMatch("", [], {})
> rx'any?'.match("abc")
#false

regexp operator

pat #%comp {count}

regexp operator

pat #%comp {min ..}

regexp operator

pat #%comp {min ..= max}

~order: rx_repetition

Using {…} after a pattern, which is a use of the implicit #%comp form, specifies a repetition like * or +, but more generally. If a single count is provided, it specifies an exact number of repetitions. If just min is provided, then it specifies a minimum number of repetitions, and there is no maximum. Finally, min and max can both be specified. Write 0 ..= max to provide only an upper bound. Note that the expression form ..= max creates a range that starts at #neginf, and the intent of requiring a min for a regexp repetition is to avoid suggesting that negative counts are possible. A count, min, or max must be a literal nonnegative integer.

> rx'any{2}'.match("aa")
RXMatch("aa", [], {})
> rx'any{2}'.match("aaa")
#false

> rx'any{2..}'.match("aa")
RXMatch("aa", [], {})
> rx'any{2..}'.match("aaa")
RXMatch("aaa", [], {})

> rx'any{2..=3}'.match("aa")
RXMatch("aa", [], {})
> rx'any{2..=3}'.match("aaa")
RXMatch("aaa", [], {})
> rx'any{2..=3}'.match("aaaa")
#false
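
The upper-bound-only form 0 ..= max mentioned above is not covered by the preceding examples. A short sketch with an illustrative bound of 2:

```rhombus
> rx'any{0..=2}'.match("")
RXMatch("", [], {})
> rx'any{0..=2}'.match("aa")
RXMatch("aa", [], {})
> rx'any{0..=2}'.match("aaa")
#false
```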

regexp operator

..=

Only allowed within a {…} repetition form.

regexp operator

.

regexp operator

any

regexp operator

char

regexp operator

byte

Matches a single character or byte. The . pattern matches any character or byte except a newline, while any also matches a newline. The char and byte forms are like any and also imply that the enclosing regexp matches strings or byte strings, respectively.

> rx'.'.match("a")
RXMatch("a", [], {})
> rx'.'.match("\n")
#false
> rx'any'.match("\n")
RXMatch("\n", [], {})

> rx'char'.match("\n")
RXMatch("\n", [], {})
> rx'byte'.match("\n")
RXMatch(Bytes.copy(#"\n"), [], {})

regexp operator

.* mode

regexp operator

.+ mode

regexp operator

.? mode

Equivalent to . *, . +, and . ?, but allowing the space between the operators to be omitted.

> rx'.*'.match("abc")
RXMatch("abc", [], {})
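
Because .* is shorthand for . *, it inherits the newline restriction of the . pattern. The following sketch contrasts it with any* on an input that contains a newline:

```rhombus
> rx'.*'.match("a\nb")
#false
> rx'any*'.match("a\nb")
RXMatch("a\nb", [], {})
```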

regexp operator

bof

regexp operator

bol

Matches the start of input with bof or the position after a newline with bol.

A regexp created with rx (as opposed to rx_in) is implicitly prefixed with bof for use with methods like Regexp.match (as opposed to Regexp.match_in).

> rx'bof "a"'.match_in("a")
RXMatch("a", [], {})
> rx'bol "a"'.match_in("x\na")
RXMatch("a", [], {})
> rx'bof "a"'.match_in("x\na")
#false
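
The implicit bof prefix can be observed by running one rx pattern through both methods. This is a sketch that assumes match and match_in differ only in anchoring, as described above:

```rhombus
> rx'"a"'.match("xa")
#false
> rx'"a"'.match_in("xa")
RXMatch("a", [], {})
```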

regexp operator

eof

regexp operator

eol

Matches the end of input with eof or the position before a newline with eol.

A regexp created with rx (as opposed to rx_in) is implicitly suffixed with eof for use with methods like Regexp.match (as opposed to Regexp.match_in).

> rx'"a" eof'.match_in("a")
RXMatch("a", [], {})
> rx'"a" eol'.match_in("a\nx")
RXMatch("a", [], {})
> rx'"a" eof'.match_in("a\nx")
#false
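
Combining the start and end anchors selects a whole line or the whole input. A small sketch on an illustrative three-line string:

```rhombus
> rx'bol "b" eol'.match_in("a\nb\nc")
RXMatch("b", [], {})
> rx'bof "b" eof'.match_in("a\nb\nc")
#false
```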

regexp operator

$ identifier: pat

regexp operator

$ identifier

regexp operator

$ int

regexp operator

$ expr

The $ operator is overloaded for related uses:

When followed by an identifier and a : for a block containing pat, $ creates a capture group. The portion of input that is matched against pat is recorded and associated with the name identifier. If the enclosing pattern uses pat zero or multiple times, then identifier is associated to #false if the pattern is used zero times, or it is associated to the latest match if used multiple times.

> rx'any ($m: any)'.match("ab")
RXMatch("ab", ["b"], {#'m: 1})
> rx'any ($m: any)'.match("ab")[#'m]
"b"
> rx'any ($m: any)*'.match("a")
RXMatch("a", [#false], {#'m: 1})
> def rx'any ($m: any)' = "ab"
> m
"b"

When followed by an identifier and no subsequent block, then $ is either a backreference to a named capture group, or it is a splice of a regexp that is bound to identifier.

The use of $ forms a backreference if identifier is associated to a capture group anywhere in the enclosing pattern; the backreference matches input that is the same as the most recent match for the capture group (and never matches if the capture group does not yet have a match).

> rx'any ($m: any) $m'.match("abb")
RXMatch("abb", ["b"], {#'m: 1})
> rx'any ($m: any) $m'.match("abc")
#false

When $ forms a splice, then a regular expression is formed dynamically by merging the referenced regexp into the enclosing pattern. (A limitation: both the merged regexp and enclosing pattern must be free of backreferences, because backreferences need to be converted from names to absolute positions eagerly.)

fun labeled(key) :: RX:
  rx'$key ": " $name: .*'
> labeled(rx'"fruit"').match("fruit: apple")
RXMatch("fruit: apple", ["apple"], {#'name: 1})
> labeled(rx'"veggie"').match("veggie: carrot")
RXMatch("veggie: carrot", ["carrot"], {#'name: 1})

When followed by a literal integer, then $ forms a backreference that refers to a capture group by index instead of by name. Capture groups are numbered from 1, since 0 is reserved to refer to the entire match.

> rx'any ($m: any) $1'.match("abb")
RXMatch("abb", ["b"], {#'m: 1})
> rx'any ($m: any) $1'.match("abc")
#false

When followed by an expression other than an identifier or literal integer, then $ always forms a splice.
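
As a further sketch of the splice case, a regexp bound with def can be reused several times in a larger pattern. The name digits is hypothetical, and the example assumes that digit is available as a pattern, as in the rx.macro example in this section.

```rhombus
> def digits = rx'digit+'
> rx'$digits "-" $digits'.match("12-345")
RXMatch("12-345", [], {})
> rx'$digits "-" $digits'.match("12-")
#false
```

Here $digits is a splice rather than a backreference, because digits does not name a capture group in the enclosing pattern.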

regexp operator

~~ pat

Matches pat as an unnamed capture group. The capture group’s match can only be referenced by index (counting from 1).

> rx'any ~~any any*'.match("abc")[1]
"b"
> rx'any ~~any $1'.match("abb")
RXMatch("abb", ["b"], {})

regexp operator

lookahead(pat)

regexp operator

lookbehind(pat)

regexp operator

! lookahead(pat)

regexp operator

! lookbehind(pat)

Matches an empty position in the input where the subsequent (for lookahead) or preceding (for lookbehind) input matches pat—or does not match, when a ! prefix is used.

> rx'. "a" lookahead("p")'.match_in("cat nap")
RXMatch("na", [], {})
> rx'. "a" !lookahead("t")'.match_in("cat nap")
RXMatch("na", [], {})
> rx'lookbehind("n") "a" .'.match_in("cat nap")
RXMatch("ap", [], {})
> rx'!lookbehind("c") "a" .'.match_in("cat nap")
RXMatch("ap", [], {})

regexp operator

word_boundary

regexp operator

word_continue

Matches an empty position in the input. The word_boundary pattern matches between an alphanumeric ASCII character (a-z, A-Z, or 0-9) or _ and another character that is not alphanumeric or _. The word_continue pattern matches positions that do not match word_boundary.

> rx'any+ ~nongreedy word_boundary'.match_in("cat nap")
RXMatch("cat", [], {})
> rx'any+ ~nongreedy word_continue'.match_in("cat nap")
RXMatch("c", [], {})
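
A short sketch of how the two patterns constrain where a literal may start, using the same "cat nap" input:

```rhombus
> rx'word_boundary "nap"'.match_in("cat nap")
RXMatch("nap", [], {})
> rx'word_boundary "at"'.match_in("cat nap")
#false
> rx'word_continue "at"'.match_in("cat nap")
RXMatch("at", [], {})
```

The "at" in "cat" starts in the middle of a word, so only word_continue allows a match at that position.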

regexp operator

if lookahead(pat) | then_pat | else_pat

regexp operator

if lookbehind(pat) | then_pat | else_pat

regexp operator

if ! lookahead(pat) | then_pat | else_pat

regexp operator

if ! lookbehind(pat) | then_pat | else_pat

regexp operator

if $ identifier | then_pat | else_pat

regexp operator

if $ int | then_pat | else_pat

Matches as then_pat or else_pat, depending on the form immediately after if, which must be either a lookahead, lookbehind, or backreference pattern.

> rx'($x: "x")* if $x | "s" | "."'.match_in("xxxs")
RXMatch("xxxs", ["x"], {#'x: 1})
> rx'($x: "x")* if $x | "s" | "."'.match_in(".")
RXMatch(".", [#false], {#'x: 1})
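
The examples above test a backreference. A lookahead condition can be sketched in the same style, assuming the branch syntax carries over unchanged. The literals are illustrative.

```rhombus
> rx'if lookahead("a") | "ab" | "cd"'.match("ab")
RXMatch("ab", [], {})
> rx'if lookahead("a") | "ab" | "cd"'.match("cd")
RXMatch("cd", [], {})
> rx'if lookahead("a") | "ab" | "cd"'.match("ad")
#false
```

In the last case the lookahead succeeds, so only then_pat is tried, and "ab" does not match "ad".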

regexp operator

cut

Matches an empty position in the input. The first potential match that reaches cut is the only one that is allowed to succeed. Note that a possessive repetition mode like * ~possessive is equivalent to using cut after the repetition.

In the case of a rx_in pattern or use of RX.match_in, cut applies only to a match attempt at a given input position. It does not prevent trying the match at a later position.

> rx'("ax" || "a") cut "x"'.match("ax")
#false
> rx'("a" || "ax") cut "x"'.match("ax")
RXMatch("ax", [], {})
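
The stated equivalence with possessive repetition can be sketched by placing cut after a greedy repetition, mirroring the ~possessive example given for *:

```rhombus
> rx'any* "z"'.match("abcz")
RXMatch("abcz", [], {})
> rx'any* cut "z"'.match("abcz")
#false
```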

regexp operator

bytes: pat

regexp operator

string: pat

Matches the same as pat, but specifies explicitly either byte-string mode or string mode.

> rx'string: "a"'.match("a")
RXMatch("a", [], {})
> rx'bytes: "a"'.match("a")
RXMatch(Bytes.copy(#"a"), [], {})
> rx'string: any'.match(#"\x80")
#false
> rx'bytes: any'.match(#"\x80")
RXMatch(Bytes.copy(#"\200"), [], {})

regexp operator

case_sensitive: pat

regexp operator

case_insensitive: pat

Adjusts the treatment of literal strings and ranges in pat to match case-sensitive (the default) or case-insensitive. In case-insensitive mode, characters are folded individually (as opposed to folding a string sequence, which can change its length).

> rx'"hello"'.match("HELLO")
#false
> rx'case_insensitive: "hello"'.match("HELLO")
RXMatch("HELLO", [], {})
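
Since the mode is described as applying to ranges as well as literal strings, the following sketch applies it to a lowercase range. The expected results assume that the range is folded as stated above.

```rhombus
> rx'["a"-"z"]+'.match("Hello")
#false
> rx'case_insensitive: ["a"-"z"]+'.match("Hello")
RXMatch("Hello", [], {})
```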

regexp operator

alpha, etc.

Each of these names is bound both as a character set and as a pattern that can be used directly, instead of wrapping in […]. See the alpha, etc., character set for more information.

> rx'alpha'.match("m")
RXMatch("m", [], {})
> rx'alpha'.match("0")
#false
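
Because these names are patterns in their own right, they should combine with other pattern operators such as repetition. A sketch, assuming digit is one of these names, as its use in the rx.macro example in this section suggests:

```rhombus
> rx'alpha+ digit+'.match("abc123")
RXMatch("abc123", [], {})
> rx'alpha+'.match("abc123")
#false
```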

operator order

operator_order.def rx_repetition

operator order

operator_order.def rx_subtraction

operator order

operator_order.def rx_enumeration

operator order

operator_order.def rx_conjunction:

~weaker_than:

rx_repetition

rx_enumeration

operator order

operator_order.def rx_disjunction:

~weaker_than:

rx_conjunction

rx_repetition

rx_enumeration

operator order

operator_order.def rx_concatenation:

~weaker_than:

~other

~stronger_than:

rx_conjunction

rx_disjunction

Operator orders for regexp and character set operators.

value

def rx_meta.space :: SpaceMeta

Provided as meta.

A compile-time value that identifies the same space as rx. See also SpaceMeta.

definition

rx.macro macro_patterns

Like expr.macro, but defines a new regexp operator.

rx.macro 'upto_e($(n :: Int))':
  let n = n.unwrap()
  if n == 1
  | 'digit'
  | '["1"-"9"] digit{$(n-1)} || upto_e($(n-1))'
rx.macro 'pct':
  '("100" || upto_e(2)) "%"'

> rx'pct "/" pct "/" pct'.is_match("1%/42%/100%")
#true

syntax class

syntax_class rx_meta.Parsed

syntax class

syntax_class rx_meta.AfterPrefixParsed(name :: Name)

syntax class

syntax_class rx_meta.AfterInfixParsed(name :: Name)

Provided as meta.

Analogous to expr_meta.Parsed, etc., but for regexp patterns.
