12.2.2 Regexp Character Sets

8.18.0.1

See also Regexp Quick Reference.

A character set is written with […] in a regexp pattern (via the implicit #%brackets operator). A character set represents a set of Chars, but as long as the characters range in Unicode value from 0 to 255, a character set can be used as a set of bytes to match for a byte-mode regexp.

space

rx_charset

The space for character set operators that can be used within […] in a regexp pattern.

regexp charset operator

#%literal string

regexp charset operator

#%literal bytes

A literal string or byte string can be used as a character set. Each character or byte is part of the set.

> rx'["a"]'.is_match("a")
#true
> rx'["a"]'.is_match("b")
#false
> rx'["abc"]'.is_match("b")
#true

regexp charset operator

charset #%juxtapose charset

regexp charset operator

charset || charset

regexp charset operator

charset #%call (charset)

~order: rx_concatenation

Character sets that are adjacent or joined with || form a larger character set that includes all combined elements, i.e., a union of the sets. An implicit #%call form is treated like #%juxtapose, consistent with implicit uses of parentheses for grouping as handled by #%parens.

> rx'["a" "b"]'.is_match("a")
#true
> rx'["a" "b"]'.is_match("b")
#true
> rx'["a" "b"]'.is_match("c")
#false
> rx'["a" || "b"]'.is_match("b")
#true
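As a further illustration of union, the sketch below joins two multi-character literal sets, once with || and once through the implicit #%call form. The expected results are derived from the rules stated above, not taken from a recorded session.

```rhombus
> rx'["ab" || "cd"]'.is_match("c")
#true
> rx'["ab"("cd")]'.is_match("c")
#true
> rx'["ab"("cd")]'.is_match("e")
#false
```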

regexp charset operator

#%parens (charset)

~order: rx_concatenation

A parenthesized character set is equivalent to the charset inside the parentheses. That is, parentheses are just for grouping and resolving precedence mismatches.

> rx'["a" "b" "c"]'.is_match("a")
#true
> rx'[("a" "b") "c"]'.is_match("a")
#true

regexp charset operator

charset - charset

~order: rx_enumeration

Assuming that each charset contains a single character, creates a charset that has those two characters and all characters in between (based on Char.to_int values). An error is reported if either charset has zero or multiple characters.

> rx'["a" - "y"]'.is_match("a")
#true
> rx'["a" - "y"]'.is_match("x")
#true
> rx'["a" - "y"]'.is_match("z")
#false

regexp charset operator

charset && charset

~order: rx_conjunction

Creates a character set that has each character in both the first charset and the second charset, i.e., an intersection of the sets.

> rx'[("a" - "f") && ("c" - "h")]'.is_match("a")
#false
> rx'[("a" - "f") && ("c" - "h")]'.is_match("d")
#true

regexp charset operator

charset -- charset

~order: rx_subtraction

Creates a character set that starts with the characters of the first charset and removes each character of the second charset, i.e., set difference.

> rx'[("a" - "z") -- ("m" - "p")]'.is_match("n")
#false
> rx'[("a" - "z") -- ("m" - "p")]'.is_match("a")
#true

regexp charset operator

! charset

~weaker_than: ~other

Inverts charset by creating a character set that has every character not in charset.

> rx'[! "a" - "z"]'.is_match("n")
#false
> rx'[! "a" - "z"]'.is_match("0")
#true
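Because ! binds more weakly than the other charset operators, it applies to everything that follows it unless parentheses say otherwise. The following sketch contrasts the two groupings; the expected results are worked out from that precedence rule and should be read as an illustration, not as recorded output.

```rhombus
> rx'[! "a" "b"]'.is_match("b")
#false
> rx'[! "a" "b"]'.is_match("c")
#true
> rx'[(! "a") "b"]'.is_match("b")
#true
```

In the last pattern, the parentheses limit the inversion to "a", so the union with "b" still contains b.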

regexp charset operator

any

A character set that has all characters.

> rx'[any]'.is_match("a")
#true

regexp charset operator

alpha

regexp charset operator

upper

regexp charset operator

lower

The alpha character set has all ASCII letters: a-z and A-Z. The upper character set has just A-Z, while the lower character set has just a-z.

> rx'[alpha]'.is_match("a")
#true
> rx'[alpha]'.is_match("0")
#false
> rx'[alpha]'.is_match("λ")
#false

> rx'[upper]'.is_match("A")
#true
> rx'[upper]'.is_match("a")
#false

> rx'[lower]'.is_match("a")
#true
> rx'[lower]'.is_match("A")
#false

regexp charset operator

digit

regexp charset operator

xdigit

The digit character set has all ASCII digits: 0-9. The xdigit character set adds the remaining hexadecimal digits: a-f and A-F.

> rx'[digit]'.is_match("0")
#true
> rx'[digit]'.is_match("a")
#false

> rx'[xdigit]'.is_match("0")
#true
> rx'[xdigit]'.is_match("a")
#true
> rx'[xdigit]'.is_match("z")
#false

regexp charset operator

alnum

regexp charset operator

word

The alnum character set has all ASCII letters and digits: 0-9, a-z, and A-Z. The word character set adds _.

> rx'[alnum]'.is_match("0")
#true
> rx'[alnum]'.is_match("z")
#true
> rx'[alnum]'.is_match("_")
#false
> rx'[word]'.is_match("_")
#true
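Since word differs from alnum only by _, subtracting one from the other should leave just that character. This is a small sketch combining the named sets with the -- operator described earlier; the results are derived from the definitions above.

```rhombus
> rx'[word -- alnum]'.is_match("_")
#true
> rx'[word -- alnum]'.is_match("a")
#false
```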

regexp charset operator

newline

regexp charset operator

blank

regexp charset operator

space

The newline character set has just the newline character (Char.to_int value 10). The blank character set has space (Char.to_int value 32) and tab (Char.to_int value 9). The space character set combines those and adds return (Char.to_int value 13) and form feed (Char.to_int value 12).

> rx'[blank]'.is_match(" ")
#true
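A few more cases may help separate the three sets. The sketch below assumes the usual string escapes "\t", "\n", and "\r" for tab, newline, and return; the expected results follow from the membership described above.

```rhombus
> rx'[newline]'.is_match("\n")
#true
> rx'[blank]'.is_match("\t")
#true
> rx'[blank]'.is_match("\n")
#false
> rx'[space]'.is_match("\n")
#true
> rx'[space]'.is_match("\r")
#true
```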

regexp charset operator

graph

regexp charset operator

print

The graph character set has all ASCII characters that print with ink. The print character set adds space (Char.to_int value 32) and tab (Char.to_int value 9).
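The difference between graph and print shows up on whitespace. The following sketch is derived from the description above, not from a recorded session.

```rhombus
> rx'[graph]'.is_match("a")
#true
> rx'[graph]'.is_match(" ")
#false
> rx'[print]'.is_match(" ")
#true
> rx'[print]'.is_match("\n")
#false
```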

regexp charset operator

cntrl

The cntrl character set has all ASCII control characters (Char.to_int values 0 through 31).

> rx'[cntrl]'.is_match("\n")
#true
> rx'[cntrl]'.is_match("a")
#false

regexp charset operator

ascii

regexp charset operator

latin1

The ascii character set has all ASCII characters (Char.to_int values 0 through 127), and the latin1 character set has all Latin-1 characters (Char.to_int values 0 through 255).

> rx'[ascii]'.is_match("a")
#true
> rx'[ascii]'.is_match("é")
#false
> rx'[latin1]'.is_match("é")
#true
> rx'[latin1]'.is_match("λ")
#false

regexp charset operator

unicode.Ll

regexp charset operator

unicode.Lu

regexp charset operator

unicode.Lt

regexp charset operator

unicode.Lm

regexp charset operator

unicode.Lx

regexp charset operator

unicode.Lo

regexp charset operator

unicode.L

regexp charset operator

unicode.Nd

regexp charset operator

unicode.Nl

regexp charset operator

unicode.No

regexp charset operator

unicode.N

regexp charset operator

unicode.Ps

regexp charset operator

unicode.Pe

regexp charset operator

unicode.Pi

regexp charset operator

unicode.Pf

regexp charset operator

unicode.Pc

regexp charset operator

unicode.Pd

regexp charset operator

unicode.Po

regexp charset operator

unicode.P

regexp charset operator

unicode.Mn

regexp charset operator

unicode.Mc

regexp charset operator

unicode.Me

regexp charset operator

unicode.M

regexp charset operator

unicode.Sc

regexp charset operator

unicode.Sk

regexp charset operator

unicode.Sm

regexp charset operator

unicode.So

regexp charset operator

unicode.S

regexp charset operator

unicode.Zl

regexp charset operator

unicode.Zp

regexp charset operator

unicode.Zs

regexp charset operator

unicode.Z

regexp charset operator

unicode.Cc

regexp charset operator

unicode.Cf

regexp charset operator

unicode.Cs

regexp charset operator

unicode.Cn

regexp charset operator

unicode.Co

regexp charset operator

unicode.C

Each of these character sets contains all Unicode characters that have the named general category, such as Ll for lowercase letters. Each single-letter name, such as unicode.L, unions all of the other general categories that start with the same letter. The unicode.Lx character set unions unicode.Ll, unicode.Lu, unicode.Lt, and unicode.Lm.

> rx'[unicode.Ll]'.is_match("λ")
#true
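A few additional cases, sketched from the Unicode general categories of the characters involved (Λ is an uppercase letter, category Lu; 7 is a decimal digit, category Nd):

```rhombus
> rx'[unicode.Lu]'.is_match("Λ")
#true
> rx'[unicode.Ll]'.is_match("Λ")
#false
> rx'[unicode.L]'.is_match("Λ")
#true
> rx'[unicode.Lx]'.is_match("λ")
#true
> rx'[unicode.Nd]'.is_match("7")
#true
```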

value

def rx_charset_meta.space :: SpaceMeta

Provided as meta.

A compile-time value that identifies the same space as rx_charset. See also SpaceMeta.

definition

rx_charset.macro macro_patterns

Like expr.macro, but defines a character set operator.

rx_charset.macro 'octal': '"0"-"7"'
rx_charset.macro 'maybe $charset': '$charset "?"'

> rx'[maybe(octal) "!"]*'.match("3?!4")
RXMatch("3?!4", [], {})
> rx'[maybe(octal)]'.match("8")
#false
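A macro-defined character set should combine with the built-in operators like any other. The sketch below defines a hypothetical vowel set (the name is an assumption made for illustration) in the same style as octal above and subtracts it from lower; the expected results are derived from the set-difference rule.

```rhombus
rx_charset.macro 'vowel': '"aeiou"'

> rx'[vowel]'.is_match("e")
#true
> rx'[lower -- vowel]'.is_match("e")
#false
> rx'[lower -- vowel]'.is_match("z")
#true
```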

syntax class

syntax_class rx_charset_meta.Parsed

syntax class

syntax_class rx_charset_meta.AfterPrefixParsed(name :: Name)

syntax class

syntax_class rx_charset_meta.AfterInfixParsed(name :: Name)

Provided as meta.

Analogous to expr_meta.Parsed, etc., but for regexp character sets.
