12 Regular Expressions

 (require scramble/regexp) package: scramble-lib

Added in version 0.3 of package scramble-lib.

This module works with the following S-expression representation of regular expressions. All literals in the grammar are recognized as symbols, not by binding.

  RE = RE-id
  | (or RE ...+)                 ; like <RE>|<RE>
  | (cat RE ...)                 ; like <RE><RE>
  | (repeat RE)                  ; like <RE>*
  | (repeat RE n)                ; like <RE>{n}
  | (repeat RE m n)              ; like <RE>{m,n}
  | (* RE)                       ; like <RE>*
  | (+ RE)                       ; like <RE>+
  | (? RE)                       ; like <RE>?
  | (report RE)                  ; like (<RE>)
  | (any)                        ; like .
  | ^                            ; like ^
  | $                            ; like $
  | (mode modes-string RE)       ; like (?<modes>:<RE>)
  | (test tst RE)                ; like (?<tst><RE>)
  | (test tst RE RE)             ; like (?<tst><RE>|<RE>)
  | (unicode prop-string)        ; like \p{<prop>}
  | (unicode (not prop-string))  ; like \P{<prop>}
  | (chars CharSet ...)          ; like [<CharSet>]
  | Look
  | literal-string
  | (inject pregexp-string)
     
  CharSet = (union CharSet ...)
  | (intersect CharSet ...)
  | (complement CharSet ...)
  | chars-string
  | CharRange                    ; eg, [#\A #\Z]
  | char/integer                 ; eg, #\A, 65
  | RE-id                        ; if value is CharSet
  | posix-charset-id             ; eg, alpha, space
     
  CharRange = [lo:char/integer hi:char/integer]
     
  Test = Look
  | (matched? n)
     
  Look = (look RE)                    ; like (?=<RE>)
  | (look (not RE))              ; like (?!<RE>)
  | (look-back RE)               ; like (?<=<RE>)
  | (look-back (not RE))         ; like (?<!<RE>)

Changed in version 0.5 of package scramble-lib: Added any.

The forms of RE should mostly be self-explanatory, but a few of them deserve additional comments:

RE-id

If RE-id was defined using define-RE, then its RE value is inserted in place of RE-id; otherwise, a syntax error is raised.

If an RE-id is defined with the same name as one of the unparenthesized RE forms (namely, ^ or $) or one of the POSIX character classes (eg, alpha), then the RE-id takes precedence.

(repeat RE m n)

Matches RE between m and n times (inclusive), where m must be a natural number and n must be a natural number or +inf.0.

  • (repeat RE n) is equivalent to (repeat RE n n)

  • (* RE) and (repeat RE) are both equivalent to (repeat RE 0 +inf.0)

  • (+ RE) is equivalent to (repeat RE 1 +inf.0)

  • (? RE) is equivalent to (repeat RE 0 1)
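The equivalences above can be spot-checked on a few sample strings. The following is only a sketch, assuming scramble-lib is installed; the helper same-matches? and the sample strings are illustrative names, not part of the library.

```racket
#lang racket
(require scramble/regexp) ; provided by the scramble-lib package

;; Sample inputs chosen for illustration only.
(define samples '("" "a" "aa" "aaa" "aaaa"))

;; Hypothetical helper: do two regexps agree on every sample
;; when required to match the whole string?
(define (same-matches? re1 re2)
  (for/and ([s (in-list samples)])
    (equal? (regexp-match-exact? re1 s)
            (regexp-match-exact? re2 s))))

(define star-ok?  (same-matches? (px (* "a")) (px (repeat "a" 0 +inf.0))))
(define plus-ok?  (same-matches? (px (+ "a")) (px (repeat "a" 1 +inf.0))))
(define opt-ok?   (same-matches? (px (? "a")) (px (repeat "a" 0 1))))
(define exact-ok? (same-matches? (px (repeat "a" 2)) (px (repeat "a" 2 2))))

;; Bounds are inclusive: of the samples, only "aa" and "aaa" match.
(define bounded
  (for/list ([s (in-list samples)])
    (regexp-match-exact? (px (repeat "a" 2 3)) s)))
```

Agreement on a handful of strings does not prove equivalence; it only illustrates how the shorthand forms relate to the general (repeat RE m n) form.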

(chars CharSet ...)

Interprets (union CharSet ...) as a set of characters. The resulting set of characters must be non-empty; otherwise, a syntax error is raised. Generation of the pregexp literal depends on only the set of characters, not how it was originally expressed.

chars-string

Represents the set of characters appearing in the string. No character in the string is interpreted specially. For example, - represents the character #\-; it is not interpreted as a range.

Note that a literal-string used as an RE is treated differently.

literal-string

A string RE is treated as the concatenation (cat) of singleton character sets that matches exactly that string. Special characters in the string are escaped when the pregexp is generated. For example:
> (px "[ab]*z?")

#px"\\[ab\\]\\*z\\?"

> (regexp-match-exact? (px "[ab]*z?") "[ab]*z?")

#t

Note that a chars-string used as a CharSet is treated differently.
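To make the contrast between the two string interpretations concrete, here is a small sketch that uses the same string in both positions. It assumes scramble-lib is installed; the names set-re, lit-re, and inputs are illustrative.

```racket
#lang racket
(require scramble/regexp) ; provided by the scramble-lib package

;; As a CharSet, "a-z" is the set of three characters a, -, and z;
;; the - is not read as a range.
(define set-re (px (chars "a-z")))

;; As an RE, "a-z" matches exactly the three-character string a-z.
(define lit-re (px "a-z"))

(define inputs '("a" "-" "z" "m" "a-z"))

;; Whole-string match results for each input.
(define set-results
  (for/list ([s (in-list inputs)]) (regexp-match-exact? set-re s)))
(define lit-results
  (for/list ([s (in-list inputs)]) (regexp-match-exact? lit-re s)))
```

Under the rules described above, set-re should accept the single characters a, -, and z but reject m, while lit-re should accept only the full string a-z.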

(inject pregexp-string)

Injects the given pregexp-string into the generated output. It is treated as having lowest precedence, so it will be wrapped if it occurs within a higher-precedence operator. For example:
> (px (* (inject "[ab]")))

#px"(?:[ab])*"
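The wrapping matters most when the injected text contains an alternation. The sketch below, which assumes scramble-lib is installed and uses illustrative names, concatenates a literal with an injected a|b and checks that the alternation stays confined to the injected part.

```racket
#lang racket
(require scramble/regexp) ; provided by the scramble-lib package

;; px concatenates its parts, so the injected pregexp text sits inside a cat.
;; Because inject has lowest precedence, it is expected to be wrapped, so the
;; result should behave like x followed by (a or b), not like (xa) or (b).
(define re (px "x" (inject "a|b")))

(define results
  (for/list ([s (in-list '("xa" "xb" "b" "x"))])
    (regexp-match-exact? re s)))
```

If the injected text were spliced in without wrapping, "xb" would fail to match and "b" would succeed, so this kind of check is a cheap way to confirm how an injected fragment composes.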

syntax

(px maybe-mode part-RE ...)

 
maybe-mode = 
  | #:byte

Converts the RE formed by (cat part-RE ...) into a regexp value. If the #:byte keyword is used, then the literal is created with byte-pregexp; otherwise, it is created with pregexp.

The generation of the pregexp literal takes precedence into account and inserts (?:_) wrappers as necessary. For example:
> (px (cat "A" (or "BB" "CCC")))

#px"A(?:BB|CCC)"

> (px #:byte (repeat "BB" 3))

#px#"(?:BB){3}"
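As a rough illustration of what the #:byte keyword changes, the following sketch builds the same RE in both modes and inspects the results with the standard predicates pregexp? and byte-pregexp?. It assumes scramble-lib is installed; char-re and byte-re are illustrative names.

```racket
#lang racket
(require scramble/regexp) ; provided by the scramble-lib package

(define char-re (px (repeat "BB" 3)))        ; created with pregexp
(define byte-re (px #:byte (repeat "BB" 3))) ; created with byte-pregexp

;; Each value satisfies only the predicate for its own kind.
(define char-kinds (list (pregexp? char-re) (byte-pregexp? char-re)))
(define byte-kinds (list (pregexp? byte-re) (byte-pregexp? byte-re)))

;; A string regexp applied to a string yields strings;
;; a byte regexp applied to a byte string yields byte strings.
(define char-match (regexp-match char-re "xBBBBBBx"))
(define byte-match (regexp-match byte-re #"xBBBBBBx"))
```

Byte mode is mainly useful when matching against byte strings or ports whose contents should not be decoded as UTF-8.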

Changed in version 0.5 of package scramble-lib: Added #:byte mode.

syntax

(rx maybe-mode part-RE ...)

 
maybe-mode = 
  | #:byte

Like px, but produces a regexp literal instead. Not all RE features can be expressed as a regexp-style regular expression. For example, a repeat RE with custom bounds cannot be expressed if it contains a report sub-RE. If such a feature is used, a syntax error is raised. A syntax error is also raised if a character set has a range endpoint that is a special character such as #\- or #\]; it is possible to express such character sets in regexp-style regular expressions, but this library currently does not support it.

> (rx (cat "A" (or "BB" "CCC")))

#rx"A(?:BB|CCC)"

> (rx (repeat (or "a" "b") 2 5))

#rx"[ab][ab](?:(?:(?:[ab])?[ab])?[ab])?"

> (rx (repeat (report "a") 2 5))

eval:9:0: rx: cannot handle report inside of repeat with custom bounds
  in: (rx (repeat (report "a") 2 5))

> (rx (repeat (report "a") 1 +inf.0))

#rx"(a)+"

> (rx (+ (chars alpha digit)))

#rx"[0-9A-Za-z]+"

Changed in version 0.5 of package scramble-lib: Added #:byte mode.
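For REs that both forms can express, px and rx should differ in the kind of regexp value produced rather than in what is matched. A minimal sketch, assuming scramble-lib is installed and using illustrative names:

```racket
#lang racket
(require scramble/regexp) ; provided by the scramble-lib package

;; The same RE compiled in pregexp style and in regexp style.
(define p (px (+ (chars alpha digit))))
(define r (rx (+ (chars alpha digit))))

;; regexp? accepts both kinds of string regexp; pregexp? only the px result.
(define kinds
  (list (regexp? p) (pregexp? p) (regexp? r) (pregexp? r)))

;; On these sample inputs the two values agree.
(define agree?
  (for/and ([s (in-list '("abc123" "abc_123" ""))])
    (equal? (regexp-match-exact? p s)
            (regexp-match-exact? r s))))
```

Because rx rejects unsupported features with a syntax error at expansion time, an RE that compiles under rx can be checked against its px counterpart in this way.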

syntax

(define-RE name maybe-mode rhs-RE)

 
maybe-mode = 
  | #:byte

Defines name as a name bound to a compile-time regular expression; name can be used in RE forms as an abbreviation to stand for rhs-RE.

If name is used as an expression, it expands to rhs-RE’s corresponding regular-expression literal. If the #:byte option is present, then a byte-pregexp literal is produced; otherwise, a pregexp literal is produced. The mode declaration does not affect uses of name within other RE forms.

Examples:
> (define-RE As (* "A"))
> As

#px"A*"

> (define-RE BBs #:byte (* "BB"))
> BBs

#px#"(?:BB)*"

> (px (or As BBs))

#px"A*|(?:BB)*"

Changed in version 0.5 of package scramble-lib: Added #:byte mode.
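Names defined with define-RE can be layered to build larger patterns out of smaller ones. The following sketch assumes scramble-lib is installed; digits, year-month, and anchored are hypothetical names chosen for illustration.

```racket
#lang racket
(require scramble/regexp) ; provided by the scramble-lib package

;; A small building block: one or more decimal digits.
(define-RE digits (+ (chars digit)))

;; A larger RE that refers to digits by name and reports two groups.
(define-RE year-month (cat (report digits) "-" (report digits)))

;; Used as an expression, the name stands for its pregexp literal.
(define m (regexp-match year-month "2024-05"))

;; Used inside another RE form, the name stands for its RE.
(define anchored (px ^ year-month $))
(define anchored-results
  (for/list ([s (in-list '("2024-05" "x2024-05"))])
    (regexp-match? anchored s)))
```

Keeping each piece as a named RE, rather than as a string spliced in with inject, lets the library handle escaping and precedence when the pieces are combined.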