12 Regular Expressions
(require scramble/regexp)    package: scramble-lib
Added in version 0.3 of package scramble-lib.
This module works with the following S-expression representation of regular expressions. All literals in the grammar are recognized as symbols, not by binding.
RE        = RE-id
          | (or RE ...+)                 ; like <RE>|<RE>
          | (cat RE ...)                 ; like <RE><RE>
          | (repeat RE)                  ; like <RE>*
          | (repeat RE n)                ; like <RE>{n}
          | (repeat RE m n)              ; like <RE>{m,n}
          | (* RE)                       ; like <RE>*
          | (+ RE)                       ; like <RE>+
          | (? RE)                       ; like <RE>?
          | (report RE)                  ; like (<RE>)
          | (any)                        ; like .
          | ^                            ; like ^
          | $                            ; like $
          | (mode modes-string RE)       ; like (?<modes>:<RE>)
          | (test tst RE)                ; like (?<tst><RE>)
          | (test tst RE RE)             ; like (?<tst><RE>|<RE>)
          | (unicode prop-string)        ; like \p{<prop>}
          | (unicode (not prop-string))  ; like \P{<prop>}
          | (chars CharSet ...)          ; like [<CharSet>]
          | Look
          | literal-string
          | (inject pregexp-string)

CharSet   = (union CharSet ...)
          | (intersect CharSet ...)
          | (complement CharSet ...)
          | chars-string
          | CharRange                    ; eg, [#\A #\Z]
          | char/integer                 ; eg, #\A, 65
          | RE-id                        ; if value is CharSet
          | posix-charset-id             ; eg, alpha, space

CharRange = [lo:char/integer hi:char/integer]

Test      = Look
          | (matched? n)

Look      = (look RE)                    ; like (?=<RE>)
          | (look (not RE))              ; like (?!<RE>)
          | (look-back RE)               ; like (?<=<RE>)
          | (look-back (not RE))         ; like (?<!<RE>)
Changed in version 0.5 of package scramble-lib: Added any.
Most of the RE forms are self-explanatory, but a few deserve additional comments:
RE-id If RE-id was defined using define-RE, then its RE value is inserted in place of RE-id; otherwise, a syntax error is raised.
If an RE-id is defined with the same name as one of the unparenthesized RE forms (namely, ^ or $) or one of the POSIX character classes (eg, alpha), then the RE-id takes precedence.
(repeat RE m n) Matches RE between m and n times (inclusive), where m must be a natural number and n must be a natural number or +inf.0.
(repeat RE n) is equivalent to (repeat RE n n).
(* RE) and (repeat RE) are both equivalent to (repeat RE 0 +inf.0).
(+ RE) is equivalent to (repeat RE 1 +inf.0).
(? RE) is equivalent to (repeat RE 0 1).
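The equivalences above can be checked with a small sketch. It uses only forms from the grammar; the names star-re, repeat-re, opt-re, and two-re are illustrative and not part of the library.

```racket
#lang racket/base
(require scramble/regexp)

;; Per the equivalences above, star-re and repeat-re should accept the same strings.
(define star-re   (px (* "ab")))
(define repeat-re (px (repeat "ab" 0 +inf.0)))
;; (? RE) allows zero or one occurrence; (repeat RE 2) requires exactly two.
(define opt-re    (px (? "ab")))
(define two-re    (px (repeat "ab" 2)))

;; Print, for each sample string, which of the four patterns match it exactly.
(for ([s (in-list '("" "ab" "abab" "ababab"))])
  (printf "~s  *: ~a  repeat: ~a  ?: ~a  exactly-2: ~a\n"
          s
          (regexp-match-exact? star-re s)
          (regexp-match-exact? repeat-re s)
          (regexp-match-exact? opt-re s)
          (regexp-match-exact? two-re s)))
```

Only "abab" should be accepted by all of the first, second, and fourth patterns, and only "" and "ab" by the third.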
(chars CharSet ...) Interprets (union CharSet ...) as a set of characters. The resulting set of characters must be non-empty; otherwise, a syntax error is raised. Generation of the pregexp literal depends only on the set of characters, not on how it was originally expressed.
chars-string Represents the set of characters appearing in the string. No character in the string is interpreted specially. For example, - represents the character #\-; it is not interpreted as a range.
Note that an RE literal-string is treated differently.
literal-string A string RE is treated as the concatenation (cat) of singleton character sets that matches exactly that string. Special characters in the string are escaped when the pregexp is generated. For example:
> (px "[ab]*z?")
#px"\\[ab\\]\\*z\\?"
> (regexp-match-exact? (px "[ab]*z?") "[ab]*z?")
#t
Note that a CharSet chars-string is treated differently.
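To make the contrast concrete, the following sketch uses the same string, "a-c", once as a chars-string and once as a literal-string. The names set-re and lit-re are illustrative only.

```racket
#lang racket/base
(require scramble/regexp)

;; As a CharSet, "a-c" denotes the three characters a, -, and c (not a range).
(define set-re (px (chars "a-c")))
;; As an RE, "a-c" matches exactly the three-character string "a-c".
(define lit-re (px "a-c"))

(regexp-match-exact? set-re "-")    ; the hyphen is a member of the set
(regexp-match-exact? set-re "b")    ; b is not, since no range is implied
(regexp-match-exact? lit-re "a-c")  ; the whole string, character by character
(regexp-match-exact? lit-re "a")    ; a single character is not enough
```

Following the descriptions above, the first and third calls should return #t and the other two #f.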
(inject pregexp-string) Injects the given pregexp-string into the generated output. It is treated as having lowest precedence, so it will be wrapped if it occurs within a higher-precedence operator. For example:
> (px (* (inject "[ab]")))
#px"(?:[ab])*"
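The wrapping matters most when the injected pattern contains a low-precedence operator such as alternation. A minimal sketch, assuming the injected string is wrapped in a non-capturing group as in the example above (alt-star is an illustrative name):

```racket
#lang racket/base
(require scramble/regexp)

;; If the injected pattern were spliced in unwrapped, the result would read
;; ab|cd*, where * applies only to d. With wrapping, * applies to the whole
;; alternation.
(define alt-star (px (* (inject "ab|cd"))))

(regexp-match-exact? alt-star "abcdab")  ; a repeated mix of both alternatives
(regexp-match-exact? alt-star "cddd")    ; what the unwrapped form would accept
```

With the wrapping described above, the first call should return #t and the second #f.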
syntax
(px maybe-mode part-RE ...)
maybe-mode =
| #:byte
Changed in version 0.5 of package scramble-lib: Added #:byte mode.
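The following is a usage sketch for px rather than a definitive description. It rests on two assumptions that this section does not state outright: that several part-REs are matched in sequence, as with (cat part-RE ...), and that #:byte makes px produce a byte-pregexp in the same way it does for define-RE. The names year-month-re and bb-bytes-re are illustrative.

```racket
#lang racket/base
(require scramble/regexp)

;; Assumption: the part-REs are matched one after another, like (cat ...).
(define year-month-re
  (px ^ (repeat (chars digit) 4) "-" (repeat (chars digit) 2) $))

;; Assumption: with #:byte, px yields a byte-pregexp literal.
(define bb-bytes-re (px #:byte (+ "BB")))

(regexp-match? year-month-re "2024-05")     ; four digits, hyphen, two digits
(regexp-match? year-month-re "2024-5")      ; month has only one digit
(byte-pregexp? bb-bytes-re)                 ; checks the kind of regexp value
(regexp-match-exact? bb-bytes-re #"BBBB")   ; matched against a byte string
```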
syntax
(rx maybe-mode part-RE ...)
maybe-mode =
| #:byte
> (rx (cat "A" (or "BB" "CCC")))
#rx"A(?:BB|CCC)"
> (rx (repeat (or "a" "b") 2 5))
#rx"[ab][ab](?:(?:(?:[ab])?[ab])?[ab])?"
> (rx (repeat (report "a") 2 5))
eval:9:0: rx: cannot handle report inside of repeat with custom bounds
  in: (rx (repeat (report "a") 2 5))
> (rx (repeat (report "a") 1 +inf.0))
#rx"(a)+"
> (rx (+ (chars alpha digit)))
#rx"[0-9A-Za-z]+"
Changed in version 0.5 of package scramble-lib: Added #:byte mode.
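The unrolled output for (repeat (or "a" "b") 2 5) above can be hard to read at a glance. As a quick sanity check, this sketch lists the lengths of all-a strings that the generated regexp accepts; ab-2-to-5 and accepted-lengths are illustrative names.

```racket
#lang racket/base
(require scramble/regexp)

(define ab-2-to-5 (rx (repeat (or "a" "b") 2 5)))

;; Collect every length n in 0..7 for which a string of n a's matches exactly.
(define (accepted-lengths)
  (for/list ([n (in-range 0 8)]
             #:when (regexp-match-exact? ab-2-to-5 (make-string n #\a)))
    n))

(accepted-lengths)
```

If the unrolling preserves the bounds, the result should be the list of lengths from 2 through 5.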
syntax
(define-RE name maybe-mode rhs-RE)
maybe-mode =
| #:byte
If name is used as an expression, it expands to rhs-RE’s corresponding regular-expression literal. If the #:byte option is present, then a byte-pregexp literal is produced; otherwise, a pregexp literal is produced. The mode declaration does not affect uses of name within other RE forms.
> (define-RE As (* "A"))
> As
#px"A*"
> (define-RE BBs #:byte (* "BB"))
> BBs
#px#"(?:BB)*"
> (px (or As BBs))
#px"A*|(?:BB)*"
Changed in version 0.5 of package scramble-lib: Added #:byte mode.
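As a further sketch, define-RE can be used to build a larger pattern out of named pieces. The names hex-digit and hex-color are illustrative, and the example relies only on the behavior described above: an RE-id is replaced by its RE value inside another RE, and expands to a pregexp literal when used as an expression.

```racket
#lang racket/base
(require scramble/regexp)

;; A character set built from a POSIX class and two CharRanges.
(define-RE hex-digit (chars digit [#\a #\f] [#\A #\F]))
;; hex-digit is used here as an RE-id inside another RE.
(define-RE hex-color (cat "#" (repeat hex-digit 6)))

;; Used as an expression, hex-color stands for its pregexp literal.
(regexp-match-exact? hex-color "#1a2B3c")  ; six hex digits after the #
(regexp-match-exact? hex-color "#12345")   ; only five digits
(regexp-match-exact? hex-color "#12345g")  ; g is not a hex digit
```

Only the first call should return #t.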