fuzzy-search
This module provides approximate string matching procedures loosely based on Sublime Text’s approach.
1 Definitions
In the context of this collection, the following terms apply:
needle: A user-defined string used as an input search pattern.
haystack: A string that may contain needles.
match: A position in a haystack where a needle’s character appears.
score: A user-defined exact integer representing how “well” a needle matches a haystack.
scoring procedure: A procedure that returns a score.
2 Approximate Matching
(require fuzzy-search) | package: fuzzy-search |
procedure
(fuzzy-search needle haystack [ assign-score #:max-match-count max-match-count #:max-depth max-depth #:case-sensitive? case-sensitive?])
→
boolean? exact-integer?
(hash/c exact-nonnegative-integer? exact-nonnegative-integer? #:immutable #t) needle : string? haystack : string? assign-score : score/c = forrest-skinner max-match-count : exact-positive-integer? = 256 max-depth : exact-positive-integer? = 10 case-sensitive? : any/c = #f
#t if a match was found.
The value of (assign-score H haystack (string-length haystack) (length (hash-keys H))), where H is the returned hash table.
A hash table H, such that (string-ref haystack (hash-ref H N)) returns the first character of the Nth (zero-based) match. This hash table will produce the highest score for the given haystack.
The search will record a match for every contiguous character in needle found in haystack. If the number of matches exceeds max-match-count, then the search will conclude as if no match occurred, and with a score of 0.
Recursive searches in substrings of haystack will commence in order to find higher scoring matches. max-depth limits the recursion depth of the search. While searches are themselves called in tail position, max-depth can still help reduce the runtime of the search.
When case-sensitive? is #f, characters from the needle and haystack are first normalized with char-downcase before being compared.
value
score/c :
(-> (hash/c exact-nonnegative-integer? exact-nonnegative-integer? #:immutable #t) string? exact-positive-integer? exact-positive-integer? exact-integer?)
This procedure accepts the following arguments, in order:
An immutable hasheq. The meaning of the hash is the same as it is in fuzzy-search.
A haystack.
The length of the haystack.
The number of matches found in the haystack.
Like score/c, but accepts two more arguments that represent a key and the associated value from the hash table. This allows per-occurrence scoring when used with score/match.
value
value
(define score/forrest (score/all (λ _ 100) (score/all/leading-letter -5 -15) (score/all/unmatched-letter -1) (score/match (score/match/sequence 15) (score/match/first-letter 15) (score/match/pair (score/pair/camel-case 30) (score/pair/prefixed 30 '(#\space #\_))))))
syntax
(score-if t v)
(λ args (for/sum ([f (in-list procs)]) (apply f args)))
Useful for combining several score/c procedures.
procedure
v : exact-integer?
procedure
(score/all/leading-letter v limit) → score/c
v : exact-integer? limit : exact-positive-integer?
Use to base a penalty or reward on how far the first match is from the beginning of its haystack.
procedure
(score/match procedures ...) → score/c
procedures : score/match/c
procedure
(score/match/sequence v ...) → score/match/c
v : exact-integer?
The scoring procedure assigns a score of v if a given match (other than the first) starts immediately after the prior match in their haystack.
procedure
v : exact-integer?
That procedure returns a score of v if the given match starts at the beginning of the haystack.
procedure
(score/match/pair procedures ...) → score/match/c
procedures : score/pair/c
The returned procedure is also meant for use in score/match.
procedure
v : exact-integer?
That procedure returns a score of v if a given pair of characters changes from lowercase to uppercase. It returns a score of 0 otherwise.
procedure
(score/pair/prefixed v chars) → score/pair/c
v : exact-integer? chars : (listof char?)
That procedure returns a score of v if the first of a given pair of characters appears in chars. It returns a score of 0 otherwise.
3 Fuzzy Requires
(require fuzzy-search/require) | package: fuzzy-search |
syntax
(fuzzy needle ...)
The filesystem search entails use of find-files starting from the directory of the module using fuzzy, or (current-directory) if the source directory is unknown. Paths relative to the file using fuzzy are then treated as haystacks for fuzzy-search.
This is useful if you are prototyping a Racket package and are likely to shuffle files around. Normally when you move a file, all relative path require subforms will break. fuzzy forms may break. A break would occur if a needle targeting a file becomes ambiguous, or if the intended file moves out of the search scope.
(require fuzzy-search/require (fuzzy "legacy" ; legacy.rkt "widgets")) ; utils/widgets.rkt
Do not use fuzzy in user-facing code. Use it only to delay your commitment to concrete file paths. Even then, be careful not to use overly vague needles. (require (fuzzy "a")) is a dice roll.
4 Credits and Project Information
The implementation is based on Forrest Smith’s work with Sublime Text’s fuzzy search. That link will take you to his content and public domain source code. Thanks, Forrest!
You can find the source code of this package here.
My own contributions are as follows:
Port the JavaScript implementation of Forrest’s fuzzy match to Racket.
Refactor the code to allow scoring procedures and combinators as part of the public API.
Make case-sensitivity an option.
Expose match positions in program output.
Package for distribution on the Racket catalog.
Add the fuzzy require subform.
If you found my additions useful, then please consider sponsoring my work. I count on your support to continue writing libraries like this one.