trie
(require trie) | package: trie |
1 Introduction
Tries are a data structure for efficiently storing collections of potentially-overlapping pieces of information. For example, instead of storing the strings ‘bi’, ‘bike’, ‘bicycle’, and ‘bicycles’, it would be better to store ‘bi’, ‘ke’, ‘cycle’, ‘s’ with appropriate markers for how to rebuild the words.
A trie is represented as a hash but should be treated as an opaque struct for purposes of manipulation.
2 Definitions
An item is one piece of data stored in the trie. Items are represented as lists. For example, '("/" "home" "dstorrs"). It is fine to have an item with length 1.
An element is one component of an item. In the item example above, "/" is an element, "home" is an element, and "dstorrs" is an element.
Each element is either terminal or non-terminal. Terminal elements represent the end of an item. This matters because it means that a trie might in fact store all elements of an entry yet if the final element is not marked terminal then the trie will answer ‘no’ when asked if it contains the entry. For example, if the trie contains the item '("/" "home" "dstorrs" "writing") it still does not contain the item '("/" "home" "dstorrs") unless that item has been added separately.
A trie-node is the information associated with an element in the trie. Trie-nodes track three pieces of information: whether their element is terminal, whatever user-supplied data might be attached, and a trie containing the children of this node. In the '("/" "home" "dstorrs") example, "dstorrs" is a child of "home" is a child of "/". The name of the field that tracks children is ‘kids’ because it’s shorter.
The data field of a trie-node is considered not set if it is equal? to the current value of the trie-node-default-data parameter.
Trie-nodes are structs created with the struct-plus-plus module. They therefore have keyword constructors with default values and contracts that validate the arguments, as well as dotted accessors (e.g. trie-node.data) in addition to the standard ones (e.g. trie-node-data). All of the normal struct function exist, but you are encouraged to work with the new versions.
When using trie-add-item+data! it will sometimes be the case that the item specifies information for a node that already exists. In this case the existing node is called current and the one that is being added is new.
Whenever it is necessary to combine two nodes, either current or new will have priority, meaning that its values will win in the case of conflict. An example from general Racket code is that when doing hash-union the hash that comes last in the list has priority because its fields will override those of earlier hashes.
3 Synopsis
> (require trie) > (define root (make-trie-root)) > (trie? root) #t
> (void (trie-add-item! root (explode-path "/home/dstorrs/writing/patchwork-realms"))) > (void (trie-add-item! root (explode-path "/"))) > (void (trie-add-item! root (explode-path "/home/dstorrs/writing/two-year-emperor"))) > (void (trie-add-item! root (explode-path "/home/dstorrs/taxes/2018.pdf"))) > (void (trie-add-item! root (explode-path "/home/dstorrs/todo.txt"))) > (void (trie-add-item! root (explode-path "/home/bob/grocery-list.txt"))) > (void (trie-add-item! root (explode-path "/bin/bash"))) > (trie-contains? root (explode-path "/")) #t
> (trie-contains? root (explode-path "/home/dstorrs/todo.txt")) #t
> (trie-contains? root (explode-path "/tmp/test.txt")) #f
; The trie contains the node "/" with child "home" with child ; "dstorrs" with child "todo.txt". The "dstorrs" node is not terminal ; because we never added "/home/dstorrs" and therefore '("/", "home", "dstorrs") ; is not considered to be in the trie. > (trie-contains? root (explode-path "/home/dstorrs")) #f
; We can say "if this thing is there but not terminal, update it to be terminal". ; This is NOT the same as ‘if it is not there, add it'. > (trie-contains?/update! root (explode-path "/home/dstorrs")) #t
; After the previous command "/home/dstorrs" is considered to be in the trie. > (trie-contains? root (explode-path "/home/dstorrs")) #t
> (trie-get-elements root) '(#<path:/>)
> (define result (trie-unroll root #:pre path->string #:combine (compose1 path->string cleanse-path (curryr string-join "/")) #:sort (curryr sort string<?) #:post (λ (l) (displayln "Trie contains the following paths: ") (pretty-print l))))
Trie contains the following paths:
'("/"
"/bin/bash"
"/home/bob/grocery-list.txt"
"/home/dstorrs"
"/home/dstorrs/taxes/2018.pdf"
"/home/dstorrs/todo.txt"
"/home/dstorrs/writing/patchwork-realms"
"/home/dstorrs/writing/two-year-emperor")
; The final result is whatever came back from the ‘post' function. > (displayln (~a "Result of the unroll was: " (~v result))) Result of the unroll was: #<void>
; If you don't specify keyword args then you get the data back as a list of lists in no particular order. > (pretty-print (trie-unroll root))
'((#<path:/>)
(#<path:/> #<path:home> #<path:dstorrs> #<path:todo.txt>)
(#<path:/> #<path:home> #<path:dstorrs>)
(#<path:/> #<path:bin> #<path:bash>)
(#<path:/> #<path:home> #<path:bob> #<path:grocery-list.txt>)
(#<path:/>
#<path:home>
#<path:dstorrs>
#<path:writing>
#<path:patchwork-realms>)
(#<path:/> #<path:home> #<path:dstorrs> #<path:taxes> #<path:2018.pdf>)
(#<path:/>
#<path:home>
#<path:dstorrs>
#<path:writing>
#<path:two-year-emperor>))
4 Tries and Data
Often, you want to store your dataset compactly but decorate parts of it with arbitrary data. For example, you might want to store the fact that "/home/dstorrs" is the user directory for David K. Storrs, that "/home/dstorrs/writing/two-year-emperor" is a novel and "/home/dstorrs/writing/baby-blues" is a novella. Additionally, you might want to add several overlapping items at the same time. trie-add-item+data! allows this.
> (require trie) > (define the-trie (make-trie-root))
> (trie-add-item+data! the-trie (list "/" (cons "home" (trie-node++ #:data 'all-users)) (cons "dstorrs" (trie-node++ #:terminal? #t #:data (hash 'user "David K. Storrs"))) "writing" (cons "two-year-emperor" (trie-node++ #:terminal? #t #:data (cons 'type 'novel))))) '#hash(("/" . #s((trie-node #(0 1 2)) #f #<void> #hash(("home" . #s((trie-node #(0 1 2)) #f all-users #hash(("dstorrs" . #s((trie-node #(0 1 2)) #t #hash((user . "David K. Storrs")) #hash(("writing" . #s((trie-node #(0 1 2)) #f #<void> #hash(("two-year-emperor" . #s((trie-node #(0 1 2)) #t (type . novel) #hash())))))))))))))))
> (trie-add-item+data! the-trie #:combine 'replace (list "/" (cons "home" (trie-node++ #:data 'all-users)) (cons "dstorrs" (trie-node++ #:terminal? #t #:data (hash 'user "David K. Storrs"))) "writing" (cons "two-year-emperor" (trie-node++ #:terminal? #t #:data (cons 'type 'novel))))) '#hash(("/" . #s((trie-node #(0 1 2)) #f #<void> #hash(("home" . #s((trie-node #(0 1 2)) #f all-users #hash(("dstorrs" . #s((trie-node #(0 1 2)) #t #hash((user . "David K. Storrs")) #hash(("writing" . #s((trie-node #(0 1 2)) #f #<void> #hash(("two-year-emperor" . #s((trie-node #(0 1 2)) #t (type . novel) #hash())))))))))))))))
5 API
parameter
(trie-node-default-data default) → void? default : any/c
= (void)
procedure
(trie-node++ [ #:terminal? terminal? #:data data #:kids kids]) → trie-node? terminal? : boolean? = #f data : any/c = (trie-node-default-data) kids : trie? = (make-trie-root)
procedure
(trie-node.terminal? node) → boolean?
node : trie-node? (trie-node.data node) → any/c node : trie-node? (trie-node.kids node) → trie? node : trie-node?
procedure
(make-trie-root) → trie?
procedure
(trie-is-empty? arg) → boolean?
arg : trie?
procedure
(clear-trie! arg) → trie?
arg : trie?
procedure
(trie-get-elements arg) → (listof any/c)
arg : trie?
procedure
(trie-add-item! arg item) → trie?
arg : trie? item : (listof any/c)
procedure
(trie-contains? arg item) → boolean?
arg : trie? item : (listof any/c)
procedure
(trie-contains?/update! arg item [ #:update? update?]) → boolean? arg : trie? item : (listof any/c) update? : boolean? = #t
procedure
(trie-add-item+data! arg the-item [ #:combine combine-method]) → trie? arg : trie? the-item : (listof any/c)
combine-method : (or/c 'keep 'replace 'meld/current 'meld/new (-> trie-node? trie-node? trie-node?)) = 'meld/new
If the last value in arg specifies a trie-node then that trie-node must have its terminal? value set to #t or an error will be thrown.
The important part is to understand what happens when a trie-node is given for an element that already exists in the trie–an example would be:
(define root (make-trie-root)) ; Add an item but do not put any data with the elements. This is effectively ; the same as using 'trie-add-item!' (trie-add-item+data! root '("/" "home" "dstorrs")) ; The trie contains: '(("/" "home" "dstorrs")) ; Add another item that overlaps with the previous one but does not specify extra information (trie-add-item+data! root '("/" "home" "dstorrs" "writing")) ; The trie contains: '(("/" "home" "dstorrs") ; ("/" "home" "dstorrs" "writing")) ; Add a third item that overlaps and has associated data. We want to annotate ; the fact that "/home" is a directory, not a file. (trie-add-item+data! root (list "/" (cons "home" (trie-node++ #:terminal? #t #:data 'dir)))) ; The trie contains: '(("/" "home" "dstorrs") ; ("/" "home" "dstorrs" "writing") ; ("/" "home")) ; In that last call we provided a trie-node of data that we want ; added to the "home" element. That element already exists and ; has a trie-node. Therefore, we must merge current and new.
When merging nodes:
If the combine-method argument is a procedure then the current and new nodes will be passed to it and the resulting trie-node will be used.
If the combine-method is not a procedure, then we first determine priority based on the value of combine-method:
'keep : current has priority.
'meld/current : current has priority.
'replace : new has priority.
'meld/new : new has priority.
After making this determination we update the fields of the existing node as follows:
- terminal?:
Set to #t if either of current or new is marked as terminal. This ensures that we don’t accidentally remove an item from the trie by setting its final element to be non-terminal.
- kids:
Take the union of the kids for the two nodes. The node with priority overrides conflicting elements in the other. WARNING: This can cause items to be deleted from the trie if they are overwritten by a new entry. In general, it’s better to leave the kids field blank when using trie-add-item+data!.
- data:
Under the 'keep or 'replace methods, use the value from the node that has priority.
- Under the 'meld/current or 'meld/new methods, combine the items as follows:
If the data fields in both nodes are the same, do nothing.
If neither node has its data field set, do nothing. (This is a special case of the prior rule.)
If exactly one of the nodes has its data field set, use that one.
If both values are hashes, take their union, allowing the node that has priority to overwrite conflicting entries from the other node.
Otherwise, return a list containing both values where the first element is from the node that has priority.
One edge case: ‘The data field is not set’ simply means that its value is equal? to the current value of the trie-node-default-data parameter. If the value of trie-node-default-data changed after the trie-node was created then the status of the node as un/set may be misidentified. (The parameter’s value might have changed either because it was assigned to or because you entered/left a parameterize block.)
procedure
(trie-unroll+data the-trie [ #:pre pre #:combine combine #:sort sort-func #:post post]) → any the-trie : trie? pre : (-> (cons/c any/c trie-node?) any/c) = identity combine : (-> list? any/c) = identity sort-func : (-> list? list?) = identity post : procedure? = identity
The pre function is called on each element as it is processed. It is passed a cons pair containing the element and a newly-created trie node that includes the data and terminal? values for that node.
The combine function is called on each item once it has been assembled.
The sort-func function is called on the list of items.
The post function is called on the sorted list of items.
The final result of trie-unroll+data is whatever comes back from the post function.
> (require trie) > (define root (make-trie-root))
> (trie-add-item+data! root (list (build-path "/") (build-path "home") (cons (build-path "dstorrs") (trie-node++ #:terminal? #t #:data (hash 'personal-name "David K. Storrs"))))) '#hash((#<path:/> . #s((trie-node #(0 1 2)) #f #<void> #hash((#<path:home> . #s((trie-node #(0 1 2)) #t #<void> #hash((#<path:dstorrs> . #s((trie-node #(0 1 2)) #t #hash((personal-name . "David K. Storrs")) #hash((#<path:writing> . #s((trie-node #(0 1 2)) #t #<void> #hash()))))))))))))
> (trie-add-item+data! root (explode-path "/home/dstorrs/writing")) '#hash((#<path:/> . #s((trie-node #(0 1 2)) #f #<void> #hash((#<path:home> . #s((trie-node #(0 1 2)) #t #<void> #hash((#<path:dstorrs> . #s((trie-node #(0 1 2)) #t #hash((personal-name . "David K. Storrs")) #hash((#<path:writing> . #s((trie-node #(0 1 2)) #t #<void> #hash()))))))))))))
> (trie-add-item+data! root (explode-path "/home")) '#hash((#<path:/> . #s((trie-node #(0 1 2)) #f #<void> #hash((#<path:home> . #s((trie-node #(0 1 2)) #t #<void> #hash((#<path:dstorrs> . #s((trie-node #(0 1 2)) #t #hash((personal-name . "David K. Storrs")) #hash((#<path:writing> . #s((trie-node #(0 1 2)) #t #<void> #hash()))))))))))))
> (pretty-print (trie-unroll+data root))
'(((#<path:/> . #s((trie-node #(0 1 2)) #f #<void> #hash()))
(#<path:home> . #s((trie-node #(0 1 2)) #t #<void> #hash()))
(#<path:dstorrs>
.
#s((trie-node #(0 1 2))
#t
#hash((personal-name . "David K. Storrs"))
#hash()))
(#<path:writing> . #s((trie-node #(0 1 2)) #t #<void> #hash())))
((#<path:/> . #s((trie-node #(0 1 2)) #f #<void> #hash()))
(#<path:home> . #s((trie-node #(0 1 2)) #t #<void> #hash()))
(#<path:dstorrs>
.
#s((trie-node #(0 1 2))
#t
#hash((personal-name . "David K. Storrs"))
#hash())))
((#<path:/> . #s((trie-node #(0 1 2)) #f #<void> #hash()))
(#<path:home> . #s((trie-node #(0 1 2)) #t #<void> #hash()))))
> (pretty-print (trie-unroll+data root #:pre car))
'((#<path:/> #<path:home> #<path:dstorrs>)
(#<path:/> #<path:home>)
(#<path:/> #<path:home> #<path:dstorrs> #<path:writing>))
> (pretty-print (trie-unroll+data root #:pre (compose1 path->string car))) '(("/" "home" "dstorrs" "writing") ("/" "home") ("/" "home" "dstorrs"))
> (pretty-print (trie-unroll+data root #:pre (λ (e) (cons (path->string (car e)) (cdr e))) #:combine (λ (the-item) (cons (path->string (cleanse-path (string-join (map car the-item) "/"))) (filter-not (curry equal? (trie-node-default-data)) (map (compose1 trie-node.data cdr) the-item)))) #:sort (curryr sort string<? #:key car) #:post (λ (items) (define (write-to-database x) 'TODO) (write-to-database items) items)))
'(("/home")
("/home/dstorrs" #hash((personal-name . "David K. Storrs")))
("/home/dstorrs/writing" #hash((personal-name . "David K. Storrs"))))
Obviously, it’s perfectly possible to use the basic form of trie-unroll+data and do all of the processing separately after the fact. The advantage to doing it through the keywords is that everything is part of the same S-expression and therefore both closer to the data and less likely to get forgotten if one piece of the code is refactored elsewhere.