6 Document Segments
Segments are an approach to dividing the text of a TEI document into a linear stream of logical groupings that share certain metadata, such as location and authorship. Dividing a TEI document into such groupings is a common requirement for many applications: search (as in term-search) has been our initial motivating use-case, but the same process is needed to, for example, plot trends in the use of a particular term over the course of a book.
This library provides tei-document-segments, which implements the common functionality needed to divide a TEI document into segments. It also defines an extensible interface for working with segment metadata.
6.1 Segment Basics
A segment value represents a contiguous logical subdivision of a TEI document. While the XML structure of TEI documents involves nested and overlapping hierarchies, segments present a linear view of a document.
To a first approximation, a segment might correspond to a paragraph. All of the textual content that falls within a segment shares the same metadata: for example, a segment might come from chapter one, pages 2–3, and have been written by Paul Ricœur. In fact, segments can be more granular than paragraphs: a paragraph with a footnote in the middle might be divided into several segments. On the other hand, in a TEI document for which we have not yet completed paragraph inference (see tei-document-guess-paragraphs), segments might be based on page breaks and could be longer than a paragraph.
As the above suggests, a segment is associated with a specific TEI document: not just with the identity of the instance, as might be determined by instance-title/symbol, but with the state of the TEI document itself, as reflected by tei-document-checksum.
Segments are a general, extensible way of managing this contextual information: concrete applications are likely to implement specialized representations, and these can support the segment interface using prop:segment.
This library defines two built-in kinds of segments: base segments and segment metadata values.
procedure
(tei-document-segments doc) → (listof base-segment?)
doc : tei-document?
procedure
(base-segment? v) → any/c
v : any/c
match expander
(base-segment meta-pat body-pat maybe-info-pat)
maybe-info-pat =
| plain-instance-info-pat
procedure
(base-segment-meta seg) → segment-meta?
seg : base-segment?
procedure
(base-segment-body seg)
→ (and/c string-immutable/c #px"[^\\s]")
seg : base-segment?
procedure
seg : base-segment?
A base segment can be used with the instance info interface to access bibliographic information about the instance represented by the TEI document from which it was created.
In addition to metadata, a base segment also contains the full textual data of the segment. The segment interface does not require this, and most other kinds of segment values will likely not store the text.
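As an illustration of why carrying the full text can be useful, the following minimal sketch counts occurrences of a term in each segment, which is the kind of data a trend plot would need. It does not use this library: demo-segment, its position field, and term-counts are hypothetical stand-ins for a base segment, its place in the document, and an application-level helper.

```racket
#lang racket/base

;; Hypothetical stand-in for a base segment: `position` plays the role of
;; the segment's place in the document, `body` the role of its text.
(struct demo-segment (position body) #:transparent)

;; term-counts : string (listof demo-segment) -> (listof (cons/c natural string-count))
;; Pairs each segment's position with the number of case-sensitive,
;; non-overlapping occurrences of `term` in its body.
(define (term-counts term segs)
  (for/list ([seg (in-list segs)])
    (cons (demo-segment-position seg)
          (length (regexp-match* (regexp-quote term)
                                 (demo-segment-body seg))))))
```

With the real library, the body text would presumably come from base-segment-body rather than from a stand-in field.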
6.2 Working with Segments
procedure
(segment-get-meta seg) → segment-meta?
seg : segment?
procedure
(segment-meta? v) → any/c
v : any/c
procedure
(segment-meta=? a b) → boolean?
a : segment?
b : segment?
In addition to being the minimal representation of a segment, segment metadata values can be serialized with racket/serialize.
The function segment-meta=? tests segments for equality based on their segment metadata values: it will consider segments of different specific types “the same” if they have equivalent segment metadata values. Any segments that are segment-meta=? can be used interchangeably for the purposes of the functions documented in this section.
match expander
(segment kw-pat ...)
kw-pat = #:title/symbol title/symbol-pat
| #:checksum checksum-pat
| #:counter counter-pat
| #:resp-string resp-string-pat
| #:page-spec page-spec-pat
| #:location-stack location-stack-pat
Each keyword may appear at most once.
procedure
seg : segment?
procedure
(segment-by-ricoeur? seg) → boolean?
seg : segment?
The function segment-resp-string returns a string, suitable for display to end-users, naming the “responsible party” for the segment (such as Ricœur, an editor, or a translator). Internally, it obtains this string using lower-level functions such as tei-element-resp and instance-get-resp-string.
The predicate segment-by-ricoeur? recognizes only segments by Ricœur himself.
procedure
(segment-page-spec seg) → page-spec/c
seg : segment?
value
=
(or/c (maybe/c string-immutable/c) (list/c (maybe/c string-immutable/c) (maybe/c string-immutable/c)))
If the returned value is a two-element list, the segment spans more than one page: the first element of such a list represents the page on which the segment starts, and the second element the page on which it ends. Otherwise, if the returned value is not a list, the segment is fully contained in a single page, and the value represents that page.
In either case, a value of (nothing) signifies that the pb element it represents was not numbered (i.e. it had no n attribute). A just value contains the page number, taken from the value of the corresponding pb’s n attribute.
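The two shapes of a page spec can be handled with a single match. The sketch below is self-contained and does not use this library: the just and nothing structs are local stand-ins for the maybe values described above, and page->string, page-spec->string, and the "[unnumbered]" label are hypothetical names and choices made for illustration.

```racket
#lang racket/base
(require racket/match)

;; Local stand-ins for maybe values, so the sketch runs on its own.
(struct just (value) #:transparent)
(struct nothing () #:transparent)

;; page->string : maybe -> string
;; A `just` carries the page number; `nothing` marks an unnumbered page.
(define (page->string pg)
  (match pg
    [(just n) n]
    [(nothing) "[unnumbered]"]))

;; page-spec->string : page-spec -> string
;; A two-element list means the segment spans from one page to another;
;; any other value is a single page.
(define (page-spec->string spec)
  (match spec
    [(list start end)
     (string-append "pp. " (page->string start) "-" (page->string end))]
    [single
     (string-append "p. " (page->string single))]))
```

The list case is tried first so that a spanning segment is never mistaken for a single page.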
procedure
seg : segment?
procedure
(location-stack->strings location-stack)
→ (listof string-immutable/c)
location-stack : location-stack?
procedure
(location-stack? v) → any/c
v : any/c
A location stack can be converted to a list of strings suitable for display to end-users via location-stack->strings. The strings in the resulting list describe the location from the broadest level of organization to the narrowest (e.g. '("Chapter 1" "Footnote 3"), though the precise textual content of the returned strings is unspecified).
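Because the strings run from broadest to narrowest, they can be joined in order to form a breadcrumb-style label. The following minimal sketch operates on a plain list of strings of the kind location-stack->strings is described as returning. The helper name location-strings->label, the " > " separator, and the fallback for an empty list are assumptions for illustration.

```racket
#lang racket/base
(require racket/string)

;; location-strings->label : (listof string) -> string
;; Joins location strings, broadest first, into one display label.
;; The empty-list case is handled defensively; whether an empty list
;; can actually occur is not specified above.
(define (location-strings->label strs)
  (if (null? strs)
      "(no location information)"
      (string-join strs " > ")))
```

Since the precise content of the strings is unspecified, a label built this way is suitable for display only, not for parsing.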
procedure
(segment-title/symbol seg) → symbol?
seg : segment?
procedure
(segment-document-checksum seg) → symbol?
seg : segment?
value
Sorting segments according to segment-order’s less-than relation places them in the order in which they occurred in the source TEI document.
procedure
(segment-counter seg) → natural-number/c
seg : segment?
6.3 Implementing New Types of Segments
value
prop:segment :
(struct-type-property/c (-> any/c segment-meta?))
The value for the property must be a function that accepts an instance of the new structure type and returns a segment metadata value. An instance of the new structure type will satisfy segment? and can be used with any of the functions above equivalently to using the returned segment metadata value directly.
The function given as a value for prop:segment should always return the very same segment metadata value when called with the same argument. This invariant is not currently checked, but may be in the future.
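The mechanism can be sketched without the library by using a locally defined struct type property. In the sketch below, prop:demo-segment, demo-meta, demo-segment-get-meta, and demo-segment-meta=? are hypothetical stand-ins for prop:segment, segment metadata values, segment-get-meta, and segment-meta=?. The structs search-hit and trend-point are invented examples of specialized representations. Storing the metadata in a field and returning that field is one simple way to return the very same value on every call.

```racket
#lang racket/base

;; Stand-in for a segment metadata value.
(struct demo-meta (title/symbol counter) #:transparent)

;; Stand-in for prop:segment: the property value is a function from an
;; instance of the structure type to its metadata.
(define-values (prop:demo-segment demo-segment? demo-segment-ref)
  (make-struct-type-property 'demo-segment))

;; Analogue of segment-get-meta: look up the function, apply it to `seg`.
(define (demo-segment-get-meta seg)
  ((demo-segment-ref seg) seg))

;; Analogue of segment-meta=?: compares metadata only, ignoring the
;; concrete structure types of the two arguments.
(define (demo-segment-meta=? a b)
  (equal? (demo-segment-get-meta a) (demo-segment-get-meta b)))

;; Two specialized representations that keep their metadata in a field,
;; so the property function always returns the same stored value.
(struct search-hit (meta excerpt)
  #:property prop:demo-segment
  (lambda (this) (search-hit-meta this)))

(struct trend-point (meta count)
  #:property prop:demo-segment
  (lambda (this) (trend-point-meta this)))
```

A property function that built a fresh metadata value on each call would break the invariant described above, even if the values it returned were equal.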