9 Defining Languages
Rhombus not only supports macro extensions that add to the rhombus language, it also supports entirely new languages that are smaller than rhombus or that have different syntax and semantics. Languages are themselves implemented as Rhombus modules that follow a particular protocol.
The term language in Rhombus is used to refer to two different kinds of languages for two different contexts:
A language whose name is written after #lang for a module: This kind of language has full control over the parsing of the module’s body at the level of characters and bytes.
A language whose name is written after ~lang for a module form: This kind of language receives its module’s body already parsed at the character level into a syntax object. That parsing is performed by the enclosing module’s language.
These two kinds of languages are connected, because the result of a #lang-triggered parser is a module form (although at the Racket level), and so it includes a ~lang reference (or the equivalent at the Racket level). Furthermore, modules that implement a language are typically set up so that the language name works in both contexts, and the #lang use of the name generates a reference to the ~lang form of the name. For example, rhombus works both after #lang and in module after ~lang, and both uses of the name refer to the same set of bindings.
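As a small illustration of the two contexts, the following sketch uses rhombus in both positions within one file. The file name "two_contexts.rhm" and the printed strings are made up for illustration.

```
#lang rhombus
// hypothetical file "two_contexts.rhm"

// `rhombus` after #lang: parses this file's characters and supplies bindings
println("enclosing module")

module main ~lang rhombus:
  // `rhombus` after ~lang: this body was already parsed by the enclosing
  // module's language, so only bindings are supplied here
  println("main submodule")
```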
9.1 Module ~lang Protocol
A module that is intended to be used as a language selected by ~lang in module must export various bindings to work:
#{#%module-begin}: A Racket-level bridge to handle the module body. In a Rhombus-implemented language, this should normally be #{#%module-begin} from rhombus. Using that #{#%module-begin} obliges the module to also export #%module_block.
#%module_block: A Rhombus declaration form that is implicitly wrapped around a module’s body by #{#%module-begin}. The module body is received as a block. Exporting #%module_block from rhombus causes the module body to be treated the same as a sequence of declarations, definitions, and expressions in a rhombus module.
#{#%top-interaction}: A Racket-level bridge to handle forms evaluated in a read-eval-print loop (REPL). REPL evaluation is not mandatory, and if this binding is missing, then interactive evaluation is disabled. In a Rhombus-implemented language, this should normally be #{#%top-interaction} from rhombus. Using that #{#%top-interaction} obliges the module to also export #%interaction (or else interactive evaluation will still be disabled).
#%interaction: A Rhombus declaration form that is implicitly wrapped around interactive evaluation by #{#%top-interaction}. A form sequence to evaluate is received as a block. Exporting #%interaction from rhombus causes a REPL to work in the same way as for a rhombus module.
Other bindings as needed by the language, especially common forms like def and fun and implicit forms like #%call, #%parens, and #%literal. These bindings, too, are often reexported from rhombus.
For example, the following module defines a language that is like rhombus, but it replaces #%module_body to first print out the source of all forms in the module body. After printing, the body forms are evaluated the same way as in rhombus.
"noisy_rhombus.rhm"
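#lang rhombus
import: rhombus/meta open
export:
  all_from(rhombus):
    except #%module_block
  rename:
    module_block as #%module_block
decl.macro 'module_block: $form; ...':
  '#%module_block:
     println($(form.to_source_string()))
     ...
     $form
     ...'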
If that module is saved as "noisy_rhombus.rhm", then a module in the same directory can refer to it when declaring a main submodule:
"demo.rhm"
#lang rhombus
module main ~lang "noisy_rhombus.rhm":
  1 + 2 // prints "1 + 2" and then "3"
The name "noisy_rhombus.rhm" does not conform to the syntax of languages that can be written after #lang, and the "noisy_rhombus.rhm" module also doesn’t supply a character-level parser. One way to fill that gap, at least in the short term, is to use the shrubbery language, which parses a module body into shrubbery form and then uses the language module that is named immediately after #lang shrubbery:
"demo2.rhm"
#lang shrubbery "noisy_rhombus.rhm"
1 + 2 // prints "1 + 2" and then "3"
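A language that is smaller than rhombus can be sketched by exporting only the protocol bindings listed above plus a handful of forms. The file name "tiny_rhombus.rhm" and the particular selection of forms are assumptions for illustration. A practical language would likely need more implicit forms and operators than are shown here, and this sketch does not check whether the set is sufficient for any given program.

```
#lang rhombus
// hypothetical file "tiny_rhombus.rhm"

export:
  // Racket-level bridges, taken from rhombus
  #{#%module-begin}
  #{#%top-interaction}
  // Rhombus-level forms that the bridges wrap around bodies
  #%module_block
  #%interaction
  // a small, illustrative selection of ordinary and implicit forms
  def
  fun
  #%call
  #%parens
  #%literal
```

Because each name is exported individually instead of through all_from(rhombus), any rhombus binding that is not listed is simply unavailable to modules written in this language.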
9.2 Run-Time and Expand-Time Configuration
Although bindings can capture most details of a language definition, certain aspects of the compile-time and run-time environment span all languages that are used to construct a program, and so they must be configured in a different way. For example, the way that values should print may differ for a programmer who is working in terms of Rhombus versus one working in terms of Racket, even when printing is initiated by a library that is meant to be used from either language. Racket allows the main module for a program (e.g., the one provided on the command line) to configure run-time behavior, and it allows the language of a module being compiled to configure compile-time behavior. These configurations take the form of submodules:
A configure_runtime submodule is instantiated before its enclosing module when the enclosing module is the main module of a program. Instantiating the submodule is intended to have side effects that configure the environment.
More precisely, a #{configure-runtime} submodule is instantiated, because that is the Racket-level protocol, but the #%module_block form of rhombus arranges for a #{configure-runtime} submodule that depends on configure_runtime.
A configure_runtime submodule is relevant to any Rhombus module, not just a language module. The #%module_block form of rhombus not only adds #{configure-runtime} to trigger configure_runtime, it adds a configure_runtime submodule if one is not explicitly declared in a module body. The automatic configure_runtime submodule depends on rhombus/runtime_config, which configures the environment for working in Rhombus terms.
A language’s configure_runtime submodule is relevant when the language is selected for interactive evaluation in a REPL context, since the language module counts as the main module in that case.
A configure_expand submodule provides enter_parameterization and exit_parameterization functions that are used to configure the expand-time environment while a module using the language is expanded. Instead of a direct side effect, enter_parameterization returns a parameterization that is used while the module is being compiled, and exit_parameterization is called to obtain a more nested parameterization to use when compilation of a dependency is triggered.
More precisely, a #{configure-expand} submodule is instantiated, because that is the Racket-level protocol, but the #%module_block form of rhombus arranges for a #{configure-expand} submodule that uses configure_expand when the latter is present.
If a configure_expand submodule is not explicitly declared in a module body, the #%module_block form of rhombus does not add one automatically unless a reader submodule (described in #lang Language Protocol) is present. If a reader submodule is present and a configure_expand submodule is not, then #%module_block adds a configure_expand submodule that uses rhombus/expand_config.
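An explicitly declared configure_runtime submodule might look like the following sketch. It assumes that declaring the submodule with ~lang rhombus keeps it independent of the enclosing module, so that it can be instantiated first, and that importing rhombus/runtime_config performs the same configuration as the automatically added submodule. The extra println is only there to make the instantiation visible, and the file name is hypothetical.

```
#lang rhombus
// hypothetical file "configured.rhm"

module configure_runtime ~lang rhombus:
  // assumption: importing this module configures the environment
  // for working in Rhombus terms, as the automatic submodule does
  import:
    rhombus/runtime_config
  // illustrative side effect, instantiated only when the enclosing
  // module is the main module of a program
  println("configuring run-time environment")

println("module body")
```

When "configured.rhm" is imported by another module instead of being run as the main module, the submodule is not expected to be instantiated, so only the body's output would appear.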
9.3 #lang Language Protocol
A language name that follows #lang must have only alphanumeric ASCII, +, -, _, and/or / characters terminated by whitespace or an end-of-file. Thus, a language name cannot be a Rhombus string, but must instead be an unquoted module path that refers to a module in a collection.
Furthermore, the unquoted path is turned into a module path in a way that is different from a language name after ~lang in module or in an import form: a ".rkt" suffix is added instead of a ".rhm" suffix (after "/main" is added in the case that / does not appear in the path). Finally, a reader submodule is found within that module. As a fallback, when a reader submodule is not found, a ".rkt" suffix is replaced with "/lang/reader.rkt" and tried as a module path in place of a reader submodule. This fallback is discouraged for new Rhombus and Racket languages.
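The difference between the two interpretations of an unquoted name can be sketched as a function from a name and a suffix to a collection-relative file path. The function collection_path is a hypothetical helper written for this illustration, not part of Rhombus, and it covers only the suffix and "/main" rules described above, not the "/lang/reader.rkt" fallback.

```
#lang rhombus

// Hypothetical helper: expands an unquoted collection-based name into
// a file path, adding "/main" when the name contains no "/"
fun collection_path(name :: String, suffix :: String) :: String:
  if String.contains(name, "/")
  | name ++ suffix
  | name ++ "/main" ++ suffix

// module searched for a `reader` submodule by `#lang noisy_rhombus`
println(collection_path("noisy_rhombus", ".rkt"))
// module referenced by `noisy_rhombus` in `import` or after `~lang`
println(collection_path("noisy_rhombus", ".rhm"))
```

The two results differ only in their suffix, which is why a language name can resolve to different files depending on where it is written.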
The reader submodule protocol, which is defined at the Racket level, requires the submodule to export three functions: #{read}, #{read-syntax}, and #{get-info}. The Rhombus-based language rhombus/reader provides a streamlined interface that is convenient for defining Rhombus-like languages.
The key clause in a rhombus/reader module is ~lang followed by a module path for the ~lang-protocol module to use for the parsed module. The module path can be relative to the enclosing reader submodule, so parent serves as a reference to the enclosing module. The following example is the same as "noisy_rhombus.rhm" in Module ~lang Protocol, but with a reader submodule added, and saved as "main.rkt" in a "noisy_rhombus" directory (note the ".rkt" extension instead of ".rhm").
"noisy_rhombus/main.rkt"
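#lang rhombus
import: rhombus/meta open
module reader ~lang rhombus/reader:
  ~lang parent
export:
  all_from(rhombus):
    except #%module_block
  rename:
    module_block as #%module_block
decl.macro 'module_block: $form; ...':
  '#%module_block:
     println($(form.to_source_string()))
     ...
     $form
     ...'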
Assuming that "noisy_rhombus" has been registered as a collection (possibly by installing it as a package with raco pkg install noisy_rhombus/), then noisy_rhombus works as a language name immediately after #lang:
"demo3.rhm"
#lang noisy_rhombus
1 + 2 // prints "1 + 2" and then "3"
A small problem remains here, created by the mismatch between #lang’s interpretation of module names and the Rhombus import interpretation. The #lang interpretation of noisy_rhombus is lib("noisy_rhombus/main.rkt"), while the import interpretation is lib("noisy_rhombus/main.rhm"). Consequently, the following all_from does not work as expected:
"demo4.rhm"
#lang noisy_rhombus
export:
  all_from(noisy_rhombus) // no `lib("noisy_rhombus/main.rhm")`
In fact, the problem is not so much the #lang interpretation of noisy_rhombus as the use of parent in the reader module. Changing to
~lang "main.rhm"
causes a #lang noisy_rhombus module to use lib("noisy_rhombus/main.rhm") as the initially imported module, and we can create "noisy_rhombus/main.rhm" to reexport "noisy_rhombus/main.rkt":
"noisy_rhombus/main.rhm"
"main.rkt"
Those changes allow "demo4.rhm" to work, but a syntax error in "demo4.rhm" would be reported incorrectly, because "noisy_rhombus/main.rhm" has no configure_expand submodule. The rhombus/lang_bridge module helps complete the picture by reexporting and also propagating submodule definitions and exports.
"noisy_rhombus/main.rhm"
#lang rhombus/lang_bridge
~lang: "main.rkt"
Note that "noisy_rhombus/main.rhm" depends on "noisy_rhombus/main.rkt", while "noisy_rhombus/main.rkt" indirectly depends on "noisy_rhombus/main.rhm". This kind of cycle is allowed, because rhombus/reader delays its reference by quoting the ~lang module name.
In short, a best practice for defining #lang languages with Rhombus is:
Create or link a collection as a directory like "noisy_rhombus" (but with a more suitable name).
Export the language’s implementation from "main.rkt" in that directory.
Use rhombus/reader to define a reader submodule in "main.rkt".
Supply ~lang "main.rhm" in the reader submodule.
Create "main.rhm" with #lang rhombus/lang_bridge and use ~lang: "main.rkt" as its body.
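Putting the steps above together for the noisy_rhombus example gives roughly the following pair of files. This is a sketch that assumes the noisy language is implemented by printing each body form with to_source_string before handing the forms to the #%module_block of rhombus. Only the ~lang "main.rhm" clause and the rhombus/lang_bridge module are prescribed by the steps.

"noisy_rhombus/main.rkt"

```
#lang rhombus
import: rhombus/meta open

// reader submodule: parsed modules start from "main.rhm"
module reader ~lang rhombus/reader:
  ~lang "main.rhm"

export:
  all_from(rhombus):
    except #%module_block
  rename:
    module_block as #%module_block

// assumed implementation: print each form's source, then evaluate as usual
decl.macro 'module_block: $form; ...':
  '#%module_block:
     println($(form.to_source_string()))
     ...
     $form
     ...'
```

"noisy_rhombus/main.rhm"

```
#lang rhombus/lang_bridge
~lang: "main.rkt"
```

With this layout, noisy_rhombus refers to the same bindings whether it is written after #lang, after ~lang, or in an import or export form.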