
4.2 Encodings and Locales

When a character-based operation is used on a port, such as Port.Input.read_char or shrubbery.read, the port’s bytes are read and interpreted as a UTF-8 encoding of characters. Thus, reading a single character may require reading multiple bytes, and an operation like Port.Input.peek_char may need to peek several bytes into the stream to determine whether a character is available. In the case of a byte stream that does not correspond to a valid UTF-8 encoding, operations such as Port.Input.read_char may need to peek one byte ahead in the stream to discover that the stream is not a valid encoding.
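
To make the byte-versus-character distinction concrete, here is a minimal sketch in Python (not Rhombus). The helper name read_one_char is hypothetical, the sketch assumes the stream holds valid UTF-8, and it is not meant to describe how Port.Input.read_char is implemented. It only shows that the first byte of a character determines how many more bytes must be read.

```python
import io

def read_one_char(port):
    """Hypothetical helper: read one UTF-8 character from a binary stream.

    Assumes the stream is valid UTF-8; returns "" at end of stream.
    """
    first = port.read(1)
    if not first:
        return ""
    lead = first[0]
    # The lead byte announces how many continuation bytes follow.
    if lead < 0x80:
        extra = 0
    elif 0xC0 <= lead <= 0xDF:
        extra = 1
    elif 0xE0 <= lead <= 0xEF:
        extra = 2
    elif 0xF0 <= lead <= 0xF7:
        extra = 3
    else:
        raise ValueError("not a valid UTF-8 lead byte")
    return (first + port.read(extra)).decode("utf-8")

port = io.BytesIO("aλ€".encode("utf-8"))
consumed = []
for _ in range(3):
    start = port.tell()
    ch = read_one_char(port)
    # Record each character with the number of bytes it consumed.
    consumed.append((ch, port.tell() - start))
```

In this sketch, the three characters consume one, two, and three bytes respectively, which is why a character-level peek may have to look several bytes into the stream.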

When an input port produces a sequence of bytes that is not a valid UTF-8 encoding in a character-reading context, the bytes that constitute an invalid sequence are converted to the character Char"\uFFFD". Specifically, bytes 255 and 254 are always converted to Char"\uFFFD", bytes in the range 192 to 253 produce Char"\uFFFD" when they are not followed by bytes that form a valid UTF-8 encoding, and bytes in the range 128 to 191 are converted to Char"\uFFFD" when they are not part of a valid encoding that was started by a preceding byte in the range 192 to 253. To put it another way, when reading a sequence of bytes as characters, a minimal set of bytes is changed to the encoding of Char"\uFFFD" so that the entire sequence of bytes is a valid UTF-8 encoding.
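
The replacement rule can be sketched in Python (not Rhombus) as a decoder that substitutes U+FFFD one byte at a time. The function name decode_per_byte is hypothetical, and the sketch is one reading of the rule stated above, using Python's strict UTF-8 decoder to decide what counts as a valid encoding. It is not the Rhombus implementation, and Python's built-in errors="replace" mode may group invalid bytes differently.

```python
REPLACEMENT = "\uFFFD"

def decode_per_byte(data):
    """Hypothetical helper: decode bytes as UTF-8, replacing each byte
    that is not part of a valid encoding with U+FFFD."""
    out = []
    i = 0
    while i < len(data):
        decoded = None
        chunk = b""
        # A valid UTF-8 encoding of one character is 1 to 4 bytes long.
        for length in range(1, 5):
            chunk = data[i:i + length]
            try:
                decoded = chunk.decode("utf-8")  # strict decoding
            except UnicodeDecodeError:
                continue
            break
        if decoded is not None:
            out.append(decoded)
            i += len(chunk)
        else:
            # No valid encoding starts here: replace this one byte only.
            out.append(REPLACEMENT)
            i += 1
    return "".join(out)

# 0xFF is never valid; the trailing 0xCE lacks its continuation byte.
sample = b"a\xffb\xce"
result = decode_per_byte(sample)
```

Valid characters on either side of a bad byte are preserved, so only the offending bytes are replaced.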

See Encoding and Conversion for facilities to work with UTF-8 or other encodings. See also port.reencode_input and port.reencode_output for obtaining a UTF-8-based port from one that uses a different encoding of characters.

A locale captures information about a user’s language-specific interpretation of character sequences. In particular, a locale determines how strings are “alphabetized,” how a lowercase character is converted to an uppercase character, and how strings are compared without regard to case. String operations using the StringCI veneer are not sensitive to the current locale, but operations using the StringLocale or StringLocaleCI veneer produce results consistent with the current locale.
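
As a rough analogy in Python (not Rhombus), the sketch below contrasts a locale-independent, case-insensitive ordering with an ordering that goes through the C library's locale collation. The pairing with StringCI and StringLocale is only an analogy. The "C" locale is chosen here solely because it is available on every platform; other locale names and their orderings depend on the system configuration.

```python
import locale

words = ["b", "a", "B", "A"]

# Locale-independent, case-insensitive ordering. The sort is stable,
# so strings that differ only in case keep their input order.
by_casefold = sorted(words, key=str.casefold)

# Locale-sensitive ordering through the C library's collation rules.
# Assumption: "C" is used only as a portable stand-in for a real user locale.
locale.setlocale(locale.LC_COLLATE, "C")
by_locale = sorted(words, key=locale.strxfrm)
```

With a different locale installed, the second ordering could change while the first would not.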

A locale also designates a particular encoding of code-point sequences into byte sequences. Rhombus generally ignores this aspect of the locale, with a few notable exceptions: command-line arguments passed to Rhombus as byte strings are converted to character strings using the locale’s encoding; command-line strings passed to other processes (through subprocess.run) are converted to byte strings using the locale’s encoding; environment variables are converted to and from strings using the locale’s encoding; filesystem paths are converted to and from strings (for display purposes) using the locale’s encoding; and, finally, Rhombus provides operations such as String.locale_bytes and Bytes.locale_string to specifically invoke a locale-specific encoding.
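
The idea of a locale-specific encoding can be illustrated in Python (not Rhombus). The helper names to_locale_bytes and from_locale_bytes are hypothetical and only loosely mirror the role of String.locale_bytes and Bytes.locale_string. The sketch assumes Python's locale.getpreferredencoding as the source of the locale's encoding.

```python
import locale

def to_locale_bytes(text):
    # Hypothetical helper: encode a string using the locale's encoding.
    return text.encode(locale.getpreferredencoding(False))

def from_locale_bytes(data):
    # Hypothetical helper: decode bytes using the locale's encoding.
    return data.decode(locale.getpreferredencoding(False))

# ASCII-only text is used so the round trip does not depend on
# which encoding the current locale happens to designate.
encoded = to_locale_bytes("hello")
roundtrip = from_locale_bytes(encoded)
```

Non-ASCII text may encode to different bytes, or fail to encode, depending on the locale in effect.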

A Unix user selects a locale by setting environment variables, such as LC_ALL. On Windows and Mac OS, the operating system provides other mechanisms for setting the locale. Within Rhombus, the current locale can be changed by setting the bytes.current_locale parameter. The locale name within Rhombus is a string, and the available locale names depend on the platform and its configuration, but the "" locale means the current user’s default locale; on Windows and Mac OS, the encoding for "" is always UTF-8, and locale-sensitive operations use the operating system’s native interface. (In particular, setting the LC_ALL and LC_CTYPE environment variables does not affect the locale "" on Mac OS. Use envvars.getenv and bytes.current_locale to explicitly install the environment-specified locale, if desired.) Setting the current locale to #false makes locale-sensitive operations locale-insensitive, which means using the Unicode mapping for case operations and using UTF-8 for encoding.
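
As a sketch of how an environment-specified locale name might be determined before installing it, the Python (not Rhombus) function below follows the usual POSIX precedence for character handling: LC_ALL first, then LC_CTYPE, then LANG. The function name locale_name_from_env is hypothetical, and the sketch does not call any Rhombus API. The chosen name would then be handed to whatever mechanism installs the locale.

```python
import os

def locale_name_from_env(env):
    """Hypothetical helper: return the locale name selected by the
    environment, or None if no relevant variable is set.

    Empty values are treated as unset, following POSIX convention.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return None

# Works with any mapping, including the process environment.
chosen = locale_name_from_env(os.environ)
```

A result of None means the environment expresses no preference, in which case falling back to the user's default locale is one reasonable choice.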