Qore Programming Language Reference Manual  0.8.11.1
Strings and Character Encoding

Overview

The Qore language is character-encoding aware. All strings are assumed to have the default character encoding unless the program explicitly specifies another encoding for certain objects and operations. Every Qore string has a character encoding ID attached to it, so when another encoding is required, Qore attempts an encoding translation.

Qore uses the operating system's iconv library functions to perform any encoding conversions.

Qore supports character encodings that are backwards compatible with 7-bit ASCII. This includes all ISO-8859-* character encodings, UTF-8, KOI8-R, KOI8-U, and KOI7, among others (see the table below: Character Encodings Known to Qore).

However, multibyte character encodings are currently only properly supported for UTF-8. For UTF-8 strings, length(), index(), rindex(), substr(), reverse(), the splice operator, print formatting (with respect to field lengths) in functions and methods taking format strings, and the regular expression operators and functions all work with character offsets, which may differ from byte offsets. For all character encodings other than UTF-8, a relationship of 1 byte = 1 character is assumed.
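The difference between character offsets and byte offsets can be illustrated with a short sketch. This is written in Python rather than Qore and does not use any Qore API; it only shows why the two kinds of offset diverge for UTF-8 text and coincide for a single-byte encoding.

```python
# Python illustration (not Qore code) of character offsets vs. byte offsets.
text = "Füße"                       # 4 characters: F, ü, ß, e
raw = text.encode("utf-8")          # ü and ß each occupy 2 bytes in UTF-8

char_len = len(text)                # length counted in characters
byte_len = len(raw)                 # length counted in bytes

char_index = text.index("e")        # offset of "e" counted in characters
byte_index = raw.index(b"e")        # offset of "e" counted in bytes

# In a single-byte encoding such as ISO-8859-1, 1 byte = 1 character.
latin1_len = len(text.encode("iso-8859-1"))

print(char_len, byte_len, char_index, byte_index, latin1_len)
```

Here the character-based length and index (4 and 3) differ from the byte-based ones (6 and 5), while the ISO-8859-1 byte length matches the character count.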

Qore will accept any encoding name given to it, even if it is not a known encoding name or alias. In this case, Qore will tag the strings with this encoding and pass the user-defined encoding name to the iconv library when encodings must be converted. This allows programmers to use encodings known by the system's iconv library but unknown to Qore. For such encodings, Qore will assume that the strings are backwards compatible with ASCII, meaning that one character is represented by one byte and that the strings are null-terminated.

Note that when Qore matches an encoding name to a code or alias in the following table, the comparison is not case-sensitive.

Character Encodings Known to Qore

Code | Aliases | Description
ISO-8859-1 | ISO88591, ISO8859-1, ISO-88591, ISO8859P1, ISO81, LATIN1, LATIN-1 | latin-1, Western European character set
ISO-8859-2 | ISO88592, ISO8859-2, ISO-88592, ISO8859P2, ISO82, LATIN2, LATIN-2 | latin-2, Central European character set
ISO-8859-3 | ISO88593, ISO8859-3, ISO-88593, ISO8859P3, ISO83, LATIN3, LATIN-3 | latin-3, Southern European character set
ISO-8859-4 | ISO88594, ISO8859-4, ISO-88594, ISO8859P4, ISO84, LATIN4, LATIN-4 | latin-4, Northern European character set
ISO-8859-5 | ISO88595, ISO8859-5, ISO-88595, ISO8859P5, ISO85 | Cyrillic character set
ISO-8859-6 | ISO88596, ISO8859-6, ISO-88596, ISO8859P6, ISO86 | Arabic character set
ISO-8859-7 | ISO88597, ISO8859-7, ISO-88597, ISO8859P7, ISO87 | Greek character set
ISO-8859-8 | ISO88598, ISO8859-8, ISO-88598, ISO8859P8, ISO88 | Hebrew character set
ISO-8859-9 | ISO88599, ISO8859-9, ISO-88599, ISO8859P9, ISO89, LATIN5, LATIN-5 | latin-5, Turkish character set
ISO-8859-10 | ISO885910, ISO8859-10, ISO-885910, ISO8859P10, ISO810, LATIN6, LATIN-6 | latin-6, Nordic character set
ISO-8859-11 | ISO885911, ISO8859-11, ISO-885911, ISO8859P11, ISO811 | Thai character set
ISO-8859-13 | ISO885913, ISO8859-13, ISO-885913, ISO8859P13, ISO813, LATIN7, LATIN-7 | latin-7, Baltic rim character set
ISO-8859-14 | ISO885914, ISO8859-14, ISO-885914, ISO8859P14, ISO814, LATIN8, LATIN-8 | latin-8, Celtic character set
ISO-8859-15 | ISO885915, ISO8859-15, ISO-885915, ISO8859P15, ISO815, LATIN9, LATIN-9 | latin-9, Western European with euro symbol
ISO-8859-16 | ISO885916, ISO8859-16, ISO-885916, ISO8859P16, ISO816, LATIN10, LATIN-10 | latin-10, Southeast European character set
KOI7 | n/a | Russian: Kod Obmena Informatsiey, 7 bit characters
KOI8-R | KOI8R | Russian: Kod Obmena Informatsiey, 8 bit
KOI8-U | KOI8U | Ukrainian: Kod Obmena Informatsiey, 8 bit
US-ASCII | ASCII, USASCII | 7-bit ASCII character set
UTF-8 | UTF8 | variable-width universal character set
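
The matching rules described before the table (case-insensitive comparison against codes and aliases, with unknown names accepted unchanged) can be sketched as follows. This is a Python illustration only, not Qore's implementation. The names KNOWN_ENCODINGS and resolve_encoding are hypothetical, and the dictionary holds just a few rows copied from the table.

```python
# Python illustration (not Qore code); names here are hypothetical.
# A small subset of the table: canonical code -> list of aliases.
KNOWN_ENCODINGS = {
    "ISO-8859-1": ["ISO88591", "ISO8859-1", "ISO-88591", "ISO8859P1",
                   "ISO81", "LATIN1", "LATIN-1"],
    "KOI8-R": ["KOI8R"],
    "US-ASCII": ["ASCII", "USASCII"],
    "UTF-8": ["UTF8"],
}

# Build a single upper-cased lookup so comparisons ignore case.
_LOOKUP = {}
for code, aliases in KNOWN_ENCODINGS.items():
    for name in [code] + aliases:
        _LOOKUP[name.upper()] = code


def resolve_encoding(name):
    """Return the canonical code for a known code or alias.

    Unknown names are returned unchanged, mirroring the rule that such
    names are kept as given and handed to the iconv library.
    """
    return _LOOKUP.get(name.upper(), name)


print(resolve_encoding("latin1"))   # known alias, any letter case
print(resolve_encoding("CP1250"))   # not in this subset: returned as given
```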

Default Character Encoding

The default character encoding for Qore is determined by environment variables.

First, the QORE_CHARSET environment variable is checked. If it is set, then this character encoding will be the default character encoding for the process. If not, then the LANG environment variable is checked. If a character encoding is specified in the LANG environment variable, then it will be used as the default character encoding. Otherwise, if no character encoding can be derived from the environment, UTF-8 is assumed.
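
The lookup order above can be sketched as a small function. This is a Python illustration, not Qore's actual startup code. The function name default_encoding is hypothetical, and two details are assumptions not stated in the text: an empty QORE_CHARSET is treated as unset, and the encoding in LANG is taken to be the part after the dot and before any "@" modifier (for example "en_US.UTF-8").

```python
# Python illustration (not Qore code); default_encoding is a hypothetical name.
def default_encoding(env):
    """Pick a default encoding from a mapping of environment variables."""
    # 1. QORE_CHARSET takes precedence (assumption: empty counts as unset).
    qore_charset = env.get("QORE_CHARSET")
    if qore_charset:
        return qore_charset

    # 2. Otherwise look for a codeset in LANG.
    #    Assumed format: language_TERRITORY.CODESET@modifier
    lang = env.get("LANG", "")
    if "." in lang:
        codeset = lang.split(".", 1)[1].split("@", 1)[0]
        if codeset:
            return codeset

    # 3. Nothing could be derived from the environment.
    return "UTF-8"


print(default_encoding({"QORE_CHARSET": "ISO-8859-2", "LANG": "en_US.UTF-8"}))
print(default_encoding({"LANG": "de_DE.ISO-8859-15@euro"}))
print(default_encoding({"LANG": "C"}))
```

Passing the environment as a plain mapping keeps the sketch free of side effects; in a real process, os.environ could be passed instead.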

Character encodings are automatically converted by the Qore language when necessary. Encoding conversion errors will cause a Qore exception to be thrown. The character encoding conversions supported by Qore depend on the operating system's iconv library function.

Note
The get_default_encoding() function will return the default encoding for the Qore process.

Character Encoding Usage Examples

The following is a non-exhaustive list of examples in Qore where character encoding processing is performed.

Character encoding conversions can be performed explicitly with the convert_encoding() function, and the encoding attached to a string can be checked with the get_encoding() function. If a string is tagged with an incorrect encoding and you want to change the encoding tag (without changing the actual bytes of the string), use the force_encoding() function.
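
The distinction between converting a string and merely retagging it can be illustrated with a conceptual model in which a string is a pair of bytes and an encoding tag. This is a Python sketch, not Qore code. The helper names convert_tagged and force_tag are hypothetical and only mimic the behavior described above; they are not the Qore functions themselves.

```python
# Python illustration (not Qore code); helper names are hypothetical.
def convert_tagged(data, from_enc, to_enc):
    """Re-encode the characters: the bytes change, the text stays the same."""
    return data.decode(from_enc).encode(to_enc), to_enc


def force_tag(data, new_enc):
    """Relabel only: the bytes stay the same, only the tag changes."""
    return data, new_enc


latin1_bytes = "é".encode("iso-8859-1")      # one byte: 0xE9

# Conversion: same character, different bytes, new tag.
converted, tag = convert_tagged(latin1_bytes, "iso-8859-1", "utf-8")

# Retagging: these bytes are really ISO-8859-1 but were labeled UTF-8.
mistagged = (latin1_bytes, "utf-8")
fixed_bytes, fixed_tag = force_tag(mistagged[0], "iso-8859-1")

print(converted, tag)
print(fixed_bytes, fixed_tag, fixed_bytes.decode(fixed_tag))
```

Conversion is the right tool when the tag is correct and a different encoding is needed; retagging is only appropriate when the bytes are already in the target encoding and the tag is wrong.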

get_default_encoding() returns the default encoding for the Qore process.

The Qore::SQL::Datasource, Qore::SQL::DatasourcePool, and Qore::SQL::SQLStatement classes will also translate character encodings to the encoding required by the database if necessary (strictly speaking, this is the responsibility of the DBI driver for the database in question).

The Qore::File and Qore::Socket classes translate character encodings to the encoding specified for the object if necessary, as well as tagging strings received or read with the object's encoding.

The Qore::HTTPClient class will translate character encodings to the encoding specified for the object if necessary, and will tag strings received with the object's encoding. Additionally, if an HTTP server response specifies an encoding, strings read from the server will automatically be tagged with that encoding instead.