Lumberyard
Developer Guide (Version 1.12)

Text Localization and Unicode Support

Because games are typically localized to various languages, your game might have to use text data for many languages.

This document provides programming-related information regarding localization, including localization information specific to Lumberyard.

Terminology

The following table provides brief descriptions of some important terms related to localization and text processing.

Term | Description
character | A unit of textual data. A character can be a glyph or formatting indicator. Note that a glyph does not necessarily form a single visible unit. For example, a diacritical mark [´] and the letter [a] are separate glyphs (and characters), but can be overlaid to form the character [á].
Unicode | A standard maintained by the Unicode Consortium that deals with text and language standardization.
UCS | Universal Character Set, the standardized set of characters in the Unicode standard (also ISO-10646).
(UCS) code-point | An integral identifier for a single character in the UCS defined range, typically displayed with the U prefix followed by hexadecimal, for example: U+12AB.
(text) encoding | A method of mapping (a subset of) UCS to a sequence of code-units, or the process of applying an encoding.
code-unit | An encoding-specific integral unit used to encode code-points. Several code-units may be needed to represent a single code-point.
ASCII | A standardized encoding that covers the first 128 code-points of the UCS space using 7- or 8-bit code-units.
(ANSI) code-page | A standardized encoding that extends ASCII by assigning additional meaning to the higher 128 values when using 8-bit code-units. There are many hundreds of code-pages, some of which use multi-byte sequences to encode code-points.
UTF | UCS Transformation Format, a standardized encoding that covers the entire UCS space.
UTF-8 | A specific instance of UTF, using 8-bit code-units. Each code-point can take 1 to 4 (inclusive) code-units.
UTF-16 | A specific instance of UTF, using 16-bit code-units. Each code-point can take 1 or 2 code-units.
UTF-32 | A specific instance of UTF, using 32-bit code-units. Each code-point is directly mapped to a single code-unit.
byte-order | How a CPU treats a sequence of bytes when interpreting multi-byte values. Typically either little-endian or big-endian format.
encoding error | A sequence of code-units that does not form a code-point (or that forms an invalid code-point, as defined by the Unicode standard).

What encoding to use?

Since there are many methods of encoding text, the question that should be asked when dealing with even the smallest amount of text is, "In what encoding is this stored?" This is an important question because decoding a sequence of code-units in the wrong way will lead to encoding errors, or even worse, to valid decoding that yields the wrong content.
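
To make the "valid decoding that yields the wrong content" case concrete, the following minimal sketch decodes a byte sequence as ISO-8859-1 (Latin-1), where every byte maps directly to the code-point with the same value. The function name DecodeAsLatin1 is hypothetical and is not part of Lumberyard; Latin-1 is chosen here only as an example of a wrong guess.

```cpp
#include <string>

// Hypothetical helper: interprets each byte as one Latin-1 character and
// returns the resulting code-points as a UTF-32 string.
std::u32string DecodeAsLatin1(const std::string& bytes)
{
    std::u32string out;
    for (unsigned char b : bytes)
    {
        // In Latin-1 the byte value is the code-point value.
        out.push_back(static_cast<char32_t>(b));
    }
    return out;
}
```

The two bytes 0xC3 0xA9 are the UTF-8 encoding of the single code-point U+00E9 (é). Passing them to DecodeAsLatin1 raises no error, but returns the two code-points U+00C3 and U+00A9, which display as "Ã©".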

The following table describes some common encodings.

Encoding | Code-unit size | Code-point size | Maps the entire UCS space | Trivial to encode/decode | Immune to byte-order differences | Major users
ASCII | 7 bits | 1 byte | no | yes | yes | Many English-only apps
(ANSI) code-page | 8 bits | varies, usually 1 byte | no | varies, usually yes | yes | Older OS functions
UTF-8 | 8 bits | 1 to 4 bytes | yes | no | yes | Most text on the internet, XML
UTF-16 | 16 bits | 2 to 4 bytes | yes | yes | no | Windows "wide" API, Qt
UCS-2 | 16 bits | 2 bytes | no | yes | no | None (replaced with UTF-16)
UTF-32 / UCS-4 | 32 bits | 4 bytes | yes | yes | no | Linux "wide" API

Because there is no single "best" encoding, you should always consider the scenario in which it will be used when choosing one.

Historically, different operating systems and software packages have chosen different sets of supported encodings. Even C++ follows different conventions on different operating systems. For example, the "wide character" type wchar_t is 16 bits on Windows, but 32 bits on Linux.
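
As a small illustration of the wchar_t difference, the sketch below reports the width of a wide code-unit on whatever platform compiles it. The function names are hypothetical and chosen only for this example.

```cpp
#include <climits>
#include <cstddef>

// Number of bits in one wchar_t code-unit on the current platform.
constexpr std::size_t WideCodeUnitBits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// True when a single wchar_t is wide enough to hold any code-point
// (UTF-32 style). False when some code-points need two code-units
// (UTF-16 style).
constexpr bool WideHoldsAnyCodePoint()
{
    return WideCodeUnitBits() >= 32;
}
```

Code that must behave identically on all platforms should not assume either answer, which is one reason to keep run-time text in UTF-8 and convert only at the call-site.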

Because Lumberyard products can be used on many operating systems and in many languages, full UCS coverage is desirable. The following table presents some conventions used in Lumberyard:

Text data type | Encoding | Reason
Source code | ASCII | We write our code in English, which means ASCII is sufficient.
Text assets | UTF-8 | Assets can be transferred between machines with potentially differing byte-order, and may contain text in many languages.
Run-time variables | UTF-8 | Since transforming text data from or to UTF-8 is not free, we keep data in UTF-8 as much as possible. Exceptions must be made when interacting with libraries or operating systems that require another encoding. In these cases all transformations should be done at the call-site.
File and path names | ASCII | File names are a special case with regards to case-sensitivity, as defined by the file system. Unicode defines 3 cases, and conversions between them are locale-specific. In addition, the normalization formats are typically not (all) accounted for in file-systems and their APIs. Some specialized file-systems only accept ASCII. This combination means that using the most basic and portable sub-set should be preferred, with UTF-8 being used only as required.

General principles

  • Avoid using non-ASCII characters in source code. Consider using escape sequences if a non-ASCII literal is required.

  • Avoid using absolute paths. Only paths that are under developer control should be entered. If possible, use relative ASCII paths for the game folder, root folder, and user folder. When this is not possible, carefully consider non-ASCII contents that may be under a user's control, such as those in the installation folder.
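
The first principle can be sketched as follows. The example writes a non-ASCII literal using hexadecimal escape sequences for its UTF-8 bytes, so the source file itself stays pure ASCII. The constant names are hypothetical.

```cpp
#include <cstring>

// U+00E9 (e with acute accent) is encoded in UTF-8 as the bytes 0xC3 0xA9,
// so this literal holds the word "cafe" with an accented final letter.
const char kCafe[] = "caf\xC3\xA9";

// A hex escape consumes every hex digit that follows it. When the next
// character is itself a hex digit (here 'b'), end the literal first and
// let the compiler join the adjacent literals.
const char kCafeBar[] = "caf\xC3\xA9" "bar";
```

Here std::strlen(kCafe) returns 5 because the accented letter occupies two code-units, and std::strlen(kCafeBar) returns 8.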

How does this affect me when writing code?

Since single-byte code-units are common (even in languages that also use double-byte code-units), single-byte string types can be used almost universally. In addition, since Lumberyard does not use ANSI code-pages, all text must be either ASCII or UTF-8.

The following properties hold for both ASCII and UTF-8.

  • The NULL-byte (integral value 0) only occurs when a NULL-byte is intended (UTF-8 never generates a NULL-byte as part of multi-byte sequences). This means that C-style null-terminated strings act the same, and CRT functions like strlen will work as expected, except that they count code-units, not characters.

  • Code-points in the ASCII range have the same encoded value in UTF-8. This means that you can type English string literals in code and treat them as UTF-8 without conversion. Also, you can compare characters in the ASCII range directly against UTF-8 content (that is, when looking for an English or ASCII symbol sub-string).

  • UTF-8 sequences (containing zero or more entire code-points) do not carry context. This means they are safe to append to each other without changing the contents of the text.
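
The three properties above can be seen in a few lines of code. This is a minimal sketch using only the standard library; the function names are hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Appends two UTF-8 strings. Because UTF-8 sequences carry no context,
// plain byte concatenation is all that is needed.
std::string ConcatUtf8(const std::string& a, const std::string& b)
{
    return a + b;
}

// Finds an ASCII character in UTF-8 text. This is safe because every byte
// of a multi-byte UTF-8 sequence is 0x80 or higher and so can never be
// mistaken for an ASCII character. Returns std::string::npos if not found.
std::size_t FindAscii(const std::string& utf8, char asciiChar)
{
    return utf8.find(asciiChar);
}
```

For example, ConcatUtf8("na\xC3\xAFve", "=1") yields an 8-byte string ("naïve" takes 6 code-units for 5 characters), std::strlen on its c_str() also reports 8, and FindAscii finds '=' at byte position 6.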

The difference between position and length in code-units (as reported through string::length(), strlen(), and similar functions) and their matching position and length in code-points is largely irrelevant. This is because the meaning of the sequence is typically abstract, and the meaning of the bytes matters only when the text is interpreted or displayed. However, keep in mind the following caveats.

  • Splitting strings – When splitting a string, it's important to do one of the following.

    1. Recombine the parts in the same order after splitting, without interpreting the split parts as text in between (for example, when chunking for transmission).

    2. Perform the split at a boundary between code-points. The positions just before and just after any ASCII character are always safe.

  • API boundaries – When an API accepts or returns strings, it's important to know what encoding the API uses. If the API interprets the text rather than treating strings as opaque, passing UTF-8 can be problematic when the API accepts byte-strings and interprets them as ASCII or an ANSI code-page. If no UTF-8 API is available, prefer any other Unicode API instead (UTF-16 or UTF-32). As a last resort, convert to ASCII, but understand that the conversion is lossy and the original text cannot be recovered from the converted string. Always read the documentation of the API to see what text encoding it expects and perform any required conversion. All UTF encodings can be losslessly converted in both directions, so finding any API that accepts a UTF format gives you a way to use UTF encoding.

  • Identifiers – When using strings as a "key" in a collection or for comparison, avoid using non-ASCII sequences as keys, as the concept of "equality" of UTF is complex due to normalization forms and locale-dependent rules. However, comparing UTF-8 strings byte-by-byte is safe if you only care about equality in terms of code-points (since each code-point has exactly one valid sequence of UTF-8 code-units).

  • Sorting – When using strings for sorting, keep in mind that locale-specific rules for the order of text are complex. It's fine to let the UI deal with this in many cases. In general, make no assumptions of how a set of strings will be sorted. However, sorting UTF-8 strings as if they were ASCII will actually sort them by code-point. This is fine if you only require an arbitrary fixed order for std::map look-up, but displaying contents in the UI in this order may be confusing for end-users that expect another ordering.
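
The second splitting option above (splitting at a boundary between code-points) can be sketched as follows. In UTF-8, every byte that continues a multi-byte sequence has the bit pattern 10xxxxxx, so a split position can be moved back until it no longer points at such a byte. The function names are hypothetical, and the sketch assumes the input is already validly encoded.

```cpp
#include <cstddef>
#include <string>

// Returns true if the byte is a UTF-8 continuation byte (10xxxxxx).
bool IsContinuationByte(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Moves pos backwards to the nearest code-point boundary at or before it.
// Positions at or past the end of the string map to the end of the string.
std::size_t FloorToCodePointBoundary(const std::string& utf8, std::size_t pos)
{
    if (pos >= utf8.size())
    {
        return utf8.size();
    }
    while (pos > 0 && IsContinuationByte(static_cast<unsigned char>(utf8[pos])))
    {
        --pos;
    }
    return pos;
}
```

Splitting with utf8.substr(0, boundary) and utf8.substr(boundary) then yields two parts that are each valid UTF-8. Note that this only protects code-points; it does not keep a base letter together with a following combining mark.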

In general, avoid interpreting text if at all possible. Otherwise, try to operate on the ASCII subset and treat all other text parts as opaque indivisible sequences. When dealing with the concept of "length" or "size", consider measuring in code-units instead of code-points, since those operations are computationally cheaper. In fact, the concept of the "length" of Unicode sequences is complex, and there is a many-to-many mapping between code-points and what is actually displayed.
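
To show why a length in code-units is cheaper than a length in code-points, the sketch below counts code-points in UTF-8 text. The function name is hypothetical, and the input is assumed to be validly encoded.

```cpp
#include <cstddef>
#include <string>

// Counts code-points by counting every byte that is not a continuation
// byte (10xxxxxx). This has to visit every byte of the string.
std::size_t CountCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char b : utf8)
    {
        if ((b & 0xC0) != 0x80)
        {
            ++count;
        }
    }
    return count;
}
```

For the string "a\xE2\x82\xAC" (the letter a followed by U+20AC), size() reports 4 code-units without inspecting the contents, while CountCodePoints walks all 4 bytes to return 2. Even that result is not the number of displayed characters, because several code-points can combine into one visible unit.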

How does this affect me when dealing with text assets?

In general, always:

  • Store text assets with UTF-8 encoding.

  • Store with Unicode NFC (Normalization Form C). This is the most common form of storage in text editing tools, so it's best to use this form unless you have a good reason to do otherwise.

  • Store text in the correct case (that is, the one that will be displayed). Case-conversion is a complex topic in many languages and is best avoided.
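
The reason to settle on one normalization form can be illustrated with two byte sequences that display identically but do not compare equal. This is a minimal sketch; the names are hypothetical, and no normalization is performed here.

```cpp
#include <string>

// U+00E9, the precomposed letter e with acute accent. This is the NFC form.
const std::string kComposed = "\xC3\xA9";

// U+0065 (e) followed by U+0301 (combining acute accent). This is the
// decomposed (NFD) form of the same visible character.
const std::string kDecomposed = "e\xCC\x81";

// Byte-by-byte comparison, which is what std::string::operator== does.
bool SameBytes(const std::string& a, const std::string& b)
{
    return a == b;
}
```

SameBytes(kComposed, kDecomposed) returns false, and the strings are 2 and 3 code-units long respectively. If all text assets are stored in NFC, lookups and comparisons on asset text do not run into this mismatch.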

Utilities provided in CryCommon

Lumberyard provides some utilities to make it easy to losslessly and safely convert text between Unicode encodings. In-depth technical details are provided in the header files that expose these utilities, UnicodeFunctions.h and UnicodeIterator.h.

The most common use cases are as follows.

    string utf8;
    wstring wide;
    Unicode::Convert(utf8, wide); // Convert contents of wide string and store into UTF-8 string
    Unicode::Convert(wide, utf8); // Convert contents of UTF-8 string to wide string

    string ascii;
    Unicode::Convert<Unicode::eEncoding_ASCII, Unicode::eEncoding_UTF8>(ascii, utf8); // Convert UTF-8 to ASCII (lossy!)

Important

The above functions assume that the input text is already validly encoded. To guard against malformed user input or potentially broken input, consider using the Unicode::ConvertSafe function.
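
To clarify what "validly encoded" means for UTF-8, the following standalone sketch checks a byte string for the usual encoding errors. It is not the Lumberyard implementation and says nothing about how Unicode::ConvertSafe behaves; see the header files for that. The function name IsValidUtf8 is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Returns true if s contains only well-formed UTF-8 sequences.
bool IsValidUtf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n)
    {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;   // continuation bytes expected after lead
        std::uint32_t cp = 0;    // code-point being assembled
        std::uint32_t minCp = 0; // smallest code-point allowed at this length

        if (lead < 0x80)                { extra = 0; cp = lead;        minCp = 0x0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; minCp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; minCp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; minCp = 0x10000; }
        else { return false; } // stray continuation byte or invalid lead byte

        if (n - i - 1 < extra) { return false; } // sequence is truncated

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) { return false; } // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < minCp) { return false; }                  // overlong encoding
        if (cp > 0x10FFFF) { return false; }               // outside the UCS range
        if (cp >= 0xD800 && cp <= 0xDFFF) { return false; } // surrogate code-point

        i += extra + 1;
    }
    return true;
}
```

A check like this is only needed at the boundary where untrusted text enters the program. Text that has already been validated can be passed on without checking it again.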

Further reading

For an introduction to Unicode, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

For official information about Unicode, see The Unicode Consortium.