
Module std.utf

Encode and decode UTF-8, UTF-16 and UTF-32 strings.

UTF character support is restricted to '\u0000' <= character <= '\U0010FFFF'.
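The range restriction above can be checked with isValidDchar, which is listed under Functions below. This is a minimal sketch, not taken from the module's own documentation; the surrogate check is an extra illustration based on the fact that surrogate code points are not valid UTF-32 characters.

```d
import std.utf : isValidDchar;

void main()
{
    // Both ends of the supported range are accepted.
    assert(isValidDchar('\u0000'));
    assert(isValidDchar('\U0010FFFF'));

    // One past the upper bound is rejected.
    assert(!isValidDchar(cast(dchar) 0x110000));

    // Surrogate code points lie inside the range but are still rejected.
    assert(!isValidDchar(cast(dchar) 0xD800));
}
```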

See Also

Wikipedia
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://anubis.dkuug.dk/JTC1/SC2/WG2/docs/n1335

Functions

Name Description
byCodeUnit Iterate a range of char, wchar, or dchars by code unit.
codeLength Returns the number of code units that are required to encode str in a string whose character type is C. This is particularly useful when slicing one string with the length of another and the two string types use different character types.
codeLength Returns the number of code units that are required to encode the code point c when C is the character type used to encode it.
count Returns the total number of code points encoded in str.
decode Decodes and returns the code point starting at str[index]. index is advanced to one past the decoded code point. If the code point is not well-formed, then a UTFException is thrown and index remains unchanged.
decodeFront decodeFront is a variant of decode which specifically decodes the first code point. Unlike decode, decodeFront accepts any input range of code units (rather than just a string or random access range). It also takes the range by ref and pops off the elements as it decodes them. If numCodeUnits is passed in, it gets set to the number of code units which were in the code point which was decoded.
encode Encodes c in str's encoding and appends it to str.
encode Encodes c into the static array, buf, and returns the actual length of the encoded character (a number between 1 and 4 for char[4] buffers and a number between 1 and 2 for wchar[2] buffers).
isValidDchar Returns whether c is a valid UTF-32 character.
stride stride returns the length of the UTF-32 sequence starting at index in str.
stride stride returns the length of the UTF-16 sequence starting at index in str.
stride stride returns the length of the UTF-8 sequence starting at index in str.
strideBack strideBack returns the length of the UTF-32 sequence ending one code unit before index in str.
strideBack strideBack returns the length of the UTF-16 sequence ending one code unit before index in str.
strideBack strideBack returns the length of the UTF-8 sequence ending one code unit before index in str.
toUCSindex Given index into str, and assuming that index is at the start of a UTF sequence, toUCSindex determines the number of UCS characters up to index. That is, index is the index of a code unit at the beginning of a code point, and the return value is the position of that code point in the string, counted in code points.
toUTF16 Encodes string s into UTF-16 and returns the encoded string.
toUTF16z toUTF16z is a convenience function for toUTFz!(const(wchar)*).
toUTF32 Encodes string s into UTF-32 and returns the encoded string.
toUTF8 Encodes string s into UTF-8 and returns the encoded string.
toUTFindex Given a UCS index n into str, returns the UTF index. So, n is how many code points into the string the code point is, and the array index of the code unit is returned.
validate Checks whether str is well-formed Unicode. A UTFException is thrown if it is not.
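The functions in the table above can be exercised together on a short string. The following is a minimal sketch rather than an excerpt from the module's documentation; the sample string "héllo" and the variable names are chosen only for illustration, and the source file is assumed to be saved as UTF-8.

```d
import std.exception : assertThrown;
import std.utf;

void main()
{
    string s = "héllo";              // 'é' takes two UTF-8 code units

    // count: code points, as opposed to s.length, which counts code units
    assert(s.length == 6);
    assert(count(s) == 5);

    // stride: code units in the sequence starting at an index
    assert(stride(s, 0) == 1);       // 'h'
    assert(stride(s, 1) == 2);       // 'é'

    // strideBack: code units in the sequence ending just before an index
    assert(strideBack(s, 3) == 2);

    // decode: returns the code point and advances the index past it
    size_t i = 1;
    dchar c = decode(s, i);
    assert(c == 'é');
    assert(i == 3);

    // codeLength: code units needed in another encoding
    assert(codeLength!wchar(s) == 5);
    assert(codeLength!char('é') == 2);

    // toUCSindex / toUTFindex: convert between the two kinds of index
    assert(toUCSindex(s, 3) == 2);
    assert(toUTFindex(s, 2) == 3);

    // encode: append to a dynamic array, or fill a static buffer
    char[] buf;
    encode(buf, 'é');
    assert(buf == "é");
    char[4] fixed;
    assert(encode(fixed, 'é') == 2);

    // toUTF8 / toUTF16 / toUTF32: convert whole strings
    assert(toUTF16(s) == "héllo"w);
    assert(toUTF32(s) == "héllo"d);
    assert(toUTF8("héllo"w) == s);

    // validate: throws UTFException on malformed input
    char[] bad = [cast(char) 0x80];  // a lone continuation byte
    assertThrown!UTFException(validate(bad));
}
```

Keeping code unit indices (stride, decode) separate from code point indices (toUCSindex, toUTFindex) is the main thing to watch when slicing strings that contain non-ASCII characters.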

Classes

Name Description
UTFException Exception thrown on errors in std.utf functions.

Templates

Name Description
byUTF Iterate an input range of characters by char type C.
toUTFz Returns a C-style zero-terminated string equivalent to str. str must not contain embedded '\0' characters, as any C function will treat the first '\0' that it sees as the end of the string. If str.empty is true, then a string containing only '\0' is returned.
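As a rough illustration of the two templates above, the sketch below passes a D string to C's strlen through toUTFz and re-encodes the same string lazily with byUTF. The sample string and variable names are illustrative only, and the source file is assumed to be saved as UTF-8.

```d
import core.stdc.string : strlen;
import std.algorithm.comparison : equal;
import std.utf;

void main()
{
    string s = "héllo";

    // toUTFz: the template argument is the pointer type to produce
    const(char)* p = toUTFz!(const(char)*)(s);
    assert(strlen(p) == s.length);   // 6 code units before the terminator

    // An empty string yields a pointer to a lone '\0'
    assert(*toUTFz!(const(char)*)("") == '\0');

    // toUTF16z is shorthand for toUTFz!(const(wchar)*)
    const(wchar)* w = toUTF16z(s);
    assert(w[5] == 0);               // five UTF-16 code units, then '\0'

    // byUTF: lazily re-encode without allocating a new string
    assert(s.byUTF!wchar().equal("héllo"w.byCodeUnit));
    assert(s.byUTF!dchar().equal("héllo"d));
}
```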

Enum values

Name Type Description
replacementDchar Inserted in place of invalid UTF sequences.

Aliases

Name Type Description
byChar Iterate an input range of characters by char. This alias simply forwards to byUTF with char as the C argument.
byDchar Iterate an input range of characters by dchar. This alias simply forwards to byUTF with dchar as the C argument.
byWchar Iterate an input range of characters by wchar. This alias simply forwards to byUTF with wchar as the C argument.
UseReplacementDchar Flag!("useReplacementDchar") Whether or not to replace invalid UTF with replacementDchar.
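The aliases and the UseReplacementDchar flag can be sketched as follows. The input containing a lone continuation byte is a made-up example of invalid UTF-8, and the source file is assumed to be saved as UTF-8. The sketch also assumes that the by* ranges substitute replacementDchar for invalid input by default, while decode throws unless the flag is passed.

```d
import std.algorithm.comparison : equal;
import std.exception : assertThrown;
import std.utf;

void main()
{
    string s = "héllo";

    // Each alias behaves like byUTF with the matching character type
    assert(s.byChar().equal(s.byUTF!char()));
    assert(s.byWchar().equal(s.byUTF!wchar()));
    assert(s.byDchar().equal(s.byUTF!dchar()));

    // Hypothetical invalid input: 'a' followed by a lone continuation byte
    char[] bad = ['a', cast(char) 0x80];

    // byDchar yields replacementDchar in place of the invalid sequence
    assert(bad.byDchar().equal(['a', replacementDchar]));

    // decode throws by default ...
    size_t i = 1;
    assertThrown!UTFException(decode(bad, i));

    // ... and returns replacementDchar when the flag is set
    size_t j = 1;
    dchar c = decode!(UseReplacementDchar.yes)(bad, j);
    assert(c == replacementDchar);
}
```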

Authors

Walter Bright and Jonathan M Davis

License

Boost License 1.0.
