
Module `std.uni`

The std.uni module provides an implementation of fundamental Unicode algorithms and data structures. It does not include UTF encoding and decoding primitives; for those, see std.utf.decode and std.utf.encode in std.utf.

All primitives listed operate on Unicode characters and sets of characters. For functions which operate on ASCII characters and ignore Unicode characters, see std.ascii. For definitions of Unicode character, code point and other terms used throughout this module see the terminology section below.

The focus of this module is the core needs of developing Unicode-aware applications. To that effect it provides the following optimized primitives:

Character classification by category and common properties: isAlpha, isWhite and others.
Case-insensitive string comparison (sicmp, icmp).
Converting text to any of the four normalization forms via normalize.
Decoding (decodeGrapheme) and iteration (byGrapheme, graphemeStride) by user-perceived characters, that is by Grapheme clusters.
Decomposing and composing of individual character(s) according to canonical or compatibility rules, see compose and decompose, including the specific version for Hangul syllables composeJamo and decomposeHangul.
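
As a rough illustration of the primitives listed above, the following sketch exercises classification, both comparison functions and single-character composition. The sample strings are arbitrary choices for this example, not part of the module.

```d
import std.algorithm.comparison : equal;
import std.uni;

void main()
{
    // Classification works on code points (dchar), not on code units.
    assert(isAlpha('я'));
    assert(!isAlpha('1'));
    assert(isWhite('\u2028')); // LINE SEPARATOR

    // sicmp uses simple case folding; 0 means "equal".
    assert(sicmp("Привет", "ПРИВЕТ") == 0);
    // icmp uses full case folding, so German sharp s matches "ss".
    assert(icmp("Rußland", "Russland") == 0);
    assert(sicmp("Rußland", "Russland") != 0);

    // Canonical composition and decomposition of single characters.
    assert(compose('A', '\u0308') == '\u00C4'); // A + diaeresis
    assert(compose('A', 'B') == dchar.init);    // the pair does not compose
    assert(decompose('\u0124')[].equal("H\u0302"));
}
```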

It's recognized that an application may need further enhancements and extensions, such as less commonly known algorithms, or tailoring existing ones for region specific needs. To help users with building any extra functionality beyond the core primitives, the module provides:

CodepointSet, a type for easy manipulation of sets of characters. Besides the typical set algebra it provides an unusual feature: a D source code generator for detection of code points in this set. This is a boon for meta-programming parser frameworks, and is used internally to power classification in small sets like isWhite.
A way to construct optimal packed multi-stage tables, also known as a special case of Trie. The functions codepointTrie and codepointSetTrie construct custom tries that map dchar to value. The end result is a fast and predictable O(1) lookup that powers functions like isAlpha and combiningClass, but for user-defined data sets.
A useful technique for Unicode-aware parsers that perform character classification of encoded code points is to avoid unnecessary decoding at all costs. utfMatcher provides an improvement over the usual workflow of decode-classify-process, combining the decoding and classification steps. By extracting necessary bits directly from encoded code units, matchers achieve significant performance improvements. See MatcherConcept for the common interface of UTF matchers.
Generally useful building blocks for customized normalization: combiningClass for querying combining class and allowedIn for testing the Quick_Check property of a given normalization form.
Access to a large selection of commonly used sets of code points. Supported sets include Script, Block and General Category. The exact contents of a set can be observed in the CLDR utility, on the property index page of the Unicode website. See unicode for easy and (optionally) compile-time checked set queries.
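
The building blocks above can be sketched as follows. The `vowels` and `consonants` sets and the input string are made up for this example; the calls themselves (`CodepointSet`, `allowedIn`, `utfMatcher`) are the ones named in the list.

```d
import std.uni;

void main()
{
    // Intervals are half-open: 'a', 'b' means just the letter 'a'.
    auto vowels = CodepointSet('a', 'b', 'e', 'f', 'i', 'j',
                               'o', 'p', 'u', 'v');
    assert(vowels.length == 5);
    assert(vowels['e'] && !vowels['z']);

    // Set algebra against named Unicode sets.
    auto consonants = (unicode.ASCII & unicode.Lowercase_Letter) - vowels;
    assert(consonants.length == 21);
    assert(consonants['z'] && !consonants['e']);

    // Quick_Check: Cyrillic 'я' never changes under NFC,
    // while precomposed 'Ä' (U+00C4) cannot appear in NFD text.
    assert(allowedIn!NFC('я'));
    assert(!allowedIn!NFD('\u00C4'));

    // A UTF-8 matcher classifies code units without a separate decode step.
    string input = "2² = 4";
    auto numbers = utfMatcher!char(unicode.Number);
    assert(numbers.match(input)); // '2' is a number
    assert(input == "² = 4");     // match advances past the code point
}
```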

Synopsis

import std.algorithm.searching : find;
import std.uni;
void main()
{
    // initialize code point sets using script/block or property name
    // now 'set' contains code points from both scripts.
    auto set = unicode("Cyrillic") | unicode("Armenian");
    // same thing but simpler and checked at compile-time
    auto ascii = unicode.ASCII;
    auto currency = unicode.Currency_Symbol;

    // easy set ops
    auto a = set & ascii;
    assert(a.empty); // as it has no intersection with ascii
    a = set | ascii;
    auto b = currency - a; // subtract all ASCII, Cyrillic and Armenian

    // some properties of code point sets
    assert(b.length > 45); // 46 items in Unicode 6.1, even more in 6.2
    // testing presence of a code point in a set
    // is just fine, it is O(logN)
    assert(!b['$']);
    assert(!b['\u058F']); // Armenian dram sign
    assert(b['¥']);

    // building fast lookup tables, these guarantee O(1) complexity
    // 1-level Trie lookup table essentially a huge bit-set ~262Kb
    auto oneTrie = toTrie!1(b);
    // 2-level far more compact but typically slightly slower
    auto twoTrie = toTrie!2(b);
    // 3-level even smaller, and a bit slower yet
    auto threeTrie = toTrie!3(b);
    assert(oneTrie['£']);
    assert(twoTrie['£']);
    assert(threeTrie['£']);

    // build the trie with the most sensible trie level
    // and bind it as a functor
    auto cyrillicOrArmenian = toDelegate(set);
    auto balance = find!(cyrillicOrArmenian)("Hello ընկեր!");
    assert(balance == "ընկեր!");
    // compatible with bool delegate(dchar)
    bool delegate(dchar) bindIt = cyrillicOrArmenian;

    // Normalization
    string s = "Plain ascii (and not only), is always normalized!";
    assert(s is normalize(s));// is the same string

    string nonS = "A\u0308ffin"; // 'A' followed by a combining diaeresis
    auto nS = normalize(nonS); // to NFC, the W3C endorsed standard
    assert(nS == "Äffin");
    assert(nS != nonS);
    string composed = "Äffin";

    assert(normalize!NFD(composed) == "A\u0308ffin");
    // to NFKD, compatibility decomposition useful for fuzzy matching/searching
    assert(normalize!NFKD("2¹⁰") == "210");
}

Terminology

The following is a list of important Unicode notions and definitions. Any conventions used specifically in this module alone are marked as such. The descriptions are based on the formal definition as found in chapter three of The Unicode Standard Core Specification.

Abstract character

A unit of information used for the organization, control, or representation of textual data. Note that:

When representing data, the nature of that data is generally symbolic as opposed to some other kind of data (for example, visual).

An abstract character has no concrete form and should not be confused with a glyph.

An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a Grapheme.

The abstract characters encoded (see Encoded character) are known as Unicode abstract characters.

Abstract characters not directly encoded by the Unicode Standard can often be represented by the use of combining character sequences.

Canonical decomposition

The decomposition of a character or character sequence that results from recursively applying the canonical mappings found in the Unicode Character Database and those described in Conjoining Jamo Behavior (section 12 of Unicode Conformance).

Canonical composition

The precise definition of the Canonical composition is the algorithm as specified in Unicode Conformance section 11. Informally it's the process that does the reverse of the canonical decomposition with the addition of certain rules that e.g. prevent legacy characters from appearing in the composed result.

Canonical equivalent

Two character sequences are said to be canonical equivalents if their full canonical decompositions are identical.

Character

Typically differs by context. For the purpose of this documentation the term character implies encoded character, that is, a code point having an assigned abstract character (a symbolic meaning).

Code point

Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF (hex). Not all code points are assigned to encoded characters.

Code unit

The minimal bit combination that can represent a unit of encoded text for processing or interchange. Depending on the encoding this could be: 8-bit code units in the UTF-8 (char), 16-bit code units in the UTF-16 (wchar), and 32-bit code units in the UTF-32 (dchar). Note that in UTF-32, a code unit is a code point and is represented by the D dchar type.
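
A small sketch of how one code point maps to a different number of code units in each encoding. The two sample characters are arbitrary; `walkLength` comes from std.range and counts decoded code points.

```d
import std.range.primitives : walkLength;

void main()
{
    // U+20AC EURO SIGN lies in the Basic Multilingual Plane.
    string  s8  = "€";
    wstring s16 = "€"w;
    dstring s32 = "€"d;
    assert(s8.length == 3);  // three UTF-8 code units (char)
    assert(s16.length == 1); // one UTF-16 code unit (wchar)
    assert(s32.length == 1); // one UTF-32 code unit (dchar)

    // U+1F600 lies outside the BMP and needs a surrogate pair in UTF-16.
    assert("\U0001F600".length == 4);
    assert("\U0001F600"w.length == 2);
    assert("\U0001F600"d.length == 1);

    // Counting code points instead of code units.
    assert(s8.walkLength == 1);
}
```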

Combining character

A character with the General Category of Combining Mark (M).

All characters with non-zero canonical combining class are combining characters, but the reverse is not the case: there are combining characters with a zero combining class.
These characters are not normally used in isolation unless they are being described. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.
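
The distinction between "combining character" and "non-zero combining class" can be checked directly. The two marks below are picked for illustration only.

```d
import std.uni;

void main()
{
    // U+0301 COMBINING ACUTE ACCENT: a mark with a non-zero class.
    assert(isMark('\u0301'));
    assert(combiningClass('\u0301') == 230);

    // U+0903 DEVANAGARI SIGN VISARGA: a spacing mark (Mc)
    // whose combining class is zero.
    assert(isMark('\u0903'));
    assert(combiningClass('\u0903') == 0);

    // An ordinary letter is neither.
    assert(!isMark('a'));
    assert(combiningClass('a') == 0);
}
```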

Combining class

A numerical value used by the Unicode Canonical Ordering Algorithm to determine which sequences of combining marks are to be considered canonically equivalent and which are not.

Compatibility decomposition

The decomposition of a character or character sequence that results from recursively applying both the compatibility mappings and the canonical mappings found in the Unicode Character Database, and those described in Conjoining Jamo Behavior, until no characters can be further decomposed.

Compatibility equivalent

Two character sequences are said to be compatibility equivalents if their full compatibility decompositions are identical.
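
Both kinds of equivalence can be tested by comparing full decompositions. The two helper functions below are hypothetical names introduced for this sketch; they are not part of std.uni and simply wrap `normalize`.

```d
import std.uni;

// Hypothetical helper: equal full canonical decompositions (NFD).
bool canonicallyEquivalent(string a, string b)
{
    return normalize!NFD(a) == normalize!NFD(b);
}

// Hypothetical helper: equal full compatibility decompositions (NFKD).
bool compatibilityEquivalent(string a, string b)
{
    return normalize!NFKD(a) == normalize!NFKD(b);
}

void main()
{
    // Precomposed 'Ä' versus 'A' + COMBINING DIAERESIS.
    assert(canonicallyEquivalent("\u00C4", "A\u0308"));
    // OHM SIGN versus GREEK CAPITAL LETTER OMEGA.
    assert(canonicallyEquivalent("\u2126", "\u03A9"));
    // The "fi" ligature (U+FB01) is only compatibility equivalent to "fi".
    assert(!canonicallyEquivalent("\uFB01", "fi"));
    assert(compatibilityEquivalent("\uFB01", "fi"));
}
```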

Encoded character

An association (or mapping) between an abstract character and a code point.

Glyph

The actual, concrete image of a glyph representation having been rasterized or otherwise imaged onto some display surface.

Grapheme base

A character with the property Grapheme_Base, or any standard Korean syllable block.

Grapheme cluster

Defined as the text between grapheme boundaries as specified by Unicode Standard Annex #29, Unicode text segmentation. Important general properties of a grapheme:

The grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it.
A grapheme cluster typically starts with a grapheme base and then extends across any subsequent sequence of nonspacing marks. A grapheme cluster is most directly relevant to text rendering and processes such as cursor placement and text selection in editing, but may also be relevant to comparison and searching.
For many processes, a grapheme cluster behaves as if it was a single character with the same properties as its grapheme base. Effectively, nonspacing marks apply graphically to the base, but do not change its properties.

This module defines a number of primitives that work with graphemes: Grapheme, decodeGrapheme and graphemeStride. All of them use extended grapheme boundaries as defined in the aforementioned standard annex.
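
A minimal sketch of these primitives on a string that contains one combining sequence. The sample text is an assumption of this example.

```d
import std.range.primitives : walkLength;
import std.uni;

void main()
{
    string text = "noe\u0308l"; // "noël" spelled with a combining diaeresis
    assert(text.walkLength == 5);            // code points
    assert(text.byGrapheme.walkLength == 4); // user-perceived characters

    // The cluster starting at index 2 spans 'e' (1 code unit)
    // plus U+0308 (2 code units in UTF-8).
    assert(graphemeStride(text, 2) == 3);

    // decodeGrapheme reads one cluster and advances the input.
    string rest = text[2 .. $];
    auto g = decodeGrapheme(rest);
    assert(g.length == 2); // measured in code points
    assert(g[0] == 'e' && g[1] == '\u0308');
    assert(rest == "l");
}
```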

Nonspacing mark

A combining character with the General Category of Nonspacing Mark (Mn) or Enclosing Mark (Me).

Spacing mark

A combining character that is not a nonspacing mark.

Normalization

The concepts of canonical equivalent or compatibility equivalent characters in the Unicode Standard make it necessary to have a full, formal definition of equivalence for Unicode strings. String equivalence is determined by a process called normalization, whereby strings are converted into forms which are compared directly for identity. This is the primary goal of the normalization process, see the function normalize to convert into any of the four defined forms.

A very important attribute of the Unicode Normalization Forms is that they must remain stable between versions of the Unicode Standard. A Unicode string normalized to a particular Unicode Normalization Form in one version of the standard is guaranteed to remain in that Normalization Form for implementations of future versions of the standard.

The Unicode Standard specifies four normalization forms. Informally, two of these forms are defined by maximal decomposition of equivalent sequences, and two of these forms are defined by maximal composition of equivalent sequences.

Normalization Form D (NFD): The canonical decomposition of a character sequence.
Normalization Form KD (NFKD): The compatibility decomposition of a character sequence.
Normalization Form C (NFC): The canonical composition of the canonical decomposition of a coded character sequence.
Normalization Form KC (NFKC): The canonical composition of the compatibility decomposition of a character sequence.

The choice of the normalization form depends on the particular use case. NFC is the best form for general text, since it's more compatible with strings converted from legacy encodings. NFKC is the preferred form for identifiers, especially where there are security concerns. NFD and NFKD are the most useful for internal processing.
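
To make the difference between the four forms concrete, here is one input run through each of them. The input pairs a compatibility character (the "fi" ligature) with a precomposed letter; it is an example chosen for this sketch.

```d
import std.uni;

void main()
{
    // U+FB01 LATIN SMALL LIGATURE FI, then U+00C5 'Å'.
    string s = "\uFB01\u00C5";

    // Canonical forms keep the ligature and only touch 'Å'.
    assert(normalize!NFC(s) == "\uFB01\u00C5");
    assert(normalize!NFD(s) == "\uFB01A\u030A");

    // Compatibility forms additionally replace the ligature with "fi".
    assert(normalize!NFKC(s) == "fi\u00C5");
    assert(normalize!NFKD(s) == "fiA\u030A");
}
```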

Construction of lookup tables

The Unicode standard describes a set of algorithms that depend on having the ability to quickly look up various properties of a code point. Given the codespace of about 1 million code points, it is not a trivial task to provide a space-efficient solution for the multitude of properties.

Common approaches such as hash-tables or binary search over sorted code point intervals (as in InversionList) are insufficient. Hash-tables have an enormous memory footprint, and binary search over intervals is not fast enough for some heavy-duty algorithms.

The recommended solution (see Unicode Implementation Guidelines) is using multi-stage tables that are an implementation of the Trie data structure with integer keys and a fixed number of stages. For the remainder of the section this will be called a fixed trie. The following describes a particular implementation that is aimed at speed of access at the expense of ideal size savings.

Taking a 2-level Trie as an example, the principle of operation is as follows. Split the number of bits in a key (code point, 21 bits) into 2 components (e.g. 13 and 8). The first is the number of bits in the index of the trie and the other is the number of bits in each page of the trie. The layout of the trie is then an array of size 2^^bits-of-index followed by an array of memory chunks of size 2^^bits-of-page/bits-per-element.

The number of pages is variable (but not less than 1), unlike the number of entries in the index. The slots of the index all have to contain a number of a page that is present. The lookup is then just a couple of operations: slice the upper bits, look up an index for these, take a page at this index and use the lower bits as an offset within this page.

Assuming that pages are laid out consecutively in one array, pages, the pseudo-code is:

auto elemsPerPage = (2 ^^ bits_per_page) / Value.sizeOfInBits;
pages[index[n >> bits_per_page]][n & (elemsPerPage - 1)];

If elemsPerPage is a power of 2, the whole process is a handful of simple instructions and 2 array reads. Subsequent levels of the trie are introduced by recursing on this notion: the index array is treated as values. The number of bits in the index is then again split into 2 parts, with pages over 'current-index' and the new 'upper-index'.

For completeness a level 1 trie is simply an array. The current implementation takes advantage of bit-packing values when the range is known to be limited in advance (such as bool). See also BitPacked for enforcing it manually. The major size advantage however comes from the fact that multiple identical pages on every level are merged by construction.

The process of constructing a trie is more involved and is hidden from the user in the form of the convenience functions codepointTrie, codepointSetTrie and the even more convenient toTrie. In general a set or built-in AA with dchar type can be turned into a trie. The trie object in this module is read-only (immutable); it's effectively frozen after construction.
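
The two-stage idea can be sketched by hand. This is a simplified illustration, not the module's implementation: `TwoStageTable` and `buildTable` are hypothetical names, values are stored one `bool` per element rather than bit-packed, and identical pages are merged with a plain linear search. The library calls at the end (`toTrie`, `codepointSetTrie`) are the real entry points.

```d
import std.uni;

enum pageBits = 8;             // low bits select an element inside a page
enum pageSize = 1 << pageBits; // elements per page

// Hypothetical two-stage table: an index of page numbers plus the pages.
struct TwoStageTable
{
    ushort[] index; // one slot per pageSize-sized block of the codespace
    bool[] pages;   // unique pages stored back to back

    bool opIndex(dchar ch) const
    {
        immutable page = index[ch >> pageBits];
        return pages[page * pageSize + (ch & (pageSize - 1))];
    }
}

TwoStageTable buildTable(CodepointSet set)
{
    enum blocks = 0x110000 >> pageBits; // 4352 index slots
    TwoStageTable t;
    t.index = new ushort[blocks];
    foreach (b; 0 .. blocks)
    {
        bool[pageSize] page;
        foreach (i; 0 .. pageSize)
            page[i] = set[b * pageSize + i];
        // Reuse an identical page if one was already stored.
        size_t slot = 0;
        immutable count = t.pages.length / pageSize;
        while (slot < count
            && t.pages[slot * pageSize .. (slot + 1) * pageSize] != page[])
            ++slot;
        if (slot == count)
            t.pages ~= page[];
        t.index[b] = cast(ushort) slot;
    }
    return t;
}

void main()
{
    auto set = unicode("Cyrillic") | unicode("Armenian");
    auto table = buildTable(set);
    assert(table['я'] && table['\u0531'] && !table['z']);
    // Page merging: far fewer stored pages than index slots.
    assert(table.pages.length / pageSize < table.index.length);

    // The library equivalents: a default 2-level trie and a custom 13/8 split.
    auto trie = toTrie!2(set);
    auto custom = codepointSetTrie!(13, 8)(set);
    assert(trie['я'] && custom['я'] && !custom['z']);
}
```

Storing whole `bool` values keeps the sketch short; the bit-packing described above would shrink each page by a factor of eight.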

Unicode properties

This is a full list of Unicode properties accessible through unicode with specific helpers per category nested within. Consult the CLDR utility when in doubt about the contents of a particular set.

General category sets listed below are only accessible with the unicode shorthand accessor.

**General category**
Abb.	Long form	Abb.	Long form	Abb.	Long form
L	Letter	Cn	Unassigned	Po	Other_Punctuation
Ll	Lowercase_Letter	Co	Private_Use	Ps	Open_Punctuation
Lm	Modifier_Letter	Cs	Surrogate	S	Symbol
Lo	Other_Letter	N	Number	Sc	Currency_Symbol
Lt	Titlecase_Letter	Nd	Decimal_Number	Sk	Modifier_Symbol
Lu	Uppercase_Letter	Nl	Letter_Number	Sm	Math_Symbol
M	Mark	No	Other_Number	So	Other_Symbol
Mc	Spacing_Mark	P	Punctuation	Z	Separator
Me	Enclosing_Mark	Pc	Connector_Punctuation	Zl	Line_Separator
Mn	Nonspacing_Mark	Pd	Dash_Punctuation	Zp	Paragraph_Separator
C	Other	Pe	Close_Punctuation	Zs	Space_Separator
Cc	Control	Pf	Final_Punctuation	-	Any
Cf	Format	Pi	Initial_Punctuation	-	ASCII

Sets for other commonly useful properties that are accessible with unicode:

**Common binary properties**
Name	Name	Name
Alphabetic	Ideographic	Other_Uppercase
ASCII_Hex_Digit	IDS_Binary_Operator	Pattern_Syntax
Bidi_Control	ID_Start	Pattern_White_Space
Cased	IDS_Trinary_Operator	Quotation_Mark
Case_Ignorable	Join_Control	Radical
Dash	Logical_Order_Exception	Soft_Dotted
Default_Ignorable_Code_Point	Lowercase	STerm
Deprecated	Math	Terminal_Punctuation
Diacritic	Noncharacter_Code_Point	Unified_Ideograph
Extender	Other_Alphabetic	Uppercase
Grapheme_Base	Other_Default_Ignorable_Code_Point	Variation_Selector
Grapheme_Extend	Other_Grapheme_Extend	White_Space
Grapheme_Link	Other_ID_Continue	XID_Continue
Hex_Digit	Other_ID_Start	XID_Start
Hyphen	Other_Lowercase
ID_Continue	Other_Math
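
A short sketch of querying the general category and binary property sets from the tables above. The probe characters are arbitrary.

```d
import std.uni;

void main()
{
    // The abbreviated and the long form name the same set.
    assert(unicode.Lu == unicode.Uppercase_Letter);

    // Compile-time checked lookups.
    assert(unicode.Sc['$']);
    assert(unicode.Zs[' ']);
    assert(!unicode.L['1']);

    // The same kind of lookup performed at run time.
    assert(unicode("Nd")['7']);

    // Binary properties are reached the same way.
    assert(unicode.White_Space['\t']);
    assert(unicode.ASCII_Hex_Digit['f']);
}
```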

Below is the table with block names accepted by unicode.block. Note that the shorthand version unicode requires "In" to be prepended to the names of blocks so as to disambiguate scripts and blocks.

**Blocks**
Aegean Numbers	Ethiopic Extended	Mongolian
Alchemical Symbols	Ethiopic Extended-A	Musical Symbols
Alphabetic Presentation Forms	Ethiopic Supplement	Myanmar
Ancient Greek Musical Notation	General Punctuation	Myanmar Extended-A
Ancient Greek Numbers	Geometric Shapes	New Tai Lue
Ancient Symbols	Georgian	NKo
Arabic	Georgian Supplement	Number Forms
Arabic Extended-A	Glagolitic	Ogham
Arabic Mathematical Alphabetic Symbols	Gothic	Ol Chiki
Arabic Presentation Forms-A	Greek and Coptic	Old Italic
Arabic Presentation Forms-B	Greek Extended	Old Persian
Arabic Supplement	Gujarati	Old South Arabian
Armenian	Gurmukhi	Old Turkic
Arrows	Halfwidth and Fullwidth Forms	Optical Character Recognition
Avestan	Hangul Compatibility Jamo	Oriya
Balinese	Hangul Jamo	Osmanya
Bamum	Hangul Jamo Extended-A	Phags-pa
Bamum Supplement	Hangul Jamo Extended-B	Phaistos Disc
Basic Latin	Hangul Syllables	Phoenician
Batak	Hanunoo	Phonetic Extensions
Bengali	Hebrew	Phonetic Extensions Supplement
Block Elements	High Private Use Surrogates	Playing Cards
Bopomofo	High Surrogates	Private Use Area
Bopomofo Extended	Hiragana	Rejang
Box Drawing	Ideographic Description Characters	Rumi Numeral Symbols
Brahmi	Imperial Aramaic	Runic
Braille Patterns	Inscriptional Pahlavi	Samaritan
Buginese	Inscriptional Parthian	Saurashtra
Buhid	IPA Extensions	Sharada
Byzantine Musical Symbols	Javanese	Shavian
Carian	Kaithi	Sinhala
Chakma	Kana Supplement	Small Form Variants
Cham	Kanbun	Sora Sompeng
Cherokee	Kangxi Radicals	Spacing Modifier Letters
CJK Compatibility	Kannada	Specials
CJK Compatibility Forms	Katakana	Sundanese
CJK Compatibility Ideographs	Katakana Phonetic Extensions	Sundanese Supplement
CJK Compatibility Ideographs Supplement	Kayah Li	Superscripts and Subscripts
CJK Radicals Supplement	Kharoshthi	Supplemental Arrows-A
CJK Strokes	Khmer	Supplemental Arrows-B
CJK Symbols and Punctuation	Khmer Symbols	Supplemental Mathematical Operators
CJK Unified Ideographs	Lao	Supplemental Punctuation
CJK Unified Ideographs Extension A	Latin-1 Supplement	Supplementary Private Use Area-A
CJK Unified Ideographs Extension B	Latin Extended-A	Supplementary Private Use Area-B
CJK Unified Ideographs Extension C	Latin Extended Additional	Syloti Nagri
CJK Unified Ideographs Extension D	Latin Extended-B	Syriac
Combining Diacritical Marks	Latin Extended-C	Tagalog
Combining Diacritical Marks for Symbols	Latin Extended-D	Tagbanwa
Combining Diacritical Marks Supplement	Lepcha	Tags
Combining Half Marks	Letterlike Symbols	Tai Le
Common Indic Number Forms	Limbu	Tai Tham
Control Pictures	Linear B Ideograms	Tai Viet
Coptic	Linear B Syllabary	Tai Xuan Jing Symbols
Counting Rod Numerals	Lisu	Takri
Cuneiform	Low Surrogates	Tamil
Cuneiform Numbers and Punctuation	Lycian	Telugu
Currency Symbols	Lydian	Thaana
Cypriot Syllabary	Mahjong Tiles	Thai
Cyrillic	Malayalam	Tibetan
Cyrillic Extended-A	Mandaic	Tifinagh
Cyrillic Extended-B	Mathematical Alphanumeric Symbols	Transport And Map Symbols
Cyrillic Supplement	Mathematical Operators	Ugaritic
Deseret	Meetei Mayek	Unified Canadian Aboriginal Syllabics
Devanagari	Meetei Mayek Extensions	Unified Canadian Aboriginal Syllabics Extended
Devanagari Extended	Meroitic Cursive	Vai
Dingbats	Meroitic Hieroglyphs	Variation Selectors
Domino Tiles	Miao	Variation Selectors Supplement
Egyptian Hieroglyphs	Miscellaneous Mathematical Symbols-A	Vedic Extensions
Emoticons	Miscellaneous Mathematical Symbols-B	Vertical Forms
Enclosed Alphanumerics	Miscellaneous Symbols	Yijing Hexagram Symbols
Enclosed Alphanumeric Supplement	Miscellaneous Symbols and Arrows	Yi Radicals
Enclosed CJK Letters and Months	Miscellaneous Symbols And Pictographs	Yi Syllables
Enclosed Ideographic Supplement	Miscellaneous Technical
Ethiopic	Modifier Tone Letters

Below is the table with script names accepted by unicode.script and by the shorthand version unicode:

**Scripts**
Arabic	Hanunoo	Old_Italic
Armenian	Hebrew	Old_Persian
Avestan	Hiragana	Old_South_Arabian
Balinese	Imperial_Aramaic	Old_Turkic
Bamum	Inherited	Oriya
Batak	Inscriptional_Pahlavi	Osmanya
Bengali	Inscriptional_Parthian	Phags_Pa
Bopomofo	Javanese	Phoenician
Brahmi	Kaithi	Rejang
Braille	Kannada	Runic
Buginese	Katakana	Samaritan
Buhid	Kayah_Li	Saurashtra
Canadian_Aboriginal	Kharoshthi	Sharada
Carian	Khmer	Shavian
Chakma	Lao	Sinhala
Cham	Latin	Sora_Sompeng
Cherokee	Lepcha	Sundanese
Common	Limbu	Syloti_Nagri
Coptic	Linear_B	Syriac
Cuneiform	Lisu	Tagalog
Cypriot	Lycian	Tagbanwa
Cyrillic	Lydian	Tai_Le
Deseret	Malayalam	Tai_Tham
Devanagari	Mandaic	Tai_Viet
Egyptian_Hieroglyphs	Meetei_Mayek	Takri
Ethiopic	Meroitic_Cursive	Tamil
Georgian	Meroitic_Hieroglyphs	Telugu
Glagolitic	Miao	Thaana
Gothic	Mongolian	Thai
Greek	Myanmar	Tibetan
Gujarati	New_Tai_Lue	Tifinagh
Gurmukhi	Nko	Ugaritic
Han	Ogham	Vai
Hangul	Ol_Chiki	Yi
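
Because some names denote both a block and a script, it may help to see the two accessors side by side. This sketch uses Arabic, which appears in both tables; the probe code points are chosen for illustration.

```d
import std.uni;

void main()
{
    auto arabicScript = unicode.script.Arabic;
    auto arabicBlock  = unicode.block.Arabic;

    // U+0627 ARABIC LETTER ALEF belongs to both sets.
    assert(arabicScript['\u0627'] && arabicBlock['\u0627']);
    // Yet the script and the block are different sets.
    assert(arabicBlock != arabicScript);

    // With the shorthand accessor, blocks need the "In" prefix.
    assert(arabicBlock == unicode.InArabic);
    assert(arabicScript == unicode.Arabic);

    // Multi-word block names use underscores in identifier form.
    assert(unicode.block.Greek_and_Coptic == unicode.InGreek_and_Coptic);
    assert(unicode.InGreek_and_Coptic['\u03A9']); // GREEK CAPITAL OMEGA
}
```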

Below is the table of names accepted by unicode.hangulSyllableType.

**Hangul syllable type**
Abb.	Long form
L	Leading_Jamo
LV	LV_Syllable
LVT	LVT_Syllable
T	Trailing_Jamo
V	Vowel_Jamo
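
The syllable types tie in with the Hangul-specific functions of this module. A sketch, with sample syllables chosen for this example:

```d
import std.algorithm.comparison : equal;
import std.uni;

void main()
{
    // Here "L" is a syllable type, not the Letter category of unicode.L.
    auto leading = unicode.hangulSyllableType("L");
    assert(leading['\u1100']); // HANGUL CHOSEONG KIYEOK

    // U+AC00 has no trailing consonant; U+D4DB has one.
    assert(unicode.hangulSyllableType.LV['\uAC00']);
    assert(unicode.hangulSyllableType.LVT['\uD4DB']);
    assert(unicode.hangulSyllableType.LV != unicode.hangulSyllableType.LVT);

    // An LVT syllable decomposes into three jamos and composes back.
    assert(decomposeHangul('\uD4DB')[].equal("\u1111\u1171\u11B6"));
    assert(composeJamo('\u1111', '\u1171', '\u11B6') == '\uD4DB');
}
```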

References

ASCII Table, Wikipedia, The Unicode Consortium, Unicode normalization forms, Unicode text segmentation, Unicode Implementation Guidelines, Unicode Conformance

Trademarks

Unicode(tm) is a trademark of Unicode, Inc.

Standards

Unicode v6.2

Functions

Name	Description
`allowedIn`	Tests if dchar `ch` is always allowed (Quick_Check=YES) in normalization form `norm`.
`asCapitalized`	Capitalize input range or string, meaning convert the first character to upper case and subsequent characters to lower case.
`asLowerCase`	Convert input range or string to lower case.
`asUpperCase`	Convert input range or string to upper case.
`byCodePoint`	Lazily transform a `range` of Graphemes to a `range` of code points.
`byGrapheme`	Iterate a string by grapheme.
`combiningClass`	Returns the combining class of `ch`.
`compose`	Try to canonically `compose` 2 characters. Returns the composed character if they do `compose` and dchar.init otherwise.
`composeJamo`	Try to `compose` hangul syllable out of a leading consonant (`lead`), a `vowel` and optional `trailing` consonant jamos.
`decodeGrapheme`	Reads one full grapheme cluster from an input range of dchar `inp`.
`decompose`	Returns a full Canonical (by default) or Compatibility decomposition of `ch`. If no decomposition is available returns a `Grapheme` with the `ch` itself.
`decomposeHangul`	Decomposes a Hangul syllable. If `ch` is not a composed syllable then this function returns `Grapheme` containing only `ch` as is.
`graphemeStride`	Computes the length of grapheme cluster starting at `index`. Both the resulting length and the `index` are measured in code units.
`icmp`	Does case insensitive comparison of `str1` and `str2`. Follows the rules of full case-folding mapping. This includes matching as equal german ß with "ss" and other 1:M mappings unlike `sicmp`. The cost of `icmp` being pedantically correct is slightly worse performance.
`isAlpha`	Returns whether `c` is a Unicode alphabetic character (general Unicode category: Alphabetic).
`isControl`	Returns whether `c` is a Unicode control character (general Unicode category: Cc).
`isFormat`	Returns whether `c` is a Unicode formatting character (general Unicode category: Cf).
`isGraphical`	Returns whether `c` is a Unicode graphical character (general Unicode category: L, M, N, P, S, Zs).
`isLower`	Returns whether `c` is a Unicode lowercase character.
`isMark`	Returns whether `c` is a Unicode mark character (general Unicode category: Mn, Me, Mc).
`isNonCharacter`	Returns whether `c` is a Unicode non-character, i.e. a code point with no assigned abstract character (general Unicode category: Cn).
`isNumber`	Returns whether `c` is a Unicode numerical character (general Unicode category: Nd, Nl, No).
`isPrivateUse`	Returns whether `c` is a Unicode Private Use code point (general Unicode category: Co).
`isPunctuation`	Returns whether `c` is a Unicode punctuation character (general Unicode category: Pd, Ps, Pe, Pc, Po, Pi, Pf).
`isSpace`	Returns whether `c` is a Unicode space character (general Unicode category: Zs).
`isSurrogate`	Returns whether `c` is a Unicode surrogate code point (general Unicode category: Cs).
`isSurrogateHi`	Returns whether `c` is a Unicode high surrogate (lead surrogate).
`isSurrogateLo`	Returns whether `c` is a Unicode low surrogate (trail surrogate).
`isSymbol`	Returns whether `c` is a Unicode symbol character (general Unicode category: Sm, Sc, Sk, So).
`isUpper`	Returns whether `c` is a Unicode uppercase character.
`isWhite`	Returns whether `c` is a Unicode whitespace character (general Unicode category: part of C0 (tab, vertical tab, form feed, carriage return, and linefeed characters), Zs, Zl, Zp, and NEL (U+0085)).
`normalize`	Returns `input` string normalized to the chosen form. Form C is used by default.
`sicmp`	Does basic case-insensitive comparison of strings `str1` and `str2`. This function uses simpler comparison rule thus achieving better performance than `icmp`. However keep in mind the warning below.
`toDelegate`	Builds a `Trie` with typically optimal speed-size trade-off and wraps it into a delegate of the following type: `bool delegate(dchar ch)`.
`toLower`	Returns a string which is identical to `s` except that all of its characters are converted to lowercase (by performing Unicode lowercase mapping). If none of the characters of `s` were affected, then `s` itself is returned.
`toLower`	If `c` is a Unicode uppercase character, then its lowercase equivalent is returned. Otherwise `c` is returned.
`toLowerInPlace`	Converts `s` to lowercase (by performing Unicode lowercase mapping) in place. For a few characters string length may increase after the transformation, in such a case the function reallocates exactly once. If `s` does not have any uppercase characters, then `s` is unaltered.
`toTrie`	Convenience function to construct optimal configurations for packed Trie from any `set` of code points.
`toUpper`	If `c` is a Unicode lowercase character, then its uppercase equivalent is returned. Otherwise `c` is returned.
`toUpper`	Returns a string which is identical to `s` except that all of its characters are converted to uppercase (by performing Unicode uppercase mapping). If none of the characters of `s` were affected, then `s` itself is returned.
`toUpperInPlace`	Converts `s` to uppercase (by performing Unicode uppercase mapping) in place. For a few characters string length may increase after the transformation, in such a case the function reallocates exactly once. If `s` does not have any lowercase characters, then `s` is unaltered.
`utfMatcher`	Constructs a matcher object to classify code points from the `set` for an encoding that has `Char` as code unit.
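
A brief sketch of the case conversion entries in the table above. The sample strings are arbitrary; the lazy `as*` functions return ranges, so they are compared with `equal`.

```d
import std.algorithm.comparison : equal;
import std.uni;

void main()
{
    // Single code point mapping.
    assert(toLower('Я') == 'я');
    assert(toUpper('я') == 'Я');

    // Eager string mapping.
    assert(toUpper("привет") == "ПРИВЕТ");
    // An unaffected string is returned as is, without a copy.
    string lower = "already lower";
    assert(toLower(lower) is lower);

    // Lazy range-based variants.
    assert("hEllo".asCapitalized.equal("Hello"));
    assert("hEllo".asUpperCase.equal("HELLO"));
    assert("hEllo".asLowerCase.equal("hello"));
}
```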

Structs

Name	Description
`CodepointInterval`	The recommended type of std.typecons.Tuple to represent [a, b) intervals of code points. As used in `InversionList`. Any interval type should pass `isIntegralPair` trait.
`Grapheme`	A structure designed to effectively pack characters of a grapheme cluster.
`InversionList`	`InversionList` is a set of code points represented as an array of open-right [a, b) intervals (see `CodepointInterval` above). The name comes from the way the representation reads left to right. For instance a set of all values [10, 50), [80, 90), plus a singular value 60 looks like this:
`MatcherConcept`	Conceptual type that outlines the common properties of all UTF Matchers.
`unicode`	A single entry point to look up Unicode sets by name or alias of a `block`, `script` or general category.

Enums

Name	Description
`NormalizationForm`	Enumeration type for normalization forms, passed as template parameter for functions like `normalize`.
`UnicodeDecomposition`	Unicode character decomposition type.

Templates

Name	Description
`codepointSetTrie`	A shorthand for creating a custom multi-level fixed Trie from a `CodepointSet`. `sizes` are numbers of bits per level, with the most significant bits used first.
`codepointTrie`	A slightly more general tool for building fixed `Trie` for the Unicode data.
`isCodepointSet`	Tests if T is some kind of set of code points. Intended for template constraints.

Enum values

Name	Type	Description
`isIntegralPair`		Tests if `T` is a pair of integers that implicitly convert to `V`. The following code must compile for any pair `T`:
`isUtfMatcher`		Tests if `M` is a UTF Matcher for ranges of `Char`.
`lineSep`		Constant (0x2028) - line separator.
`nelSep`		Constant (0x0085) - next line.
`NFC`		Shorthand aliases from values indicating normalization forms.
`NFD`		Shorthand aliases from values indicating normalization forms.
`NFKC`		Shorthand aliases from values indicating normalization forms.
`NFKD`		Shorthand aliases from values indicating normalization forms.
`paraSep`		Constant (0x2029) - paragraph separator.

Aliases

Name	Type	Description
`CodepointSet`	`InversionList!(std.uni.GcPolicy)`	The recommended default type for a set of code points. For details, see the current implementation: `InversionList`.
`CodepointSetTrie`	`typeof(TrieBuilder!(bool,dchar,lastDchar+1,Prefix)(false).build())`	Type of Trie generated by `codepointSetTrie` function.
`CodepointTrie`	`typeof(TrieBuilder!(T,dchar,lastDchar+1,Prefix)(T.init).build())`	Type of Trie as generated by `codepointTrie` function.
