Unicode property escapes

Draft
This page is not complete.

Unicode property escapes in regular expressions allow matching Unicode characters based on their Unicode character properties. They make it possible to distinguish between types of characters such as uppercase and lowercase letters, math symbols, and punctuation. The feature, adopted from PCRE and similar regex engines, has been available on the RegExp object since ES2018 and requires the u (unicode) flag.
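As a minimal sketch of the idea (the sample characters and variable names are illustrative, not taken from this page), the following tests a few characters against property escapes. It assumes an ES2018-capable engine, and it relies on the u flag, without which \p is treated as a plain escaped "p":

```javascript
// Property escapes are only recognized when the `u` (unicode) flag is set.
const upper = /\p{Uppercase_Letter}/u;
const punctuation = /\p{Punctuation}/u;
const mathSymbol = /\p{Math_Symbol}/u;

const latinUpper = upper.test("A");          // Latin capital letter
const greekUpper = upper.test("\u03A3");     // GREEK CAPITAL LETTER SIGMA
const latinLower = upper.test("a");          // lowercase, so no match
const invertedQ = punctuation.test("\u00BF"); // INVERTED QUESTION MARK
const summation = mathSymbol.test("\u2211");  // N-ARY SUMMATION

// Without the `u` flag, the same source text matches the literal string "p{Lu}".
const literalMatch = /\p{Lu}/.test("p{Lu}");
const noPropertyMatch = /\p{Lu}/.test("A");

console.log({ latinUpper, greekUpper, latinLower, invertedQ, summation });
console.log({ literalMatch, noPropertyMatch });
```

Forgetting the flag does not throw, so a pattern can silently stop matching what was intended.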

Syntax

// Non-binary values
\p{UnicodePropertyValue}
\p{UnicodePropertyName=UnicodePropertyValue}

// Binary properties
\p{UnicodeBinaryPropertyName}

// \P is the negation of \p
\P{UnicodePropertyValue}
\P{UnicodeBinaryPropertyName}
UnicodePropertyName
One of General_Category (gc), Script (sc), or Script_Extensions (scx).
UnicodePropertyValue
One of the tokens listed in the Values section below. Many values have a shorthand alias (for example, Lu for Uppercase_Letter). For General_Category values, the UnicodePropertyName part and the equals sign may be omitted. If a UnicodePropertyName is specified, the value must be valid for that property.
UnicodeBinaryPropertyName
One of the binary names listed below.
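The forms above can be sketched as follows. This is an illustration under the assumption of an ES2018-capable engine; the variable names and sample characters are made up for the example:

```javascript
// For General_Category values, the short alias, the long value name, and the
// explicit name=value forms all describe the same set of characters.
const forms = [
  /^\p{Lu}$/u,
  /^\p{Uppercase_Letter}$/u,
  /^\p{General_Category=Uppercase_Letter}$/u,
  /^\p{gc=Lu}$/u,
];
// U+0414 is CYRILLIC CAPITAL LETTER DE.
const allMatchUpper = forms.every((re) => re.test("\u0414"));

// Script and Script_Extensions values need the property name (or its alias).
const greekLong = /^\p{Script=Greek}+$/u;
const greekShort = /^\p{sc=Grek}+$/u;
const sample = "\u03B1\u03B2\u03B3"; // Greek small letters alpha, beta, gamma
const isGreek = greekLong.test(sample) && greekShort.test(sample);

// \P negates: any code point that is NOT an uppercase letter.
const notUpper = /^\P{Lu}$/u;
const negationWorks = notUpper.test("\u0434") && !notUpper.test("\u0414");

// A value that does not belong to the named property is a SyntaxError.
let mismatchThrows = false;
try {
  new RegExp("\\p{Script=Lu}", "u");
} catch (err) {
  mismatchThrows = err instanceof SyntaxError;
}

console.log({ allMatchUpper, isGreek, negationWorks, mismatchThrows });
```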

Values

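This section is currently missing many aliases, and its use of UnicodePropertyName as a placeholder is imprecise: every value listed below belongs to General_Category, so the long form is \p{General_Category=Value} (or \p{gc=Value}). Please refer to the specification for the complete list.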

Non-binary

Escape Meaning
\p{LC}
\p{Cased_Letter}
\p{UnicodePropertyName=Cased_Letter}
Any letter with both lowercase and uppercase variants. This is equivalent to \p{Lu}|\p{Ll}|\p{Lt}.
\p{Close_Punctuation}
\p{UnicodePropertyName=Close_Punctuation}
\p{Connector_Punctuation}
\p{UnicodePropertyName=Connector_Punctuation}
\p{Control}
\p{UnicodePropertyName=Control}
\p{Currency_Symbol}
\p{UnicodePropertyName=Currency_Symbol}
\p{Dash_Punctuation}
\p{UnicodePropertyName=Dash_Punctuation}
\p{Decimal_Number}
\p{UnicodePropertyName=Decimal_Number}
\p{Enclosing_Mark}
\p{UnicodePropertyName=Enclosing_Mark}
\p{Final_Punctuation}
\p{UnicodePropertyName=Final_Punctuation}
\p{Format}
\p{UnicodePropertyName=Format}
\p{Initial_Punctuation}
\p{UnicodePropertyName=Initial_Punctuation}
\p{Letter}
\p{UnicodePropertyName=Letter}
\p{Letter_Number}
\p{UnicodePropertyName=Letter_Number}
\p{Line_Separator}
\p{UnicodePropertyName=Line_Separator}
\p{Lowercase_Letter}
\p{UnicodePropertyName=Lowercase_Letter}
\p{Mark}
\p{UnicodePropertyName=Mark}
\p{Math_Symbol}
\p{UnicodePropertyName=Math_Symbol}
\p{Modifier_Letter}
\p{UnicodePropertyName=Modifier_Letter}
\p{Modifier_Symbol}
\p{UnicodePropertyName=Modifier_Symbol}
\p{Nonspacing_Mark}
\p{UnicodePropertyName=Nonspacing_Mark}
\p{Number}
\p{UnicodePropertyName=Number}
\p{Open_Punctuation}
\p{UnicodePropertyName=Open_Punctuation}
\p{Other}
\p{UnicodePropertyName=Other}
\p{Other_Letter}
\p{UnicodePropertyName=Other_Letter}
\p{Other_Number}
\p{UnicodePropertyName=Other_Number}
\p{Other_Punctuation}
\p{UnicodePropertyName=Other_Punctuation}
\p{Paragraph_Separator}
\p{UnicodePropertyName=Paragraph_Separator}
\p{Private_Use}
\p{UnicodePropertyName=Private_Use}
\p{Punctuation}
\p{UnicodePropertyName=Punctuation}
\p{Separator}
\p{UnicodePropertyName=Separator}
\p{Space_Separator}
\p{UnicodePropertyName=Space_Separator}
\p{Spacing_Mark}
\p{UnicodePropertyName=Spacing_Mark}
\p{Surrogate}
\p{UnicodePropertyName=Surrogate}
\p{Symbol}
\p{UnicodePropertyName=Symbol}
\p{Titlecase_Letter}
\p{UnicodePropertyName=Titlecase_Letter}
\p{Unassigned}
\p{UnicodePropertyName=Unassigned}
\p{Uppercase_Letter}
\p{UnicodePropertyName=Uppercase_Letter}
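To illustrate a few of the General_Category values listed above, here is a small sketch. The sample characters and variable names are chosen for the example only, and an ES2018-capable engine is assumed:

```javascript
// \p{LC} (Cased_Letter) behaves like the union of Lu, Ll and Lt.
const casedLetter = /^\p{LC}$/u;
const casedUnion = /^[\p{Lu}\p{Ll}\p{Lt}]$/u;

const samples = [
  "A",         // Uppercase_Letter
  "a",         // Lowercase_Letter
  "\u{01C5}",  // Titlecase_Letter (LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON)
  "\u{4E2D}",  // Other_Letter (a CJK ideograph): a letter, but not cased
  "5",         // Decimal_Number
];
const unionAgrees = samples.every(
  (ch) => casedLetter.test(ch) === casedUnion.test(ch)
);

// An ideograph is a Letter but not a Cased_Letter.
const ideographIsLetter = /^\p{Letter}$/u.test("\u{4E2D}");
const ideographIsCased = casedLetter.test("\u{4E2D}");

// Decimal_Number covers digits from any script, not only ASCII 0-9.
// "\u0664\u0662" is ARABIC-INDIC DIGIT FOUR followed by DIGIT TWO.
const digitRuns = "Room 42, \u0664\u0662".match(/\p{Decimal_Number}+/gu);

console.log({ unionAgrees, ideographIsLetter, ideographIsCased, digitRuns });
```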

Binary

Escape Meaning
\p{Alphabetic}
\p{Bidi_Control}
\p{Bidi_Mirrored}
\p{Case_Ignorable}
\p{Cased}
\p{Changes_When_Casefolded}
\p{Changes_When_Casemapped}
\p{Changes_When_Lowercased}
\p{Changes_When_NFKC_Casefolded}
\p{Changes_When_Titlecased}
\p{Changes_When_Uppercased}
\p{Dash}
\p{Default_Ignorable_Code_Point}
\p{Deprecated}
\p{Diacritic}
\p{Emoji}
\p{Emoji_Component}
\p{Emoji_Modifier}
\p{Emoji_Modifier_Base}
\p{Emoji_Presentation}
\p{Extender}
\p{Grapheme_Base}
\p{Grapheme_Extend}
\p{Hex_Digit}
\p{ID_Continue}
\p{ID_Start}
\p{Ideographic}
\p{IDS_Binary_Operator}
\p{IDS_Trinary_Operator}
\p{Join_Control}
\p{Logical_Order_Exception}
\p{Lowercase}
\p{Math}
\p{Noncharacter_Code_Point}
\p{Pattern_Syntax}
\p{Pattern_White_Space}
\p{Quotation_Mark}
\p{Radical}
\p{Regional_Indicator}
\p{Sentence_Terminal}
\p{Soft_Dotted}
\p{Terminal_Punctuation}
\p{Unified_Ideograph}
\p{Uppercase}
\p{Variation_Selector}
\p{White_Space}
\p{XID_Continue}
\p{XID_Start}
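Binary properties take no value: a code point either has the property or it does not. The sketch below uses a handful of the names listed above; the sample strings and variable names are illustrative, and an ES2018-capable engine is assumed:

```javascript
// White_Space covers more than ASCII spaces: here NO-BREAK SPACE (U+00A0)
// and EM SPACE (U+2003) are both treated as separators.
const words = "a\u00A0b\u2003c".split(/\p{White_Space}+/u);

// Hex_Digit accepts 0-9, A-F and a-f (plus their fullwidth forms).
const hexOnly = /^\p{Hex_Digit}+$/u;
const validHex = hexOnly.test("DEADbeef");
const invalidHex = hexOnly.test("xyz");

// A binary property is not the same thing as a similarly named
// General_Category value. ROMAN NUMERAL FOUR (U+2163) has the Uppercase
// property, but its General_Category is Letter_Number, not Uppercase_Letter.
const romanFour = "\u2163";
const hasUppercase = /\p{Uppercase}/u.test(romanFour);
const isUppercaseLetter = /\p{Uppercase_Letter}/u.test(romanFour);

// \P works for binary properties as well.
const firstNonSpace = " \u00A0x".match(/\P{White_Space}/u)[0];

console.log({ words, validHex, invalidHex, hasUppercase, isUppercaseLetter, firstNonSpace });
```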

Examples

  • A regular expression that checks for valid Unicode usernames: \p{Letter}\p{Letter_Number}+
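A sketch of how the username pattern above might be used. The ^ and $ anchors, the u flag, and the test strings are assumptions added for the example. Note that Letter_Number covers characters such as Roman numerals rather than ordinary letters or digits, so the second pattern, which is not part of this page, is offered only as a hypothetical variant for names made of letters and numbers:

```javascript
// The pattern from the example, anchored so the whole string must match.
const asWritten = /^\p{Letter}\p{Letter_Number}+$/u;

// "x" followed by ROMAN NUMERAL FOUR and ROMAN NUMERAL FIVE.
const romanSuffix = asWritten.test("x\u2163\u2164");
// Ordinary letters are not in Letter_Number, so this does not match.
const plainName = asWritten.test("alice");

// Hypothetical variant (an assumption, not from the page): a letter followed
// by one or more letters or numbers from any script.
const variant = /^\p{Letter}[\p{Letter}\p{Number}]+$/u;
const latinName = variant.test("alice42");
const cyrillicName = variant.test("Ольга7");
const leadingDigit = variant.test("7up");

console.log({ romanSuffix, plainName, latinName, cyrillicName, leadingDigit });
```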

Specifications

Specification: ECMAScript 2018 (ECMA-262), the definition of 'UnicodeMatchProperty' in that specification.
Status: Standard
Comment: Initial definition, based on proposal-regexp-unicode-property-escapes.

Browser compatibility

Polyfill

The proposal lists several approaches that were used before native support for this feature existed. XRegExp provides an extended regular expression syntax for JavaScript that includes this feature. However, because of the large amount of Unicode data required, it can be more economical to generate equivalent patterns at build time with libraries such as Regenerate. Unicode.org also has a page that offers a similar conversion.
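Before reaching for a polyfill, it can help to detect native support at run time. The following is a minimal sketch: the function names are hypothetical, and the fallback character class is a rough, hand-picked Latin-only approximation rather than the output of any of the libraries mentioned above:

```javascript
// Returns true when the engine understands \p{...} with the `u` flag.
// The RegExp constructor is used instead of a literal so that engines
// without support throw at run time rather than failing to parse the file.
function supportsUnicodePropertyEscapes() {
  try {
    return new RegExp("\\p{Lu}", "u").test("A");
  } catch (err) {
    return false;
  }
}

// Hypothetical helper: native \p{L} when available, otherwise a limited
// Latin-only range (an assumption for illustration, far from complete).
const letterPattern = supportsUnicodePropertyEscapes()
  ? new RegExp("^\\p{L}$", "u")
  : /^[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]$/;

function isLetter(ch) {
  return letterPattern.test(ch);
}

console.log(supportsUnicodePropertyEscapes(), isLetter("A"), isLetter("1"));
```

A build-time approach avoids this branch entirely by shipping precomputed character classes, at the cost of larger patterns.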

See also