Regexp - Character Class (Character Set)

About

A character class defines the set of characters that are allowed to match at a single position in a regular expression.

It is also known as a character set (in a regular expression).

Do not confuse it with the character set used to encode text into bits, although a character class can represent a whole character set.

For instance, if your regular expression engine supports it, you can represent all ASCII characters with:

[:ASCII:]
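Not every engine accepts the [:ASCII:] spelling. As a minimal sketch, assuming Java's java.util.regex (where the same class is written \p{ASCII}), the check below tests whether a text contains only ASCII characters. The class name AsciiClassDemo and the method isAscii are hypothetical names chosen for the illustration.

```java
import java.util.regex.Pattern;

final class AsciiClassDemo {
    // \p{ASCII} is Java's spelling for "any ASCII character" (U+0000 to U+007F).
    private static final Pattern ASCII_ONLY = Pattern.compile("\\p{ASCII}*");

    /** Returns true when every character of the text is an ASCII character. */
    static boolean isAscii(String text) {
        return ASCII_ONLY.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAscii("regexp"));     // true
        System.out.println(isAscii("caf\u00e9"));  // false: U+00E9 is outside ASCII
    }
}
```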

Syntax

A character class is written between square brackets:

[]

where:

  • [ starts the character class definition
  • ] ends the character class definition

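Inside the brackets, literal characters can be combined with the metacharacters and constructs below. The nested-class union and the && intersection are supported by Java and a few other engines, but not by every engine.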

Construct       Matches                                       Operation
[abc]           a, b, or c                                    simple class
[^abc]          Any character except a, b, or c               negation
[a-zA-Z]        a through z or A through Z, inclusive         range
[a-d[m-p]]      a through d, or m through p: [a-dm-p]         union
[a-z&&[def]]    d, e, or f                                    intersection
[a-z&&[^bc]]    a through z, except for b and c: [ad-z]       subtraction
[a-z&&[^m-p]]   a through z, and not m through p: [a-lq-z]    subtraction
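The union, intersection and subtraction rows can be checked directly. This is a small sketch assuming Java's java.util.regex, which accepts these constructs; ClassOperationsDemo and its matches helper are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

final class ClassOperationsDemo {
    /** Returns true when the whole input matches the given character class. */
    static boolean matches(String characterClass, String input) {
        return Pattern.matches(characterClass, input);
    }

    public static void main(String[] args) {
        // Union: a through d, or m through p.
        System.out.println(matches("[a-d[m-p]]", "n"));     // true
        System.out.println(matches("[a-d[m-p]]", "e"));     // false
        // Intersection: only d, e, or f.
        System.out.println(matches("[a-z&&[def]]", "e"));   // true
        System.out.println(matches("[a-z&&[def]]", "a"));   // false
        // Subtraction: a through z, except b and c.
        System.out.println(matches("[a-z&&[^bc]]", "b"));   // false
        // Subtraction: a through z, and not m through p.
        System.out.println(matches("[a-z&&[^m-p]]", "l"));  // true
    }
}
```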

Other:

Construct            Description                                                                          Example
[..]                 Specifies one collation element, and can be a multicharacter element                 [.ch.] in Spanish
[:characterClass:]   Specifies character classes. It matches any character within the character class.    [:alpha:] (see POSIX)
[==]                 Specifies equivalence classes.                                                       [=a=] matches all characters having base letter 'a'.

Meta

Meta-character   Description                               Example
\                general escape character
^                negates the class, but only when it is    [^abc] matches any character other than a, b, or c.
                 the first character                       [^a-z] matches any single character that is not a lowercase letter from a to z.
-                indicates a character range               [abc] matches a, b, or c.
                                                           [a-z] specifies a range which matches any lowercase letter from a to z.

These forms can be mixed: [abcx-z] matches a, b, c, x, y, and z, as does [a-cx-z].
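The behaviour of the three metacharacters can be sketched as follows, assuming Java's java.util.regex. The class NegationAndRangeDemo and its two helper methods are hypothetical names for this illustration; the comparison is limited to ASCII characters to keep the example short.

```java
import java.util.regex.Pattern;

final class NegationAndRangeDemo {
    /** Returns true when both classes accept exactly the same ASCII characters. */
    static boolean sameAsciiMatches(String classA, String classB) {
        Pattern a = Pattern.compile(classA);
        Pattern b = Pattern.compile(classB);
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    /** Removes every character that is not a lowercase letter from a to z. */
    static String keepLowercase(String text) {
        return text.replaceAll("[^a-z]", "");
    }

    public static void main(String[] args) {
        // A list and a range can be mixed: both classes accept a, b, c, x, y, z.
        System.out.println(sameAsciiMatches("[abcx-z]", "[a-cx-z]")); // true
        // Negation: everything except a through z is removed.
        System.out.println(keepLowercase("Hello, World 42"));         // elloorld
        // A caret that is not the first character is a literal caret.
        System.out.println(Pattern.matches("[a^]", "^"));             // true
        // An escaped hyphen is a literal hyphen, so no range a-z is formed.
        System.out.println(Pattern.matches("[a\\-z]", "m"));          // false
    }
}
```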

Type

POSIX

Because the meaning of a character range depends on the chosen locale setting (in some settings letters are ordered abc…zABC…Z, in others aAbBcC…zZ), the POSIX standard defines named classes or categories of characters, shown in the following table. In POSIX syntax a named class is written inside a bracket expression, for example [[:alpha:]].

POSIX class  ASCII equivalent                        Description
[:alnum:]    [A-Za-z0-9]                             Alphanumeric characters
[:alpha:]    [A-Za-z]                                Alphabetic characters
[:lower:]    [a-z]                                   Lowercase letters
[:upper:]    [A-Z]                                   Uppercase letters
[:blank:]    [ \t]                                   Space and tab
[:cntrl:]    [\x00-\x1F\x7F]                         Control characters
[:digit:]    [0-9]                                   Digits
[:graph:]    [\x21-\x7E]                             Visible characters
[:print:]    [\x20-\x7E]                             Visible characters and the space character
[:punct:]    [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]    Punctuation characters
[:space:]    [ \t\r\n\v\f]                           Whitespace characters
[:xdigit:]   [A-Fa-f0-9]                             Hexadecimal digits
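Engines differ in how they spell these classes. As a hedged sketch, Java's java.util.regex does not use the [:name:] bracket form but offers ASCII-only counterparts such as \p{Alpha}, \p{Punct}, \p{Blank} and \p{XDigit}. The class PosixClassDemo and its matches helper are hypothetical names for this example.

```java
import java.util.regex.Pattern;

final class PosixClassDemo {
    /** Returns true when the whole input matches the given expression. */
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \p{Alpha} corresponds to [:alpha:] and is ASCII-only by default.
        System.out.println(matches("\\p{Alpha}+", "Regexp"));  // true
        System.out.println(matches("\\p{Alpha}", "\u00e9"));   // false
        // \p{XDigit} corresponds to [:xdigit:].
        System.out.println(matches("\\p{XDigit}+", "CAFE42")); // true
        // \p{Blank} corresponds to [:blank:]: space and tab only.
        System.out.println(matches("\\p{Blank}", "\t"));       // true
        System.out.println(matches("\\p{Blank}", "\n"));       // false
        // \p{Space} corresponds to [:space:] and also accepts line breaks.
        System.out.println(matches("\\p{Space}", "\n"));       // true
        // \p{Punct} corresponds to [:punct:].
        System.out.println(matches("\\p{Punct}", "^"));        // true
    }
}
```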

Unicode Set

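Engines that support Unicode sets (for instance the UnicodeSet notation of ICU) expose Unicode properties as character classes, for example:

Unicode               Description
[:ASCII:]             the set of ASCII characters
[:Lowercase:]         the set of lowercase characters
[:Lowercase_Letter:]  the set of lowercase letters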
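Lowercase and Lowercase_Letter are not the same set: the first is a binary property, the second a general category. The sketch below assumes Java's java.util.regex, where they are spelled \p{IsLowercase} and \p{Ll} rather than with the [:name:] form. UnicodePropertyDemo and its matches helper are hypothetical names, and the small Roman numeral one (U+2170) is used as a character that is lowercase without being a letter.

```java
import java.util.regex.Pattern;

final class UnicodePropertyDemo {
    /** Returns true when the whole input matches the given expression. */
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // General category Lowercase_Letter (Ll).
        System.out.println(matches("\\p{Ll}", "a"));                // true
        System.out.println(matches("\\p{Ll}", "A"));                // false
        System.out.println(matches("\\p{Ll}", "\u00e9"));           // true: not limited to ASCII
        // Binary property Lowercase: a superset of Lowercase_Letter.
        System.out.println(matches("\\p{IsLowercase}", "a"));       // true
        // U+2170 is a letter-like number (category Nl) with the Lowercase property.
        System.out.println(matches("\\p{IsLowercase}", "\u2170"));  // true
        System.out.println(matches("\\p{Ll}", "\u2170"));           // false
    }
}
```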

Shorthand

The shorthand syntax begins with a backslash followed by a letter, for example \d, \w or \s. See Regular Expression - Backslash Generic Character Class (shorthand)
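As a short illustration, assuming Java's java.util.regex, the shorthands \d (digit), \w (word character) and \s (whitespace) and their uppercase negations can be used wherever a bracketed class could. ShorthandDemo and its methods are hypothetical names for this sketch.

```java
final class ShorthandDemo {
    /** Keeps only the digits: \D is the negation of \d. */
    static String digitsOnly(String text) {
        return text.replaceAll("\\D", "");
    }

    /** Splits the text on runs of whitespace (\s+). */
    static String[] words(String text) {
        return text.trim().split("\\s+");
    }

    /** Returns true when the text consists only of word characters (\w is [a-zA-Z_0-9]). */
    static boolean isWordOnly(String text) {
        return text.matches("\\w+");
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("Order 66 shipped"));  // 66
        System.out.println(words("two \t words").length);    // 2
        System.out.println(isWordOnly("a1_b2"));             // true
        System.out.println(isWordOnly("a-b"));               // false
    }
}
```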

