Regexp - Character Class (Character Set)

About

A character class defines a domain of permitted characters.

It is also known as a character set (in a regular expression).

It should not be confused with the character set used to encode text into bits, although a character class can represent a whole character set.

For instance, if your regular expression engine supports it, you can represent all ASCII characters with:

[:ASCII:]

Syntax

A character class is written with square brackets:

[]

where:

  • [ starts the character class definition
  • ] ends the character class definition

Inside the brackets, any character can be used, combined with the meta-characters described below. The nested forms (union, intersection, subtraction) are not supported by every engine.

Construct        Matches                                               Operation
[abc]            a, b, or c                                            simple class
[^abc]           any character except a, b, or c                       negation
[a-zA-Z]         a through z or A through Z, inclusive                 range
[a-d[m-p]]       a through d, or m through p: [a-dm-p]                 union
[a-z&&[def]]     d, e, or f                                            intersection
[a-z&&[^bc]]     a through z, except for b and c: [ad-z]               subtraction
[a-z&&[^m-p]]    a through z, and not m through p: [a-lq-z]            subtraction
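
The constructs in the table can be checked with a short script. This is a minimal sketch in Python, whose built-in re module does not support the nested union, intersection, and subtraction syntax. It therefore uses the flattened equivalents given in the table, and emulates the intersection with a lookahead. The helper name matched is hypothetical and only serves the illustration.

```python
import re
import string

lowercase = string.ascii_lowercase  # "abcdefghijklmnopqrstuvwxyz"


def matched(pattern: str, candidates: str) -> str:
    """Return the characters of `candidates` that the pattern matches (hypothetical helper)."""
    return "".join(ch for ch in candidates if re.fullmatch(pattern, ch))


simple = matched(r"[abc]", lowercase)        # simple class
negated = matched(r"[^abc]", "abcd")         # negation
ranged = matched(r"[a-zA-Z]", "aZ9_")        # range
union = matched(r"[a-dm-p]", lowercase)      # flattened form of [a-d[m-p]]
subtraction = matched(r"[ad-z]", lowercase)  # flattened form of [a-z&&[^bc]]
# Intersection emulated with a lookahead: the character must be in a-z AND in def.
intersection = matched(r"(?=[a-z])[def]", lowercase)

print(simple, negated, ranged, union, intersection)
```

In an engine that supports the nested syntax, the patterns from the Construct column could be used directly instead of the flattened forms.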

Other:

  • [..] specifies one collation element, which can be a multi-character element. Example: [.ch.] in Spanish
  • [:characterClass:] specifies a named character class and matches any character within it. Example: [:alpha:] (see POSIX)
  • [==] specifies an equivalence class. Example: [=a=] matches all characters having the base letter 'a'

Meta

Meta-character, description, and example:

  • \ is the general escape character.
  • ^ negates the class, but only if it is the first character. [^abc] matches any character other than a, b, or c. [^a-z] matches any single character that is not a lowercase letter from a to z.
  • - indicates a character range. [abc] matches a, b, or c. [a-z] specifies a range which matches any lowercase letter from a to z.

These forms can be mixed: [abcx-z] matches a, b, c, x, y, and z, as does [a-cx-z].
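
The behavior of these meta-characters depends on their position inside the brackets. The following sketch, written for Python's re module (other engines may differ in details), shows each one acting as a meta-character and as a literal. The variable names are illustrative only.

```python
import re

# ^ negates only when it is the first character of the class.
negating = re.findall(r"[^ab]", "ab^c")        # everything except a and b
literal_caret = re.findall(r"[ab^]", "ab^c")   # a, b, or a literal ^

# - forms a range between two characters; at the end of the class it is literal.
range_dash = re.findall(r"[a-c]", "a-c b")     # a, b, or c
literal_dash = re.findall(r"[ac-]", "a-c b")   # a, c, or a literal -

# \ escapes a character that would otherwise be special inside the class.
escaped = re.findall(r"[\]\\\-]", r"a]b\c-d")  # ], backslash, or -

# Single characters and ranges can be mixed.
mixed_1 = re.findall(r"[abcx-z]", "abcdwxyz")
mixed_2 = re.findall(r"[a-cx-z]", "abcdwxyz")

print(negating, literal_caret, range_dash, literal_dash, escaped, mixed_1)
```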

Type

POSIX

Since many ranges of characters depend on the chosen locale setting (i.e., in some settings letters are organized as abc…zABC…Z, while in some others as aAbBcC…zZ), the POSIX standard defines some classes or categories of characters as shown in the following table:

POSIX        ASCII                                     Description
[:alnum:]    [A-Za-z0-9]                               Alphanumeric characters
[:alpha:]    [A-Za-z]                                  Alphabetic characters
[:lower:]    [a-z]                                     Lowercase letters
[:upper:]    [A-Z]                                     Uppercase letters
[:blank:]    [ \t]                                     Space and tab
[:cntrl:]    [\x00-\x1F\x7F]                           Control characters
[:digit:]    [0-9]                                     Digits
[:graph:]    [\x21-\x7E]                               Visible characters
[:print:]    [\x20-\x7E]                               Visible characters and spaces
[:punct:]    [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]      Punctuation characters
[:space:]    [ \t\r\n\v\f]                             Whitespace characters
[:xdigit:]   [A-Fa-f0-9]                               Hexadecimal digits
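
When an engine has no POSIX class names, the ASCII column can be used directly. Python's built-in re module is one such engine, so the sketch below maps a subset of the POSIX names to their ASCII equivalents. The names POSIX_ASCII and posix_findall are hypothetical, and the punctuation class is built from string.punctuation instead of being typed by hand.

```python
import re
import string

# ASCII equivalents of a subset of the POSIX classes (illustrative mapping).
POSIX_ASCII = {
    "alnum": r"[A-Za-z0-9]",
    "alpha": r"[A-Za-z]",
    "blank": r"[ \t]",
    "digit": r"[0-9]",
    "space": r"[ \t\r\n\v\f]",
    "xdigit": r"[A-Fa-f0-9]",
    # string.punctuation holds the 32 ASCII punctuation characters;
    # re.escape() protects the ones that are special inside a class.
    "punct": "[" + re.escape(string.punctuation) + "]",
}


def posix_findall(name: str, text: str) -> list:
    """Return every character of `text` that belongs to the named class."""
    return re.findall(POSIX_ASCII[name], text)


print(posix_findall("xdigit", "0xBEEF"))
print(posix_findall("blank", "a b\tc\n"))
```

This mapping only covers ASCII. A locale-aware engine may match more characters for the same POSIX name.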

Unicode Set

Engines that support Unicode sets accept Unicode property names in the same notation, for example:

Unicode                Description
[:ASCII:]              the set of ASCII characters
[:Lowercase:]          the set of lowercase characters
[:Lowercase_Letter:]   the set of lowercase letters
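
Python's built-in re module does not accept Unicode property names, so the following sketch uses the standard unicodedata module as a stand-in to show what two of these sets contain. The helper names are hypothetical, and the Lowercase_Letter test relies on the general category Ll.

```python
import unicodedata


def is_ascii(ch: str) -> bool:
    """True if the character is in the ASCII set (code points 0 to 127)."""
    return ord(ch) < 128


def is_lowercase_letter(ch: str) -> bool:
    """True if the character has the general category Ll (Lowercase_Letter)."""
    return unicodedata.category(ch) == "Ll"


sample = "a\u00c9\u00e91"  # a, É, é, 1
ascii_chars = [ch for ch in sample if is_ascii(ch)]
lowercase_letters = [ch for ch in sample if is_lowercase_letter(ch)]

print(ascii_chars, lowercase_letters)
```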

Shorthand

The shorthand syntax begins with a backslash followed by a letter. See Regular Expression - Backslash Generic Character Class (shorthand)
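
As a brief illustration, the sketch below uses three common shorthands (\d for digits, \w for word characters, \s for whitespace) with Python's re module. The exact set each shorthand covers varies by engine and flags, and the sample text is made up for the example.

```python
import re

text = "id_42 = 7\t(ok)"

digits = re.findall(r"\d", text)        # single digits
words = re.findall(r"\w+", text)        # runs of letters, digits, and underscore
spaces = re.findall(r"\s", text)        # whitespace characters
# A shorthand can also be used inside square brackets.
digit_or_underscore = re.findall(r"[\d_]", text)

print(digits, words, spaces, digit_or_underscore)
```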
