Regexp - Word Characters

About

A word character is represented by the shorthand class \w and is defined as:

  • any letter (i.e. the class [A-Za-z])
  • or any digit (i.e. the class [0-9])
  • or the underscore character (i.e. the class [_])

It is therefore equivalent to the class [0-9A-Za-z_].

That is, any character which can be part of a Perl “word”.
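As a minimal sketch of this equivalence, the snippet below uses Python's standard re module (an assumption, since this page does not name a specific engine). The re.ASCII flag is passed because Python 3 otherwise extends \w to Unicode letters and digits. The sample string is made up for illustration.

```python
import re

# re.ASCII restricts \w to [a-zA-Z0-9_]; without it, Python 3 str patterns
# also match Unicode letters and digits.
shorthand = re.compile(r"\w", re.ASCII)
explicit = re.compile(r"[0-9A-Za-z_]")

sample = "snake_case-42!"  # hypothetical input
shorthand_hits = shorthand.findall(sample)
explicit_hits = explicit.findall(sample)

print(shorthand_hits == explicit_hits)  # True
print("".join(shorthand_hits))          # snake_case42
```

The hyphen and the exclamation mark are skipped by both patterns because they are not word characters.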

Definition of letters and digits versus Character Set

The definition of letters and digits is controlled by character tables, and may vary if locale-specific matching is taking place.

For example, in the “fr” (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
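The effect of the character set can be illustrated in Python, with the caveat that Python's re module is a different engine: for str patterns it relies on the Unicode database by default rather than on locale tables, and re.ASCII switches back to the ASCII-only definition. The following is only a sketch of the same idea, not a reproduction of locale-specific matching.

```python
import re

text = "caf\u00e9"  # "café" written with the precomposed letter U+00E9

unicode_hits = re.findall(r"\w", text)          # default: Unicode-aware \w
ascii_hits = re.findall(r"\w", text, re.ASCII)  # \w limited to [a-zA-Z0-9_]

print(unicode_hits)  # ['c', 'a', 'f', 'é']
print(ascii_hits)    # ['c', 'a', 'f']
```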

Boundary

Word boundary

A word boundary \b is a zero-width assertion that matches at a position where:

  • there is a word character (\w) on one side
  • and on the other side there is either a non-word character (\W) or the beginning or end of the string.
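To make the zero-width nature concrete, the sketch below lists the positions where \b matches, using Python's re module (assumed here, Python 3.7 or later). The input strings are hypothetical. Note that in Python a string edge is a boundary only when the adjacent character is a word character.

```python
import re

# Each match of \b is empty, so start() is the boundary position itself.
boundaries = [m.start() for m in re.finditer(r"\b", "The dog")]
print(boundaries)  # [0, 3, 4, 7]

# The string starts with "!", a non-word character, so index 0 is not a boundary.
edge_case = [m.start() for m in re.finditer(r"\b", "!dog")]
print(edge_case)  # [1, 4]
```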

Example 1:

  • The regex: \bdog\b
  • Input string to search: The dog plays in the yard
  • Result: Found the text dog starting at index 4 and ending at index 7.

Example 2:

  • The regex: \bdog\b
  • Input string to search: The doggie plays in the yard.
  • Result: No match found, because dog is followed by the word character g.
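These two searches can be reproduced with a short Python sketch (Python's re module is an assumption; the page does not say which engine produced the results). The end index is exclusive, as in the results quoted above.

```python
import re

pattern = re.compile(r"\bdog\b")

hit = pattern.search("The dog plays in the yard")
miss = pattern.search("The doggie plays in the yard.")

print(hit.span())  # (4, 7)
print(miss)        # None
```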

Non-word boundary

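A non-word boundary \B is the negation of \b: a zero-width assertion that matches at every position that is not a word boundary, for example between two word characters.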

Example 1:

  • The regex: \bdog\B
  • Input string to search: The dog plays in the yard.
  • Result: No match found, because dog is followed by a space, which is a word boundary.

Example 2:

  • The regex: \bdog\B
  • Input string to search: The doggie plays in the yard.
  • Result: Found the text dog starting at index 4 and ending at index 7.
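A minimal Python sketch of the \B examples, again assuming Python's re module rather than any engine named by this page:

```python
import re

pattern = re.compile(r"\bdog\B")

# "dog" followed by a space: the position after "g" is a word boundary, so \B fails.
standalone = pattern.search("The dog plays in the yard.")

# "dog" followed by "g": both neighbours are word characters, so \B succeeds.
prefix = pattern.search("The doggie plays in the yard.")

print(standalone)     # None
print(prefix.span())  # (4, 7)
```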
