Regexp - Word Characters



A word character is represented by the shorthand class \w and is defined as:

  • any letter (ie the class [A-Za-z])
  • or any digit (ie the class [0-9])
  • or the underscore character (ie the class [_])

With the default (ASCII) character tables, \w is therefore equivalent to the class [0-9A-Za-z_].

That is, any character which can be part of a Perl “word”.
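
As a quick check of this equivalence, here is a minimal sketch in Python (chosen only for illustration; the page itself does not name an engine). It assumes the re.ASCII flag, because Python's \w otherwise follows Unicode rules for str patterns. The sample string is made up.

```python
import re

# re.ASCII restricts \w to [a-zA-Z0-9_]; without it, Python applies
# Unicode rules to str patterns and \w would also match "é".
shorthand = re.compile(r"\w", re.ASCII)
explicit = re.compile(r"[0-9A-Za-z_]")

sample = "snake_case-42 é!"  # hypothetical input

shorthand_hits = shorthand.findall(sample)
explicit_hits = explicit.findall(sample)

# Both patterns pick out the same characters, in the same order.
print(shorthand_hits == explicit_hits)  # True
print("".join(shorthand_hits))          # snake_case42
```

The hyphen, the space, the accented letter and the exclamation mark are all skipped by both patterns.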

Definition of letters and digits versus Character Set

The definition of letters and digits is controlled by character tables, and may vary if locale-specific matching is taking place.

For example, in the “fr” (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
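
The effect of the character tables can be illustrated in Python, with one caveat: for str patterns Python does not consult the locale, so this sketch uses its Unicode mode versus its ASCII mode as a stand-in for locale-specific tables. The word "café" is just an example input.

```python
import re

word = "café"  # hypothetical input containing an accented letter

# Default mode for str patterns: \w follows Unicode, so "é" is a word character.
unicode_words = re.findall(r"\w+", word)

# ASCII mode: \w is limited to [a-zA-Z0-9_], so the match stops before "é".
ascii_words = re.findall(r"\w+", word, re.ASCII)

print(unicode_words)  # ['café']
print(ascii_words)    # ['caf']
```

The same pattern gives a different result depending on which definition of "letter" the engine uses, which is the point made above about locale-specific matching.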


Regular Expression - Boundary Matcher

Word boundary

A word boundary \b is a zero-width assertion that matches at a position if:

  • there is a \w (word character) on one side and a \W (non-word character) on the other
  • or there is a \w on one side and the position is the beginning or end of the string.

Example 1:

  • The regex: \bdog\b
  • Input string to search: The dog plays in the yard
  • Result: Found the text dog starting at index 4 and ending at index 7 (the end index is exclusive).

Example 2:

  • The regex: \bdog\b
  • Input string to search: The doggie plays in the yard.
  • Result: No match found.
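
The two searches with \bdog\b can be reproduced with a short Python sketch. The helper name find_span is hypothetical and only wraps re.search; the reported end index is exclusive, which is why it reads 7 rather than 6.

```python
import re

WORD_DOG = re.compile(r"\bdog\b")

def find_span(pattern, text):
    """Return (start, end) of the first match, or None; end is exclusive."""
    match = pattern.search(text)
    return match.span() if match else None

standalone = find_span(WORD_DOG, "The dog plays in the yard")
embedded = find_span(WORD_DOG, "The doggie plays in the yard.")

print(standalone)  # (4, 7)
print(embedded)    # None
```

In "doggie" the character after "dog" is another word character, so the trailing \b cannot match there.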

Non-word boundary

A non-word boundary \B is the negation of \b: a zero-width assertion that matches at every position where \b does not match.

Example 1:

  • The regex: \bdog\B
  • Input string to search: The dog plays in the yard.
  • Result: No match found.

Example 2:

  • The regex: \bdog\B
  • Input string to search: The doggie plays in the yard.
  • Result: Found the text dog starting at index 4 and ending at index 7.
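
The two \bdog\B searches can be checked the same way in Python. This is a sketch only; the variable names are made up for the illustration.

```python
import re

# \b before "dog": a word must start here.
# \B after "dog": the word must NOT end here.
DOG_PREFIX = re.compile(r"\bdog\B")

whole_word = DOG_PREFIX.search("The dog plays in the yard.")
word_prefix = DOG_PREFIX.search("The doggie plays in the yard.")

print(whole_word)          # None: "dog" is followed by a space, which is a word boundary
print(word_prefix.span())  # (4, 7): "dog" is followed by "g", so \B matches
```

Swapping the trailing \b for \B therefore inverts the results of the word boundary examples: the pattern now finds "dog" only when it is the start of a longer word.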
