Introduction to Regular Expressions
What is Regex?
Regular Expressions (regex) are symbolic notations that are used to identify patterns in text.
What can I do with Regex?
Regex is useful for identifying strings, patterns of characters, or words of particular interest. Once the matching strings are found, you can print them, replace them, or manipulate them in other ways.
Why should I learn Regex?
Knowing how to write and use regex gives you the tools to quickly identify, count, and manipulate strings of interest. Furthermore, regex gives you the flexibility as a programmer to perform tasks such as "find and replace" in milliseconds that would otherwise take minutes to do manually.
Looking to learn regex for the shell?
If you're looking to learn regular expressions for the shell or the Linux command line, head on over to our Command Line Regular Expressions tutorial page.
Matching Characters and Digits
One digit
Match any digit from 0 to 9 using \d.
To match any non-digit, use the upper-case \D.
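As a minimal sketch of the two classes above, assuming Python's built-in re module (the sample strings are made up for illustration):

```python
import re

text = "Order 66 shipped in 3 days"  # hypothetical sample text

# \d matches one digit at a time, so each digit is a separate match.
digits = re.findall(r"\d", text)

# \D matches every character that is not a digit.
non_digits = re.findall(r"\D", "a1b2")

print(digits)      # ['6', '6', '3']
print(non_digits)  # ['a', 'b']
```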
One character
Use the dot (.). This will match any single character: a letter, digit, or whitespace (in most engines, a newline is excluded by default). To match an actual period, escape it with a backslash (\.).
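The difference between the dot and an escaped dot can be sketched as follows, again assuming Python's re module and an invented sample string:

```python
import re

sample = "abc a1c a c a.c"  # hypothetical sample text

# An unescaped dot matches any single character (except a newline by default).
any_char = re.findall(r"a.c", sample)

# An escaped dot matches only a literal period.
literal_dot = re.findall(r"a\.c", sample)

print(any_char)     # ['abc', 'a1c', 'a c', 'a.c']
print(literal_dot)  # ['a.c']
```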
One alphanumeric character
To match any alphanumeric character, including the underscore, use \w. As you'll learn soon, this is the same as [a-zA-Z0-9_] for ASCII text.
Any non-alphanumeric character is \W.
This is especially helpful for matching punctuation.
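A short illustration, assuming Python's re module and a made-up input string:

```python
import re

text = "id_7: ok!"  # hypothetical sample text

# \w matches letters, digits, and the underscore.
word_chars = re.findall(r"\w", text)

# \W matches everything else, including punctuation and spaces.
non_word_chars = re.findall(r"\W", text)

print(word_chars)      # ['i', 'd', '_', '7', 'o', 'k']
print(non_word_chars)  # [':', ' ', '!']
```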
Boundary between word and non-word
\b matches the boundary between a word and a non-word character.
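One common use of \b is matching a whole word only. A small sketch, assuming Python's re module and an invented sample string:

```python
import re

text = "cat concat cats"  # hypothetical sample text

# Without boundaries, "cat" is also found inside longer words.
loose = re.findall(r"cat", text)

# \b on both sides keeps only the standalone word.
whole = re.findall(r"\bcat\b", text)

print(len(loose))  # 3
print(whole)       # ['cat']
```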
Matching one from set of characters
Place each character within square brackets.
[abc] - Match a, b or c.
Excluding matches to set of specific characters
Use the square brackets, just like above, with a hat (^) as the first character inside them.
[^abc] - Match any character except for a, b or c.
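Both the character set and its negated form can be tried out with a sketch like this one, assuming Python's re module (the inputs are illustrative only):

```python
import re

text = "a quick brown cab"  # hypothetical sample text

# [abc] matches any single a, b, or c.
in_set = re.findall(r"[abc]", text)

# [^abc] matches any single character that is not a, b, or c.
not_in_set = re.findall(r"[^abc]", "abcd!")

print(in_set)      # ['a', 'c', 'b', 'c', 'a', 'b']
print(not_in_set)  # ['d', '!']
```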
Range between two letters or two digits
There is a shortcut for specifying any range of letters or digits. Instead of writing [abcdefghijklmnopqrstuvwxyz], we can just write the range between two letters or digits, separated by a dash (-).
[0-4] - Matches digits from 0 to 4.
[a-z] - All lowercase letters from a to z.
[A-Z] - All uppercase letters from A to Z.
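The three ranges listed here behave as follows in a quick sketch, assuming Python's re module and made-up sample strings:

```python
import re

text = "Room B7, floor 3"  # hypothetical sample text

low_digits = re.findall(r"[0-4]", text)  # digits 0 through 4 only
lowercase = re.findall(r"[a-z]", "aBc")  # lowercase letters only
uppercase = re.findall(r"[A-Z]", text)   # uppercase letters only

print(low_digits)  # ['3']
print(lowercase)   # ['a', 'c']
print(uppercase)   # ['R', 'B']
```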
Whitespace
The whitespace special character \s matches a single whitespace character: a space, tab, newline, or carriage return.
To match any number of whitespace characters (including none), use \s*.
Any non-whitespace character is matched by \S.
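A minimal sketch of \s, \s* and \S, assuming Python's re module; the strings are invented for illustration:

```python
import re

text = "one \t two\nthree"  # hypothetical sample: space, tab, space, newline

# \s matches one whitespace character at a time.
underscored = re.sub(r"\s", "_", text)

# \s* allows any number of whitespace characters, including none.
with_gap = re.fullmatch(r"a\s*b", "a \t b") is not None
without_gap = re.fullmatch(r"a\s*b", "ab") is not None

# \S matches any character that is not whitespace.
visible = re.findall(r"\S", "a b\n")

print(underscored)            # one___two_three
print(with_gap, without_gap)  # True True
print(visible)                # ['a', 'b']
```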
Repetitions and Optionals
Reiterating a single regex expression
If we want to repeat a single expression, we can do so with braces ({}). Within the braces we can specify up to two numbers, separated by a comma.
{n} - Match the preceding element exactly n times.
{n,m} - at least n times, but no more than m times.
{n,} - n or more times.
{,m} - no more than m times (some engines require {0,m} instead).
Here are some concrete examples:
.{4,} - Match any character 4 or more times.
\d{5,10} - Match any digit between 5 and 10 times.
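The brace forms can be checked with a sketch like the following, assuming Python's re module. fullmatch is used here so that the whole string has to fit the pattern; the test strings are made up.

```python
import re

# {n}: exactly n repetitions.
exact = re.fullmatch(r"\d{4}", "2024") is not None

# {n,m}: between n and m repetitions.
in_range = re.fullmatch(r"\d{5,10}", "123456") is not None
too_short = re.fullmatch(r"\d{5,10}", "1234") is not None

# {n,}: n or more repetitions of any character.
long_enough = re.fullmatch(r".{4,}", "abcdef") is not None

print(exact, in_range, too_short, long_enough)  # True True False True
```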
To match an element zero or more times, use the Kleene star, which is an asterisk (*) placed after the element.
Match at least one element
To match the preceding element at least once, use the Kleene plus (+). This is stricter than the Kleene star, as it requires at least one occurrence of the preceding element.
Optional
To match an element zero or one time, use the question mark (?). To match an actual question mark, escape it (\?).
\d* - Match zero or more digits.
[a-z]+ - Match one or more lowercase letters.
[abc]? - Match one of the characters a, b or c zero times or once.
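To compare the star, the plus, and the question mark side by side, here is a small sketch assuming Python's re module. The "id" prefix and the "colou?r" pattern are illustrative choices, not part of the tutorial's examples.

```python
import re

# *: zero or more digits may follow "id".
star = [bool(re.fullmatch(r"id\d*", s)) for s in ["id", "id7", "id123"]]

# +: at least one lowercase letter is required, so "" fails.
plus = [bool(re.fullmatch(r"[a-z]+", s)) for s in ["", "a", "abc"]]

# ?: the "u" is optional, so both spellings match.
optional = re.findall(r"colou?r", "color colour")

print(star)      # [True, True, True]
print(plus)      # [False, True, True]
print(optional)  # ['color', 'colour']
```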
Conditionals
You can use conditionals to specify either-or relations.
Use the logical or (|) between two options.
(cats|dogs) - Matches either cats or dogs.
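A quick sketch of the alternation above, assuming Python's re module and an invented sentence:

```python
import re

text = "I like cats, dogs, and birds"  # hypothetical sample text

# The group matches whichever alternative is present at each position.
pets = re.findall(r"(cats|dogs)", text)

print(pets)  # ['cats', 'dogs']
```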
Anchors: Begins and Ends With
In some cases, you'll need to specify that a string starts and/or ends with certain characters. For example, if you were reading program output and needed to check whether it reads exactly "success", you couldn't just write a regex that matches "success", as "not success" and "success did not occur" would also be valid matches.
Begins and Ends with
To specify that a string should strictly start or end with certain characters, we can use the hat (^) or dollar sign ($).
^zip - Match all strings that start with zip.
zip$ - Match all strings that end with zip.
^z..i.p$ - Match all strings that start with z, have any two characters, then an i, then another character, and end with p.
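The anchor patterns above, together with the "success" scenario, can be sketched like this, assuming Python's re module. The word list is made up for illustration.

```python
import re

words = ["zipper", "unzip", "zip", "zaaibp"]  # hypothetical inputs

starts = [w for w in words if re.search(r"^zip", w)]
ends = [w for w in words if re.search(r"zip$", w)]
shaped = [w for w in words if re.search(r"^z..i.p$", w)]

# Anchoring both ends rejects strings that merely contain "success".
outputs = ["success", "not success", "success did not occur"]
exact = [s for s in outputs if re.search(r"^success$", s)]

print(starts)  # ['zipper', 'zip']
print(ends)    # ['unzip', 'zip']
print(shaped)  # ['zaaibp']
print(exact)   # ['success']
```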
Capture
Groups
Groupings are used when you want to match certain patterns, but only want to extract part of the information.
We can reference the captured segments later on.
To specify a grouping, use parentheses (()).
([cat])\s\1 - Matches "c c", "a a" or "t t" anywhere in the text.
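Here is a minimal sketch of both uses of a group, assuming Python's re module: extracting part of a match, and requiring the captured text to repeat via \1. The "total:" string is a hypothetical example.

```python
import re

# Extract only the captured part of a larger match.
amount = re.search(r"total: (\d+)", "total: 42 items").group(1)

# ([cat]) captures one of c, a, or t; \1 requires the same character again.
text = "c c, a t, t t"
repeated = [m.group(0) for m in re.finditer(r"([cat])\s\1", text)]

print(amount)    # 42
print(repeated)  # ['c c', 't t']
```

Note that "a t" is not matched, because \1 must repeat exactly what the group captured.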
Nested groups
You can also nest capture parentheses. The captured groups are numbered in the order in which their opening parentheses appear, from left to right.
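The numbering of nested groups can be seen in a short sketch, assuming Python's re module and a made-up date string:

```python
import re

# Group 1 is the outer pair of parentheses; groups 2 and 3 are nested inside it.
m = re.match(r"((\d{4})-(\d{2}))-(\d{2})", "2024-03-15")

print(m.groups())  # ('2024-03', '2024', '03', '15')
```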
Referencing
To reference the captured substrings, use a backslash (\) followed by the group number.
\0 - the full match (the syntax for this varies by engine)
\1 - group 1
\2 - group 2
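As one possible illustration, assuming Python's re module, group references can be used in a replacement string. Python does not use \0 for the full match in replacements; it spells it \g<0>. The names below are invented samples.

```python
import re

# Swap "first last" into "last, first" using \1 and \2.
swapped = re.sub(r"(\w+) (\w+)", r"\2, \1", "Ada Lovelace")

# In Python, the full match is written \g<0> inside a replacement string.
wrapped = re.sub(r"\d+", r"[\g<0>]", "room 42")

print(swapped)  # Lovelace, Ada
print(wrapped)  # room [42]
```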