![](https://crypto4nerd.com/wp-content/uploads/2023/06/0T2ywHX7_cY0uGJkU-1024x686.jpg)
Literal Characters
The most basic regular expression consists of a single literal character, such as a. It matches the first occurrence of that character in the string. There are also non-printable characters and special sequences to keep in mind, each written with a leading backslash. The list below is not comprehensive, and we will talk about more of them in the future:
\t tab character (ASCII 0x09)
\r carriage return (ASCII 0x0D)
\n line feed (ASCII 0x0A)
\d Matches any Unicode decimal digit. Includes [0-9] and many other digit characters.
\D Matches any character which is not a decimal digit. Opposite of \d.
\s Matches all whitespace characters.
\S Matches characters that are not whitespace.
\w Matches word characters, including alphanumeric characters and the underscore.
If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
\W Matches any character which is not a word character.
This is the opposite of \w.
\b Word boundary
\B Not a word boundary
^ Beginning of a string
$ End of a string
\A Matches only at the start of the string.
\Z Matches only at the end of the string.
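To make the shorthand classes above concrete, here is a minimal sketch using Python's re module. The sample text and variable names are made up for illustration.

```python
import re

# Illustrative sample text: contains a space, a tab and a line feed.
text = "Order 66:\tshipped\n"

digits = re.findall(r"\d", text)        # every decimal digit
words = re.findall(r"\w+", text)        # runs of word characters
spaces = re.findall(r"\s", text)        # every whitespace character
prefix = re.match(r"\D+", text).group() # leading run of non-digits

print(digits)  # ['6', '6']
print(words)   # ['Order', '66', 'shipped']
print(spaces)  # [' ', '\t', '\n']
print(prefix)  # 'Order '

# With the ASCII flag, \w only covers [a-zA-Z0-9_], so the accented
# letter is no longer treated as a word character.
unicode_words = re.findall(r"\w+", "caf\u00e9")
ascii_words = re.findall(r"\w+", "caf\u00e9", re.ASCII)
print(unicode_words)  # ['café']
print(ascii_words)    # ['caf']
```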
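Regular expressions use the backslash to introduce these sequences, but Python string literals also use the backslash for their own escapes, so the two can collide. The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline.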
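A short sketch of why the raw string prefix matters. The word-boundary pattern is just one convenient example where the difference becomes visible; the sample sentence is made up.

```python
import re

plain = "\n"   # one character: a line feed
raw = r"\n"    # two characters: a backslash and the letter n

print(len(plain))  # 1
print(len(raw))    # 2

# In a normal string literal, "\b" is the backspace character, so the
# regex engine never sees a word-boundary sequence and nothing matches.
without_raw = re.findall("\bcat\b", "cat concat")

# In a raw string the backslash reaches the regex engine untouched.
with_raw = re.findall(r"\bcat\b", "cat concat")

print(without_raw)  # []
print(with_raw)     # ['cat']
```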
Special Characters
There are 12 special characters that are reserved so that we can access more patterns in regex. These special characters are often called “metacharacters”. Most of them are errors when used alone.
. (period or dot) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
\ (backslash) Either escapes special characters (permitting you to match characters like '*', '?'), or signals a special sequence.
^ (caret) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
$ (dollar sign) Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline.
| (vertical bar) A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B.
? (question mark) Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.
* (asterisk) Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.
+ (plus sign) Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
( ) (parentheses) Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string.
[ (square bracket) Used to indicate a set of characters.
{ (curly brace) Specifies the number of copies of the previous RE to be matched.
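The behaviour of a few of these metacharacters can be checked directly in Python. This is a small sketch with made-up sample strings, not an exhaustive tour.

```python
import re

# The dot skips newlines unless DOTALL is set.
dot_default = re.findall(r"a.c", "abc a\nc")
dot_all = re.findall(r"a.c", "abc a\nc", re.DOTALL)
print(dot_default)  # ['abc']
print(dot_all)      # ['abc', 'a\nc']

# A backslash turns a metacharacter back into a literal.
question_marks = re.findall(r"\?", "why? how?")
print(question_marks)  # ['?', '?']

# The vertical bar matches either alternative.
pets = re.findall(r"cat|dog", "cat, dog, bird")
print(pets)  # ['cat', 'dog']

# The caret matches after each newline only in MULTILINE mode.
first_words = re.findall(r"^\w+", "one\ntwo")
line_words = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)
print(first_words)  # ['one']
print(line_words)   # ['one', 'two']

# A quantifier with nothing to repeat is rejected as a pattern.
try:
    re.compile("*")
    lone_star_is_error = False
except re.error:
    lone_star_is_error = True
print(lone_star_is_error)  # True
```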
Quantifiers — * + ? and {}
- * matches 0 or more repetitions of the preceding RE
- + matches 1 or more repetitions of the preceding RE
- ? matches 0 or 1 repetitions of the preceding RE
- {m} specifies that exactly m copies of the previous RE should be matched
- {m,n} matches from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible
abc* matches a string that has ab followed by zero or more c
abc+ matches a string that has ab followed by one or more c
abc? matches a string that has ab followed by zero or one c
abc{2} matches a string that has ab followed by 2 c
abc{2,} matches a string that has ab followed by 2 or more c
abc{2,5} matches a string that has ab followed by 2 up to 5 c
a(bc)* matches a string that has a followed by zero or more copies of the sequence bc
a(bc){2,5} matches a string that has a followed by 2 up to 5 copies of the sequence bc
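The quantifier examples above can be tried out with a small helper. This is a sketch: the helper name matching and the candidate strings are invented for illustration, and re.fullmatch is used so that the whole candidate has to fit the pattern, which makes the repetition counts easy to see.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# "ab" followed by 0, 1, 2, 3 and 6 copies of "c".
c_strings = ["ab" + "c" * n for n in (0, 1, 2, 3, 6)]

star = matching(r"abc*", c_strings)         # zero or more c
plus = matching(r"abc+", c_strings)         # one or more c
optional = matching(r"abc?", c_strings)     # zero or one c
exactly_two = matching(r"abc{2}", c_strings)
two_or_more = matching(r"abc{2,}", c_strings)
two_to_five = matching(r"abc{2,5}", c_strings)

print(optional)     # ['ab', 'abc']
print(exactly_two)  # ['abcc']
print(two_to_five)  # ['abcc', 'abccc']

# Parentheses make the quantifier apply to the whole sequence "bc".
bc_strings = ["a", "ab", "abc", "abcbc", "abcbcbc"]
group_star = matching(r"a(bc)*", bc_strings)
group_range = matching(r"a(bc){2,5}", bc_strings)

print(group_star)   # ['a', 'abc', 'abcbc', 'abcbcbc']
print(group_range)  # ['abcbc', 'abcbcbc']
```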
Greedy and Lazy match
The quantifiers (* + ? {}) are greedy operators, so they expand the match as far as they can through the provided text.
For example, <.+> matches <div> simple div</div> in This is a <div> simple div</div> test.
In order to catch only the div tag we can use a ? to make it lazy:
<.+?> matches any character one or more times included inside < and >, expanding only as far as needed
Notice that a better solution should avoid the usage of . in favor of a more strict regex:
<[^<>]+> same as <.+?> but stricter, because the characters between < and > cannot themselves be < or >
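The three patterns can be compared side by side on the sentence from the example. The second input, "<><i>", is an extra made-up string that shows where the lazy pattern and the stricter one disagree.

```python
import re

text = "This is a <div> simple div</div> test"

greedy = re.findall(r"<.+>", text)      # runs to the last ">"
lazy = re.findall(r"<.+?>", text)       # stops at the first ">"
strict = re.findall(r"<[^<>]+>", text)  # no angle brackets inside

print(greedy)  # ['<div> simple div</div>']
print(lazy)    # ['<div>', '</div>']
print(strict)  # ['<div>', '</div>']

# The lazy dot can still swallow angle brackets when it has to;
# the character class cannot.
lazy_odd = re.findall(r"<.+?>", "<><i>")
strict_odd = re.findall(r"<[^<>]+>", "<><i>")

print(lazy_odd)    # ['<><i>']
print(strict_odd)  # ['<i>']
```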
Anchors — ^ and $
^The matches any string that starts with "The"
end$ matches a string that ends with "end"
^The end$ matches a string that both starts and ends with "The end" (exact match)
roar matches any string that has the text roar in it
A good example is checking whether a string consists only of decimal digits: ^\d+$.
On its own, \d+ only requires one or more digits somewhere in the string, so it would match abcd4cd as well. So we want to add ^ before it and $ after it to make it an exact match.
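Here is a minimal sketch of the digit check in Python. The sample inputs are made up. The last two lines show one caveat of $ that follows from its definition (it also matches just before a trailing newline); re.fullmatch is one way to avoid it.

```python
import re

# Without anchors, a digit anywhere in the string is enough.
loose = re.search(r"\d+", "abcd4cd")
print(loose.group())  # '4'

only_digits = re.compile(r"^\d+$")

accepts_number = only_digits.search("2023") is not None
accepts_mixed = only_digits.search("abcd4cd") is not None
print(accepts_number)  # True
print(accepts_mixed)   # False

# $ also matches just before a newline at the end of the string.
accepts_trailing_newline = only_digits.search("2023\n") is not None
fullmatch_trailing_newline = re.fullmatch(r"\d+", "2023\n") is not None
print(accepts_trailing_newline)    # True
print(fullmatch_trailing_newline)  # False
```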
Grouping and capturing
- Parenthesis examples — Matches the exact regular expression inside the parentheses
1. a(bc) parentheses create a capturing group with value bc
2. a(?:bc)* using ?: we disable the capturing group
3. a(?P<foo>bc) using ?P<foo> we put a name to the group (this is Python's syntax; some other regex flavors write ?<foo> instead)
- Bracket expressions — Match any one of the characters listed inside the brackets.
[abc] matches a string that has either an a or a b or a c
[a-c] same as previous
[a-fA-F0-9] a string that represents a single hexadecimal digit, case insensitively
[0-9]% a string that has a character from 0 to 9 before a % sign
[^a-zA-Z] a string that has a character that is not a letter from a to z or from A to Z. (^ is used as negation)
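The grouping and bracket examples can be sketched in Python as follows. The sample strings are invented, and the named group uses Python's (?P<name>...) form.

```python
import re

# Capturing group: the whole match and the group are both available.
m = re.search(r"a(bc)", "xabcx")
print(m.group(0))  # 'abc'
print(m.group(1))  # 'bc'

# Non-capturing group: still groups for the quantifier, captures nothing.
m_nc = re.search(r"a(?:bc)*", "abcbc")
print(m_nc.group(0))  # 'abcbc'
print(m_nc.groups())  # ()

# Named group: retrieved by name instead of by number.
m_named = re.search(r"a(?P<foo>bc)", "abc")
print(m_named.group("foo"))  # 'bc'

# A captured group can be matched again later with a backreference.
doubled = re.search(r"(\w)\1", "hello").group()
print(doubled)  # 'll'

# Bracket expressions match a single character from the set.
hex_digits = re.findall(r"[a-fA-F0-9]", "0x1F")
percents = re.findall(r"[0-9]%", "5% of 100%")
non_letters = re.findall(r"[^a-zA-Z]", "a1 b2")
print(hex_digits)   # ['0', '1', 'F']
print(percents)     # ['5%', '0%']
print(non_letters)  # ['1', ' ', '2']
```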
Boundaries — \b and \B
\b represents an anchor like the caret (it is similar to $ and ^), matching positions where one side is a word character (like \w) and the other side is not a word character (for instance, it may be the beginning of the string or a space character).
\babc\b performs a "whole words only" search
It comes with its negation, \B. This matches all positions where \b doesn’t match, and could be useful if we want to find a search pattern fully surrounded by word characters.
\Babc\B matches only if the pattern is fully surrounded by word characters
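A quick sketch of both boundary anchors in Python, run on a made-up string that contains abc as a whole word, as a prefix, and inside a longer word. The match positions are printed so it is clear which occurrence each pattern picks.

```python
import re

text = "abc abcd xabcx"

# Whole word only: word boundaries on both sides.
whole_word_starts = [m.start() for m in re.finditer(r"\babc\b", text)]

# Fully surrounded: word characters on both sides.
inner_starts = [m.start() for m in re.finditer(r"\Babc\B", text)]

print(whole_word_starts)  # [0]   -> the standalone "abc"
print(inner_starts)       # [10]  -> the "abc" inside "xabcx"
```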