Regex for Python developers. Introduction | by Arnav Goel

Literal Characters

The most basic regular expression consists of a single literal character, such as a. It matches the first occurrence of that character in the string. Beyond literals, regex has escape sequences for non-printable characters and shorthand sequences for character classes and positions. The list below is not comprehensive, and we will cover more of them in the future:

\t      tab character (ASCII 0x09)
\r      carriage return (ASCII 0x0D)
\n      line feed (ASCII 0x0A)
\d      Matches any Unicode decimal digit. Includes [0-9] and many other digit characters.
\D      Matches any character which is not a decimal digit. Opposite of \d.
\s      Matches all whitespace characters.
\S      Matches characters that are not whitespace.
\w      Matches word characters, including alphanumeric characters and the underscore.
        If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
\W      Matches any character which is not a word character.
        This is the opposite of \w.
\b      word boundary
\B      Not a word boundary
^       Beginning of a string
$       End of a string
\A      Matches only at the start of the string.
\Z      Matches only at the end of the string.
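
To make a few of these sequences concrete, here is a minimal sketch using Python's built-in re module. The sample text and variable names are made up for illustration, and the patterns are written as raw strings (the r prefix) so that the backslashes reach the regex engine unchanged.

```python
import re

# Hypothetical sample text containing a tab and a trailing line feed
text = "Order 66:\tship 3 crates\n"

digits = re.findall(r"\d+", text)   # runs of decimal digits
words = re.findall(r"\w+", text)    # runs of word characters
spaces = re.findall(r"\s", text)    # individual whitespace characters

# \A anchors the pattern to the very start of the string
starts_with_order = re.search(r"\AOrder", text) is not None

print(digits)             # ['66', '3']
print(words)              # ['Order', '66', 'ship', '3', 'crates']
print(spaces)             # [' ', '\t', ' ', ' ', '\n']
print(starts_with_order)  # True
```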

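Python string literals also treat the backslash as an escape character, so a pattern such as "\b" or "\n" is rewritten by Python before the regex engine ever sees it. The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline.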
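
The difference between raw and regular string literals can be checked directly. The sketch below is only an illustration; the sample strings are invented.

```python
import re

raw = r"\n"     # two characters: a backslash and the letter n
cooked = "\n"   # one character: a line feed

print(len(raw), len(cooked))  # 2 1

# Both spellings end up matching a line feed, because the regex engine
# itself understands the two-character escape \n.
same_result = re.findall(raw, "a\nb") == re.findall(cooked, "a\nb")
print(same_result)  # True

# The difference shows with \b: in a regular string it is a backspace
# character, while in a raw string it reaches the regex engine as a
# word boundary.
plain_matches = re.findall("\bcat\b", "cat concat")
raw_matches = re.findall(r"\bcat\b", "cat concat")
print(plain_matches)  # []
print(raw_matches)    # ['cat']
```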

Special Characters

There are 12 special characters that are reserved so that we can express more patterns in regex. These special characters are often called “metacharacters”. Several of them raise an error when used alone as a pattern; to match one of them literally, escape it with a backslash.

. (period or dot)    In the default mode, this matches any character
                     except a newline. If the DOTALL flag has been specified,
                     this matches any character including a newline.

\ (backslash)        Either escapes special characters (permitting you to match
                     characters like '*', '?'), or signals a special sequence.

^ (caret)            Matches the start of the string, and in MULTILINE mode
                     also matches immediately after each newline.

$ (dollar sign)      Matches the end of the string or just before the
                     newline at the end of the string, and in MULTILINE
                     mode also matches before a newline.

| (vertical bar)     A|B, where A and B can be arbitrary REs, creates a regular
                     expression that will match either A or B.

? (question mark)    Causes the resulting RE to match 0 or 1 repetitions of
                     the preceding RE. ab? will match either ‘a’ or ‘ab’.

* (asterisk)         Causes the resulting RE to match 0 or more repetitions
                     of the preceding RE, as many repetitions as are possible.
                     ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

+ (plus sign)        Causes the resulting RE to match 1 or more repetitions of
                     the preceding RE. ab+ will match ‘a’ followed by any
                     non-zero number of ‘b’s; it will not match just ‘a’.

( ) (parentheses)    Matches whatever regular expression is inside the
                     parentheses, and indicates the start and end of a group;
                     the contents of a group can be retrieved after a match
                     has been performed, and can be matched later in the string.

[ (square bracket)   Used to indicate a set of characters.

{ (curly brace)      Specifies the number of copies of the previous RE to be matched.
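
A short sketch of how a few of these metacharacters behave in Python's re module, including the DOTALL and MULTILINE flags mentioned above. The sample strings are invented for illustration.

```python
import re

text = "first line\nsecond line"

# . stops at a newline unless the DOTALL flag is set
dot_default = re.match(r".+", text).group()
dot_all = re.match(r".+", text, flags=re.DOTALL).group()

# ^ matches only at the start of the string unless MULTILINE is set
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# | chooses between two alternatives
either = re.findall(r"first|second", text)

# A backslash turns a metacharacter ($ and . here) into a literal character
price = re.search(r"\$\d+\.\d+", "total: $12.50").group()

print(dot_default)   # first line
print(repr(dot_all)) # 'first line\nsecond line'
print(line_starts)   # ['first', 'second']
print(either)        # ['first', 'second']
print(price)         # $12.50
```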

Quantifiers — * + ? and {}

* matches 0 or more repetitions of the preceding RE
+ matches 1 or more repetitions of the preceding RE
? causes the resulting RE to match 0 or 1 repetitions of the preceding RE
{m} specifies that exactly m copies of the previous RE should be matched
{m,n} causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible (write it without a space after the comma)

abc*        matches a string that has ab followed by zero or more c
abc+        matches a string that has ab followed by one or more c
abc?        matches a string that has ab followed by zero or one c
abc{2}      matches a string that has ab followed by 2 c
abc{2,}     matches a string that has ab followed by 2 or more c
abc{2,5}    matches a string that has ab followed by 2 up to 5 c
a(bc)*      matches a string that has a followed by zero or more copies of the sequence bc
a(bc){2,5}  matches a string that has a followed by 2 up to 5 copies of the sequence bc
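
The examples above can be checked with a small sketch. It uses re.fullmatch so that the whole candidate string has to match the pattern; the helper name fully_matches is made up for this illustration.

```python
import re

def fully_matches(pattern, candidate):
    """Return True when the whole candidate string matches the pattern."""
    return re.fullmatch(pattern, candidate) is not None

print(fully_matches(r"abc*", "ab"))            # True: zero c is allowed
print(fully_matches(r"abc+", "ab"))            # False: at least one c is required
print(fully_matches(r"abc?", "abcc"))          # False: at most one c
print(fully_matches(r"abc{2}", "abcc"))        # True: exactly two c
print(fully_matches(r"abc{2,}", "abccccc"))    # True: two or more c
print(fully_matches(r"abc{2,5}", "abcccccc"))  # False: six c is too many
print(fully_matches(r"a(bc)*", "abcbc"))       # True: the group repeats as a unit
print(fully_matches(r"a(bc){2,5}", "abc"))     # False: needs at least two copies of bc

# With a space after the comma, Python treats the braces as literal text
print(fully_matches(r"abc{2, 5}", "abccc"))    # False
```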

Greedy and Lazy match

The quantifiers (* + ? {}) are greedy operators, so they expand the match as far as they can through the provided text.

For example, <.+> matches the whole of <div>simple div</div> in This is a <div>simple div</div> test.

In order to catch only the div tag we can use a ? to make it lazy:

<.+?> matches any character one or more times, but as few times as possible,
inside < and >, so it stops at the first closing >

Notice that a better solution should avoid the usage of . in favor of a more strict regex:

<[^<>]+> same as <.+?> but stricter
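
The three patterns can be compared side by side with re.findall. This is a minimal sketch on the sample sentence from this section.

```python
import re

text = "This is a <div>simple div</div> test."

greedy = re.findall(r"<.+>", text)      # expands to the last > it can reach
lazy = re.findall(r"<.+?>", text)       # stops at the first > it meets
strict = re.findall(r"<[^<>]+>", text)  # never crosses another < or >

print(greedy)  # ['<div>simple div</div>']
print(lazy)    # ['<div>', '</div>']
print(strict)  # ['<div>', '</div>']
```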

Anchors — ^ and $

^The        matches any string that starts with "The"
end$        matches a string that ends with "end"
^The end$   starts with "The end" and ends with "The end" (exact match)
roar        matches any string that has the text roar in it

A good example is checking whether a string consists only of decimal digits, using ^\d+$. On its own, \d+ matches one or more digits anywhere in the string, so it would find a match in abcd4cd as well. Adding ^ before it and $ after it forces the digits to span the whole string, making it an exact match.
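
A sketch of this check in Python follows. The helper name is_all_digits is hypothetical. Note that $ also matches just before a trailing newline; re.fullmatch is shown as a possible alternative when that is not wanted.

```python
import re

def is_all_digits(candidate):
    """Return True when the anchored pattern ^\\d+$ matches the candidate."""
    return re.search(r"^\d+$", candidate) is not None

# Without anchors, the pattern finds the digit inside a longer string
unanchored = re.search(r"\d+", "abcd4cd").group()
print(unanchored)                # 4

print(is_all_digits("abcd4cd"))  # False
print(is_all_digits("2024"))     # True

# $ matches before a newline at the very end, so this still passes
print(is_all_digits("2024\n"))   # True

# re.fullmatch requires the whole string, trailing newline included
strict = re.fullmatch(r"\d+", "2024\n") is not None
print(strict)                    # False
```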

Grouping and capturing

Parenthesis examples — Matches the exact regular expression inside the parentheses

1. a(bc)           parentheses create a capturing group with value bc
2. a(?:bc)*        using ?: we disable the capturing group
3. a(?P<foo>bc)    using ?P<foo> we put a name to the group (Python's syntax for named groups)
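
Here is a minimal sketch of the three kinds of group in Python, plus a backreference to show how a group "can be matched later in the string". Python's re module spells named groups as (?P<name>...); the sample strings are invented.

```python
import re

# 1. Capturing group: group(1) holds what the parentheses matched
captured = re.search(r"a(bc)", "xabcx")
print(captured.group(0), captured.group(1))  # abc bc

# 2. Non-capturing group: it groups for repetition but stores nothing
grouped = re.search(r"a(?:bc)*", "abcbc")
print(grouped.group(0), grouped.groups())    # abcbc ()

# 3. Named group: the value is available by name
named = re.search(r"a(?P<foo>bc)", "abc")
print(named.group("foo"), named.groupdict()) # bc {'foo': 'bc'}

# Backreference: \1 must repeat whatever group 1 matched
repeated = re.search(r"(\w+) \1", "say hello hello")
print(repeated.group(0))                     # hello hello
```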

Bracket expressions — Matches any single character that is listed inside the brackets.

[abc]            matches a string that has either an a or a b or a c
[a-c]            same as previous
[a-fA-F0-9]      a string that represents a single hexadecimal digit, case insensitively
[0-9]%           a string that has a character from 0 to 9 before a % sign
[^a-zA-Z]        a string that has a character that is not a letter from a to z or from A to Z (^ is used as negation)
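
The bracket expressions above can be tried out as follows. This is a sketch on invented sample strings; the + added to the hexadecimal pattern is an extension for checking a whole string and is not part of the table above.

```python
import re

abc_set = re.findall(r"[abc]", "aXbYcZ")
abc_range = re.findall(r"[a-c]", "aXbYcZ")
print(abc_set, abc_range)  # ['a', 'b', 'c'] ['a', 'b', 'c']

# One hexadecimal digit per bracket; + repeats it across the whole string
is_hex = re.fullmatch(r"[a-fA-F0-9]+", "DeadBeef42") is not None
print(is_hex)              # True

# Only the single digit directly before % is part of the match
percents = re.findall(r"[0-9]%", "up 5% then 12%")
print(percents)            # ['5%', '2%']

# ^ at the start of the brackets negates the set
non_letters = re.findall(r"[^a-zA-Z]", "a1 b2")
print(non_letters)         # ['1', ' ', '2']
```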

Boundaries — \b and \B

\b represents an anchor like the caret (it is similar to $ and ^), matching positions where one side is a word character (like \w) and the other side is not a word character (for instance, it may be the beginning of the string or a space character).

\babc\b      performs a "whole words only" search

It comes with its negation, \B. This matches all positions where \b doesn’t match, and can be used if we want to find a search pattern fully surrounded by word characters.

\Babc\B      matches only if pattern is fully surrounded by word characters
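
A small sketch makes the difference visible by printing where each pattern matches. The sample text is made up; it contains abc as a whole word, inside another word, and at the start of a longer word.

```python
import re

text = "abc xabcx abcd"

# \b on both sides: only the standalone word qualifies
whole_words = [m.span() for m in re.finditer(r"\babc\b", text)]

# \B on both sides: only the occurrence with word characters on each side
inner_only = [m.span() for m in re.finditer(r"\Babc\B", text)]

print(whole_words)  # [(0, 3)]
print(inner_only)   # [(5, 8)]
```

The abc at the start of abcd matches neither pattern: it has a word boundary on its left but a word character on its right.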
