Regular Expression Notes

Compile a string into a pattern

Regular expressions are one of the basic skills for writing a web crawler.

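import re

text_to_search = 'abcdef abc'  # sample text, assumed here for illustration

pattern = re.compile(r'abc')  # compile the string into a pattern object; faster when reused
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)  # shows each match and its span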

Metacharacters that need to be escaped (prefix with \ ):

. ^ $ * + ? [ ] { } \ | ( )
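
A small sketch of why escaping matters (the sample string and variable names are made up for illustration): an unescaped `.` matches any character, while `\.` matches only a literal dot. The standard library's `re.escape` adds the backslashes for you when the pattern comes from a literal string.

```python
import re

sample = 'coreyms.com coreyms-com'  # hypothetical sample text

# Unescaped dot: matches any character, so both variants match
unescaped = re.findall(r'coreyms.com', sample)

# Escaped dot: matches only the literal '.'
escaped = re.findall(r'coreyms\.com', sample)

# re.escape builds the escaped pattern from a literal string
auto_escaped = re.findall(re.escape('coreyms.com'), sample)

print(unescaped)     # ['coreyms.com', 'coreyms-com']
print(escaped)       # ['coreyms.com']
print(auto_escaped)  # ['coreyms.com']
```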

The special characters

<!---
. - Any Character Except New Line
\d - Digit (0-9)
\D - Not a Digit (0-9)
\w - Word Character (a-z, A-Z, 0-9, _)
\W - Not a Word Character
\s - Whitespace (space, tab, newline)
\S - Not Whitespace (space, tab, newline)

\b - Word Boundary
\B - Not a Word Boundary
^ - Beginning of a String
$ - End of a String

[] - Matches Characters in brackets
[^ ] - Matches Characters NOT in brackets
| - Either Or
( ) - Group

Quantifiers:
* - 0 or More
+ - 1 or More
? - 0 or One
{3} - Exact Number
{3,4} - Range of Numbers (Minimum, Maximum)
-->
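
To see a few of these characters in action, here is a minimal sketch. The sample sentence is invented for illustration and is not from the original notes.

```python
import re

sample = 'Order 42 shipped to room_7 on 2024-01-15'  # made-up text

# \d+ : runs of one or more digits
digits = re.findall(r'\d+', sample)

# \w+ : runs of word characters (letters, digits, underscore)
words = re.findall(r'\w+', sample)

# {4} and {2} : exact repetition counts
dates = re.findall(r'\d{4}-\d{2}-\d{2}', sample)

# ^ : anchors the match to the beginning of the string
starts_with_order = re.findall(r'^Order', sample)

print(digits)             # ['42', '7', '2024', '01', '15']
print(words)
print(dates)              # ['2024-01-15']
print(starts_with_order)  # ['Order']
```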

The code block below shows how a word boundary works:

import re

text_to_search = 'ha haha'
pattern = re.compile(r'\bha')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)  # only two matches: the 'ha' at the start of each word
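
For contrast, a short sketch comparing `\b` with `\B` on the same string. The list comprehensions and variable names are only one way to collect the spans, chosen here for illustration.

```python
import re

text_to_search = 'ha haha'

# \bha : 'ha' that starts at a word boundary
at_boundary = [m.span() for m in re.finditer(r'\bha', text_to_search)]

# \Bha : 'ha' that does NOT start at a word boundary
not_at_boundary = [m.span() for m in re.finditer(r'\Bha', text_to_search)]

print(at_boundary)      # [(0, 2), (3, 5)]
print(not_at_boundary)  # [(5, 7)]
```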

Some examples

<!---
\d\d\d three digits in a row

[-.] matches a dash or a dot

[A-Za-z0-9] still matches only one character

[^A-Z] ^ inside [] negates the set, so it matches any character that is not in the set

\d{3} matches three digits

M(r|s|rs) matches Mr, Ms and Mrs

(www\.)? the entire group is optional

(www\.)?(\w+)(\.\w+) match.group(0) is the entire matched text, match.group(1) is the first group
-->
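
A hedged sketch that combines several of the patterns above. The names and phone numbers are invented sample data, and `name_pattern` is just an illustrative variable name.

```python
import re

# Made-up sample data for illustration
sample = '''
Mr Smith 321-555-4321
Ms Davis 123.555.1234
Mrs Robinson 800-555-9999
'''

# Three digits, a dash or dot, three digits, a dash or dot, four digits
phones = re.findall(r'\d{3}[-.]\d{3}[-.]\d{4}', sample)

# Mr, Ms or Mrs, a space, then a capitalised name
name_pattern = re.compile(r'M(r|s|rs) ([A-Z]\w*)')
names = [(m.group(0), m.group(1), m.group(2))
         for m in name_pattern.finditer(sample)]

print(phones)
for full, title_suffix, surname in names:
    # group(0) is the whole match; group(1) and group(2) are the two groups
    print(full, '|', title_suffix, '|', surname)
```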
import re

urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

# Every time a match is found, it is substituted by the 2nd and 3rd groups
subbed_urls = pattern.sub(r'\2\3', urls)

print(subbed_urls)

google.com
coreyms.com
youtube.com
nasa.gov
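
To make the group numbering behind `\2\3` concrete, here is a small sketch that inspects the groups of individual matches using the same URL pattern. The variable names `with_www` and `without_www` are illustrative only.

```python
import re

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

with_www = pattern.search('https://www.nasa.gov')
without_www = pattern.search('http://coreyms.com')

print(with_www.group(0))     # https://www.nasa.gov  (the entire match)
print(with_www.group(1))     # www.
print(with_www.group(2))     # nasa
print(with_www.group(3))     # .gov

# An optional group that did not take part in the match is None
print(without_www.group(1))  # None
print(without_www.group(2) + without_www.group(3))  # coreyms.com
```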

Add a flag to ignore case:

re.compile(r's', re.I)
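
A quick sketch of the effect of the flag, using an invented sample string. `re.I` is the short form of `re.IGNORECASE`.

```python
import re

sample = 'Start START start'  # made-up text

# Without the flag, only the exact-case occurrence matches
case_sensitive = re.findall(r'start', sample)

# With re.I, every casing matches
case_insensitive = re.compile(r'start', re.I).findall(sample)

print(case_sensitive)    # ['start']
print(case_insensitive)  # ['Start', 'START', 'start']
```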