Regular Expression Notes

Compile a string into a pattern

Regular expressions are one of the basic skills for writing a web crawler.

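import re

text_to_search = 'abc abcabc'  # placeholder sample text, substitute your own
pattern = re.compile(r'abc')  # compile the string into a pattern object once so it can be reused
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)  # each match object shows the matched text and its span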
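
To pull the span and the matched text out of each match object instead of printing the whole object, something like the sketch below should work. The sample text is made up for illustration.

```python
import re

text_to_search = 'abc xabcx abc'  # made-up sample text
pattern = re.compile(r'abc')

# Collect the (start, end) offsets and the matched text for every hit.
spans = [m.span() for m in pattern.finditer(text_to_search)]
texts = [m.group() for m in pattern.finditer(text_to_search)]

print(spans)  # [(0, 3), (5, 8), (10, 13)]
print(texts)  # ['abc', 'abc', 'abc']
```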

Metacharacters that need to be escaped (add \ in front) to be matched literally

. ^ $ * + ? [ ] { } \ | ( )
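
A small sketch of what escaping changes, comparing an unescaped dot with an escaped one. It also uses `re.escape`, which adds the backslashes automatically when a string is meant literally. The sample strings are made up for this example.

```python
import re

sample = 'coreyms.com coreyms-com'  # made-up sample text

# Unescaped dot: '.' matches any character, so both spellings match.
unescaped = re.findall(r'coreyms.com', sample)

# Escaped dot: '\.' matches only a literal dot.
escaped = re.findall(r'coreyms\.com', sample)

# re.escape() escapes every metacharacter in a literal string.
literal = re.escape('1+1=2?')
auto = re.findall(literal, 'Is 1+1=2? Yes.')

print(unescaped)  # ['coreyms.com', 'coreyms-com']
print(escaped)    # ['coreyms.com']
print(auto)       # ['1+1=2?']
```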

The special characters

<!---
.       - Any Character Except New Line
\d      - Digit (0-9)
\D      - Not a Digit (0-9)
\w      - Word Character (a-z, A-Z, 0-9, _)
\W      - Not a Word Character
\s      - Whitespace (space, tab, newline)
\S      - Not Whitespace (space, tab, newline)

\b      - Word Boundary
\B      - Not a Word Boundary
^       - Beginning of a String
$       - End of a String

[]      - Matches Characters in brackets
[^ ]    - Matches Characters NOT in brackets
|       - Either Or
( )     - Group

Quantifiers:
*       - 0 or More
+       - 1 or More
?       - 0 or One
{3}     - Exact Number
{3,4}   - Range of Numbers (Minimum, Maximum)
-->
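
A quick sketch combining several entries from the table above: `\d`, a character set, `{n}` quantifiers and `\b`. The sample sentence and variable names are made up for illustration.

```python
import re

text = 'Call 321-555-4321 or 123.555.1234, not 12-34.'  # made-up sample

# Three digits, a dash or dot, three digits, a dash or dot, four digits.
phone = re.compile(r'\d{3}[-.]\d{3}[-.]\d{4}')
numbers = phone.findall(text)

# [A-Za-z]+ grabs runs of letters; \b keeps each match to a whole word.
words = re.findall(r'\b[A-Za-z]+\b', text)

print(numbers)  # ['321-555-4321', '123.555.1234']
print(words)    # ['Call', 'or', 'not']
```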

The code block below shows how a word boundary works

text_to_search = 'ha haha'
pattern = re.compile(r'\bha')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)  # only two matches: the 'ha' at index 0 and the first 'ha' of 'haha'
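
For contrast, here is a sketch that runs `\b` and `\B` on the same text and collects the spans, so it is easy to see which 'ha' each one picks.

```python
import re

text_to_search = 'ha haha'

# \bha: 'ha' must start at a word boundary -> index 0 and index 3.
at_boundary = [m.span() for m in re.finditer(r'\bha', text_to_search)]

# \Bha: 'ha' must NOT start at a word boundary -> only the last 'ha'.
not_at_boundary = [m.span() for m in re.finditer(r'\Bha', text_to_search)]

print(at_boundary)      # [(0, 2), (3, 5)]
print(not_at_boundary)  # [(5, 7)]
```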

Some examples


<!---
\d\d\d               three digits in a row

[-.]                 matches a dash or a dot

[A-Za-z0-9]          still only match one character

[^A-Z]               ^ at the start of [] negates the set: matches any character that is not in the set

\d{3}                matches three digits

M(r|s|rs)            matches Mr, Ms or Mrs (alternatives are tried left to right; Mrs is only reached if the rest of the pattern requires it)

(www\.)?             the entire group is optional

(www\.)?(\w+)(\.\w+)      match.group(0) is the whole match (the same text print(match) shows), match.group(1) is the first group
-->
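
A minimal sketch of how the group numbers map onto a URL, using the three-group pattern from the last example with an `https?://` prefix added in front. The two URLs are sample inputs.

```python
import re

url = 'https://www.nasa.gov'
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
match = pattern.search(url)

print(match.group(0))  # whole match: https://www.nasa.gov
print(match.group(1))  # www.
print(match.group(2))  # nasa
print(match.group(3))  # .gov

# An optional group that did not take part in the match is None.
no_www = pattern.search('http://coreyms.com')
print(no_www.group(1))  # None
```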

urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

subbed_urls = pattern.sub(r'\2\3', urls)
# every time it finds a match, the match is substituted by its 2nd and 3rd groups

print(subbed_urls)

google.com
coreyms.com
youtube.com
nasa.gov

Add a flag to ignore case:

re.compile(r's', re.I)
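
A short sketch of the flag in action, comparing a case-sensitive search with the same pattern compiled with `re.I` (the short form of `re.IGNORECASE`). The sample text is made up.

```python
import re

text = 'Start, stop, START'  # made-up sample text

# Without a flag the match is case-sensitive.
sensitive = re.findall(r'start', text)

# With re.I the same pattern matches regardless of case.
insensitive = re.compile(r'start', re.I).findall(text)

print(sensitive)    # []
print(insensitive)  # ['Start', 'START']
```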