Introduction to Regular Expressions in Python

From Sustainability Methods
 
Latest revision as of 12:58, 14 May 2024

THIS ARTICLE IS STILL IN EDITING MODE

Introduction

Regular expressions are sequences of characters that form a search pattern, mainly used for string searching and manipulation. In Python, they are implemented through the built-in 're' module.
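As a minimal sketch of searching and manipulation with the 're' module, the following uses re.search and re.sub on an invented sample sentence. The sentence and the variable names are assumptions made for illustration only.

```python
import re

# Invented sample text for illustration.
text = "Order 66 was shipped on 2024-05-14."

# Searching: re.search scans the string and returns the first match (or None).
match = re.search(r"\d{4}-\d{2}-\d{2}", text)
found_date = match.group() if match else None

# Manipulation: re.sub replaces every match; each run of digits becomes "#".
masked = re.sub(r"\d+", "#", text)

print(found_date)
print(masked)
```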

Basic Concepts

  • Patterns: Specific sequences of characters that represent a search criterion.
  • Metacharacters: Special characters that signify broader types of patterns (like '*', '+', '?').
  • Character Classes: Represent a set of characters; the class matches any one character from that set (like '\d', '\w', '\s').
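The three concepts above can be sketched side by side. The sample string is invented for illustration, and re.findall is used here simply because it returns every match as a list.

```python
import re

# Invented sample text for illustration.
sample = "cat, caat, ct, c4t"

# Pattern: a literal sequence of characters.
literal_hits = re.findall(r"cat", sample)

# Metacharacter: '*' lets the preceding 'a' repeat zero or more times.
star_hits = re.findall(r"ca*t", sample)

# Character class: '\d' matches any single digit.
class_hits = re.findall(r"c\dt", sample)

print(literal_hits, star_hits, class_hits)
```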

Use Cases

  • String Parsing: Extracting specific information from text.
  • Data Validation: Ensuring formats of data are correct (like emails, phone numbers, passwords).
  • Text Preprocessing: Used in natural language processing for tasks like tokenization and data cleaning.
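Each use case above might look roughly like this in practice. The phone format NNN-NNNN and the helper name is_valid_phone are hypothetical and deliberately simplistic; real validation rules are usually stricter.

```python
import re

# String parsing: pull the numeric values out of a sentence.
temps = re.findall(r"\d+", "Lows of 12 and highs of 27 degrees")

# Data validation: a hypothetical, deliberately simple phone format (NNN-NNNN).
def is_valid_phone(value):
    # re.fullmatch succeeds only if the whole string fits the pattern.
    return re.fullmatch(r"\d{3}-\d{4}", value) is not None

# Text preprocessing: lowercase the text, then split it into word tokens.
tokens = re.findall(r"\w+", "Regex makes cleaning EASY!".lower())

print(temps, is_valid_phone("555-0134"), tokens)
```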

Regular expression patterns

Special characters:

. - any character except a newline (\n)
^ - start of the string (or of each line when the re.MULTILINE flag is set)
$ - end of the string (or of each line when the re.MULTILINE flag is set)
* - the previous character repeats 0 or more times, equivalent to {0,}
+ - the previous character repeats 1 or more times, equivalent to {1,}
? - the previous character is optional, equivalent to {0,1}
{n,m} - the previous character repeats from n to m times (written without a space after the comma)
{n} - the previous character repeats exactly n times
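The special characters listed above can be tried out on small invented strings. In this sketch the quantifiers are written as plain * and +, and the bounded quantifier as {2,3} with no space inside the braces. All sample strings are assumptions chosen for illustration.

```python
import re

# '.' matches any single character except a newline.
dot_hits = re.findall(r"h.t", "hat hot h\nt hit")

# '^' and '$' anchor the match to the start and end of the string.
all_digits = re.search(r"^\d+$", "12345") is not None
not_all_digits = re.search(r"^\d+$", "123a5") is not None

# '*' allows zero or more repetitions, '+' requires at least one.
star_hits = re.findall(r"ab*", "a ab abb")
plus_hits = re.findall(r"ab+", "a ab abb")

# '?' makes the preceding 'u' optional.
spellings = re.findall(r"colou?r", "color colour colr")

# '{n,m}' and '{n}' bound the number of repetitions.
candidates = ["o", "oo", "ooo", "oooo"]
bounded = [s for s in candidates if re.fullmatch(r"o{2,3}", s)]
exact = [s for s in candidates if re.fullmatch(r"o{2}", s)]

print(dot_hits, star_hits, plus_hits, spellings, bounded, exact)
```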

Ranges of characters:

[A-Za-z] - any ASCII letter, lowercase or uppercase
[^0-9] - any character except a digit
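A short sketch of both ranges on an invented string: the first call collects runs of letters, the second deletes everything that is not a digit.

```python
import re

# Invented sample text for illustration.
text = "Room 42B, floor 3"

# [A-Za-z]+ matches runs of one or more ASCII letters.
letter_runs = re.findall(r"[A-Za-z]+", text)

# [^0-9] matches any character that is not a digit; removing those
# characters leaves only the digits behind.
digits_only = re.sub(r"[^0-9]", "", text)

print(letter_runs, digits_only)
```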

Metacharacters that represent a class:

\d - matches any digit
\s - matches any whitespace character, such as spaces, tabs, and newlines
\w - matches any word character, which includes letters (both uppercase and lowercase), digits, and underscores
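The three class metacharacters can be compared on one invented string that contains an underscore, a tab, and some digits.

```python
import re

# Invented sample text; note the tab between "in" and "at".
text = "user_1 logged in\tat 09:30"

# \d matches one digit at a time.
digits = re.findall(r"\d", text)

# \w+ matches runs of letters, digits, and underscores.
words = re.findall(r"\w+", text)

# \s+ matches runs of whitespace, so it works as a separator for re.split.
parts = re.split(r"\s+", text)

print(digits, words, parts)
```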

Combinations

[\w\s]+ - The combination [\w\s] matches any word character or whitespace character, and the + means one or more characters from this set.

([\w\s]+?) - The parentheses form a capture group, so whatever matches the enclosed pattern can be extracted.

The ? after the + makes the + "lazy" instead of "greedy". In regular expressions, greedy means matching as much as possible, while lazy means matching as little as possible. So [\w\s]+? matches the shortest sequence of word and whitespace characters that still lets the rest of the pattern match; with nothing after it, that is a single character.
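The difference between greedy and lazy matching shows up once something follows the group. In this sketch a trailing space is added to the pattern so both versions have to stop somewhere. The sample string is invented, and re.match is used because it anchors the match at the start of the string.

```python
import re

# Invented sample text for illustration.
text = "red green blue"

# Greedy: the group takes as much as possible while still leaving
# one space for the rest of the pattern.
greedy = re.match(r"([\w\s]+) ", text).group(1)

# Lazy: the group stops as soon as the following space can match.
lazy = re.match(r"([\w\s]+?) ", text).group(1)

# Lazy with nothing after it: a single character is already enough.
lazy_alone = re.match(r"([\w\s]+?)", text).group(1)

print(repr(greedy), repr(lazy), repr(lazy_alone))
```

Here group(1) returns the text captured by the first pair of parentheses.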