Regular expressions
A regular expression (or regex) is a string of characters, some of which are reserved control characters, that represents a pattern [1], i.e. a string designed to match a particular sequence of characters. Regular expressions are a basic tool for searching text and are ubiquitous in computing.
Getting started
There are many editors and tools with regex functionality. Here are a few examples (please feel free to add or remove entries if you find better ones):
- Regex tester - try your hand at regex here
- Regex101 - compose and test your regex
- meta:User:Pathoschild/Scripts/Regex menu framework - a simple and useful wiki-editing JavaScript tool
- Codeproject
- [1] - a useful editor with regex functionality
- Geany editor - a flexible, extensible free and open source editor that supports regex for search and replace operations
- Regexps manual - Emacs regular expression manual
- Regex tester - a Firefox add-on
Learning materials
A lightning introduction
There are several "dialects" of regular expressions (e.g. JavaScript, Perl, PHP, Python) which differ slightly in grammar. Let us focus on Python regex for the moment (because I happen to have a reference [2] for it).
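As a first taste of the Python dialect, here is a minimal sketch using the standard re module. The sample text and the variable names are made up for illustration only.

```python
import re

# A pattern without control characters matches only itself.
pattern = re.compile(r"regex")

# Illustrative sample text (an assumption, not taken from the lesson).
text = "A regex is a pattern."

# search() scans the text and returns a match object, or None if nothing matches.
match = pattern.search(text)

found = match.group(0) if match else None   # the matched substring
start = match.start() if match else -1      # index where the match begins
print(found, start)
```

Writing the pattern as a raw string (the r prefix) keeps Python from interpreting backslashes before the regex engine sees them.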
Control characters
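- Python regex has the following control characters:
\-.*+?$<!=|()[]^:#{}
Some of these (-, <, !, =, : and #) are special only in certain contexts, for example inside a character set [...] or right after (?.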
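To see why control characters matter, the sketch below compares a string used directly as a pattern with the same string after re.escape, a standard library function that backslash-escapes special characters. The sample strings are illustrative assumptions.

```python
import re

# Illustrative string containing two control characters: "." and "*".
literal = "a.b*c"

# re.escape() puts a backslash in front of the special characters.
escaped = re.escape(literal)

# The escaped pattern matches exactly the literal text and nothing else.
matches_itself = re.fullmatch(escaped, "a.b*c") is not None
matches_other = re.fullmatch(escaped, "aXbbc") is not None

# Unescaped, "." means "any character" and "b*" means "zero or more b",
# so the raw string used as a pattern matches quite different text.
unescaped_matches_other = re.fullmatch(literal, "aXbbc") is not None

print(escaped, matches_itself, matches_other, unescaped_matches_other)
```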
First examples
[please verify]
- Any string (e.g. abcdefg) which does not contain any control characters is trivially a regular expression ("regex") pattern. It matches only itself.
- The pattern [A-Z] matches a single character between A and Z (in the ASCII table).
- A backslash (\) followed by any control character, such as \. or even the backslash itself \\, matches that character literally (this pattern is called an "escape"). In our examples, \. matches the single dot . and \\ matches the backslash.
- Combining the two examples above, the pattern [A-Za-z0-9\-] matches any single alphanumeric character or the dash "-".
- The pattern \n matches a newline.
- The pattern abc.xyz matches a string that starts with abc, then contains exactly one character of any kind except a newline, then ends with xyz.
- The pattern a* matches zero or more consecutive "a" characters, taking as many as are available; it therefore also matches the empty string "".
- Combining the previous two examples, we get a very common pattern: abc.*xyz matches a string which starts with "abc" and ends with "xyz", with the longest available run (possibly empty) of any characters except the newline in between.
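The examples above can be tried out in Python. The following is a small sketch using the standard re module; the helper name fits and all test strings are assumptions chosen for illustration.

```python
import re

def fits(pattern, text):
    """Hypothetical helper: True if the WHOLE text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

results = {
    "literal": fits(r"abcdefg", "abcdefg"),             # plain string matches itself
    "upper": fits(r"[A-Z]", "Q"),                       # one character in A..Z
    "lower_rejected": fits(r"[A-Z]", "q"),              # lowercase is outside A..Z
    "escaped_dot": fits(r"\.", "."),                    # \. is a literal dot
    "escaped_dot_other": fits(r"\.", "x"),              # ... and nothing else
    "escaped_backslash": fits(r"\\", "\\"),             # \\ is one literal backslash
    "one_char_between": fits(r"abc.xyz", "abc-xyz"),    # . stands for one character
    "needs_one_char": fits(r"abc.xyz", "abcxyz"),       # . cannot match nothing
    "dot_skips_newline": fits(r"abc.xyz", "abc\nxyz"),  # . does not match a newline
    "star_empty": fits(r"a*", ""),                      # zero repetitions are allowed
    "star_many": fits(r"a*", "aaaa"),
}

# findall() returns every non-overlapping match, here one character at a time.
chars = re.findall(r"[A-Za-z0-9\-]", "a-Z_9!")

# \n matches the newline between the two words.
has_newline = re.search(r"\n", "one\ntwo") is not None

# .* is greedy: it runs to the LAST "xyz", not the first one.
greedy = re.search(r"abc.*xyz", "abc 1 xyz 2 xyz !").group(0)

print(results, chars, has_newline, greedy)
```

Here fullmatch is used so that each check tests the entire string; search, by contrast, finds the pattern anywhere inside a longer text.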
Exercises
- Question: What is [A-Za-z0-9\-]?
- Write a regular expression to match (a) the URL of any Wikiversity page; (b) the URL of any page on any Wikimedia site; and (c) the email addresses of all your friends. Check with a regex editor that your regex actually works.
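Besides a regex editor, a proposed solution can also be checked with a few lines of Python. The sketch below is one possible way to do it, not a solution to the exercises: the function name check_pattern is hypothetical, and the pattern and sample strings are placeholders taken from the lesson above.

```python
import re

def check_pattern(pattern, should_match, should_not_match):
    """Return a list of failure messages; an empty list means every check passed."""
    compiled = re.compile(pattern)
    failures = []
    for text in should_match:
        if compiled.fullmatch(text) is None:
            failures.append(f"expected a match: {text!r}")
    for text in should_not_match:
        if compiled.fullmatch(text) is not None:
            failures.append(f"expected no match: {text!r}")
    return failures

# Placeholder pattern and samples; replace them with your own.
failures = check_pattern(
    r"abc.xyz",
    should_match=["abc-xyz", "abc1xyz"],
    should_not_match=["abcxyz", "abc\nxyz"],
)
print(failures or "all checks passed")
```

Listing strings that must not match is as useful as listing those that must, since an overly permissive pattern passes every positive test.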
Write your proposed solutions below
Further lessons
[proposals]
- /Basics - the bare minimum to get one started
- /Groups
- /How a regex engine works
- /Lookahead and lookbehind
- /Regex objects in python
- /The good and the bad
- /Cookbook
Wikimedia links
- b:regular expressions
- w:regular expressions
- mediawiki:titleblacklist - an application on wikiversity
External links
Notes
- ↑ Alex Martelli, Python in a Nutshell, p. 203
- ↑ Alex Martelli, Python in a Nutshell, ISBN 0596100469