Learn Regular Expressions with this Free Course

Regular expressions, or "regex" for short, are an incredibly powerful tool that every developer should have in their toolkit. In a nutshell, they allow you to define concise patterns to match and manipulate text. With regex, you can validate user input, search and replace text, extract data from unstructured sources, and much more.

While they may look intimidating at first with all those strange symbols, once you understand the basic building blocks, you'll be surprised how quickly you can learn to harness their power. This free course will teach you everything you need to know to start being productive with regular expressions.

The Fundamentals of Regex

At the most basic level, a regular expression is just a fancy way to specify a search pattern. It's made up of two types of characters:

  1. Literal characters that match the exact character specified
  2. Special characters (also called "metacharacters") that have special meaning

For example, the regex pattern cat contains only literal characters and will match the exact text "cat" (case-sensitive). But the pattern c.t uses the special character . which matches any single character. So this would match "cat", "cot", "c t", and so on.

Here are a few of the most common special characters to get you started:

Character   Meaning
.           Matches any single character (except newline)
*           Matches 0 or more of the preceding character
+           Matches 1 or more of the preceding character
?           Matches 0 or 1 of the preceding character
^           Matches the start of the string
$           Matches the end of the string

The special characters ^ and $ are called "anchors" because they match a position rather than an actual character.

Let's look at a few examples to make this concrete:

/^cat/    Matches "cat" at the start of a string
/cat$/    Matches "cat" at the end of a string  
/c.t/     Matches "cat", "cot", "c1t", "c@t", etc.
/cat*/    Matches "ca", "cat", "catt", "cattt", etc.
/cat+/    Matches "cat", "catt", "cattt", etc. but not "ca"
/cat?/    Matches "ca" or "cat"; in "catt" it matches only the leading "cat"

Play around with these patterns in your favorite programming language to get a feel for how they work. Most languages have built-in support for regular expressions.
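As one way to try these out, here is a minimal sketch using Python's built-in re module. The word list and the matching helper are made up for illustration. Python has no /pattern/ literal syntax, so the patterns are written as raw strings instead.

```python
import re

# Hypothetical sample words, chosen only for illustration.
words = ["ca", "cat", "catt", "cot", "concat", "catalog"]

def matching(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere."""
    return [w for w in candidates if re.search(pattern, w)]

starts_with_cat = matching(r"^cat", words)  # anchored at the start
ends_with_cat = matching(r"cat$", words)    # anchored at the end
any_middle = matching(r"c.t", words)        # any single character between c and t
one_or_more_t = matching(r"cat+", words)    # "ca" followed by at least one "t"

# re.search finds a match anywhere in the string; re.fullmatch requires
# the entire string to match the pattern.
whole_string = [w for w in words if re.fullmatch(r"cat?", w)]

print(starts_with_cat)  # ['cat', 'catt', 'catalog']
print(whole_string)     # ['ca', 'cat']
```

Note the difference between searching and matching the whole string: "concat" contains "cat", so an unanchored search finds it even though the word does not start with "cat".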

Matching Sets of Characters

What if you want to match one of several characters in a particular position? This is where character classes come in handy. You define them by enclosing the allowable characters in square brackets.

For instance, the regex c[aou]t will match "cat", "cot", or "cut", but not "cet" or "cit". Inside the brackets, most characters (including the special characters we saw before) match themselves literally.

A couple more things to know about character classes:

  • You can specify a range of characters with a hyphen. So [a-z] matches any lowercase letter, [0-9] matches any digit, etc.
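  • The caret ^ negates the class if it's the first character after the opening bracket. [^aeiou] matches any single character that is not a lowercase vowel. That includes consonants, but also digits, spaces, punctuation, and uppercase letters.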

Here are a few practical examples of character classes in action:

/[A-Z][a-z]*/          Matches strings like "Apple", "Boat", "Zebra", etc.
/[0-9]{3}-[0-9]{2}-[0-9]{4}/   Matches US Social Security numbers
/[+\-]?(\d+(\.\d*)?|\.\d+)/    Matches floating point numbers with optional sign  

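Numeric ranges such as [0-9] are especially useful for validating structured strings like ID numbers, dates, etc. The \d in the last pattern is a common shorthand for a digit, roughly equivalent to [0-9]. More on validation in a bit.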
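To see character classes at work, here is a small sketch using Python's re module. The sample strings are invented for illustration, and fullmatch is used so that the whole string has to fit the pattern.

```python
import re

# Exactly one of a, o, or u is allowed in the middle position.
candidates = ["cat", "cot", "cut", "cet", "cit"]
accepted = [w for w in candidates if re.fullmatch(r"c[aou]t", w)]

# A negated class matches ANY character not listed, not just consonants.
non_vowels = re.findall(r"[^aeiou]", "a1 bE!")

# The SSN-shaped and float-shaped patterns from the examples above.
ssn_like = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")
float_like = re.compile(r"[+\-]?(\d+(\.\d*)?|\.\d+)")

floats_ok = [s for s in ["3.14", "-.5", "+42", "7.", "abc", "."]
             if float_like.fullmatch(s)]

print(accepted)    # ['cat', 'cot', 'cut']
print(non_vowels)  # ['1', ' ', 'b', 'E', '!']
print(floats_ok)   # ['3.14', '-.5', '+42', '7.']
```

Keep in mind that the SSN pattern only checks the shape of the string (three digits, two digits, four digits); it says nothing about whether the number was ever issued.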

Quantifiers: How Many Matches?

In the previous section, you may have noticed the cryptic-looking {3} in the SSN regex. This is an example of a quantifier, which specifies how many times the preceding character or group should be matched.

The most common quantifiers are:

Quantifier   Meaning
*            Match 0 or more times
+            Match 1 or more times
?            Match 0 or 1 time
{n}          Match exactly n times
{n,}         Match at least n times
{n,m}        Match between n and m times

So X{3} would match "XXX" while \d{1,3} would match strings of digits between 1 and 3 characters long, like "7", "42", and "538".

Be careful with quantifiers – they are "greedy" by default, meaning they will match as much as possible while still allowing the overall pattern to match. Add a ? after the quantifier (for example .+? or \d{1,3}?) to make it "lazy" instead, so that it matches as little as possible.
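The difference between greedy and lazy matching is easiest to see side by side. The following sketch uses Python's re module on a made-up sentence containing two quoted words.

```python
import re

text = 'say "hello" and "bye"'

# Greedy: .+ grabs as much as it can, so the match runs from the first
# quote all the way to the LAST quote in the string.
greedy = re.findall(r'".+"', text)

# Lazy: .+? grabs as little as it can, so each quoted word is matched
# separately.
lazy = re.findall(r'".+?"', text)

# A bounded quantifier is greedy too: five digits in a row are split
# into a run of three followed by a run of two.
digit_runs = re.findall(r"\d{1,3}", "7 42 538 12345")

print(greedy)      # ['"hello" and "bye"']
print(lazy)        # ['"hello"', '"bye"']
print(digit_runs)  # ['7', '42', '538', '123', '45']
```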

Alternation and Grouping

Sometimes you need to match one of several possible sub-patterns. You can use the pipe character | to separate the alternatives.

For example, cat|dog will match either "cat" or "dog". When combined with grouping parentheses, you can build up sophisticated alternatives like gupp(y|ies) to match "guppy" or "guppies".

You can also use alternation to match slight variations of the same pattern. Consider this regex that matches several common date formats:

/\d{4}-\d{2}-\d{2}|\d{1,2}\/\d{1,2}\/\d{2,4}/

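It uses alternation to allow either a "YYYY-MM-DD" format or a slash-separated format such as "M/D/YY" or "MM/DD/YYYY". Quite an improvement over writing separate regexes for each format! Note that it only checks the shape of the string, not whether the month and day are in range.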
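Here is a short sketch of alternation in Python's re module, using the two patterns discussed above. The sample sentence is invented, and the forward slashes are left unescaped because Python raw strings do not use / as a delimiter.

```python
import re

# Grouping limits the alternation to the word ending.
guppy = re.compile(r"gupp(y|ies)")
guppy_hits = [w for w in ["guppy", "guppies", "guppys"] if guppy.fullmatch(w)]

# Either YYYY-MM-DD or a slash-separated date.
date_like = re.compile(r"\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4}")
found = date_like.findall("Due 2024-03-15, moved from 3/1/24 to 03/15/2024.")

# The pattern checks shape only, so an impossible date still matches.
shape_only = date_like.fullmatch("99/99/99") is not None

print(guppy_hits)  # ['guppy', 'guppies']
print(found)       # ['2024-03-15', '3/1/24', '03/15/2024']
print(shape_only)  # True
```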

Capturing Groups

Parentheses have another important function in regular expressions besides grouping – they create capturing groups that allow you to extract parts of the match for later use.

Each set of parentheses establishes a new captured group, numbered from left to right. You can reference these captures in the replacement string of a search-and-replace operation or retrieve them programmatically.

Captured groups are especially handy for isolating important parts of a structured string like a URL:

/(https?):\/\/([^\/\s]+)(.*)/

This breaks a URL down into:

  1. The protocol (http or https)
  2. The host (the domain name, plus a port number if one is present)
  3. The path and query string

You could then rearrange these pieces programmatically as needed, insert them into a template, etc. Extracting data from strings is one of the most powerful applications of regular expressions.
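As a rough illustration, this is how the captured groups could be retrieved with Python's re module. The URL uses the reserved placeholder domain example.com, and the date rewrite at the end is an extra example added here to show group references in a replacement string.

```python
import re

url_parts = re.compile(r"(https?)://([^/\s]+)(.*)")

# example.com is a placeholder domain, used only for illustration.
match = url_parts.search("https://example.com/docs/index.html?lang=en")

# groups() returns the captures in left-to-right order.
protocol, host, rest = match.groups()

# group(0) is always the entire matched text.
whole = match.group(0)

# In a replacement string, \1, \2, \3 refer back to the captured groups.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2024-03-15")

print(protocol, host, rest)  # https example.com /docs/index.html?lang=en
print(reordered)             # 15/03/2024
```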

Lookaround

Another advanced feature of regular expressions is lookaround, which allows you to match a pattern only if it is preceded or followed by another pattern.

There are four types of lookaround:

Type                  Syntax
Positive lookahead    (?=...)
Negative lookahead    (?!...)
Positive lookbehind   (?<=...)
Negative lookbehind   (?<!...)

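As an example, (?<=\$)\d+ uses a positive lookbehind to match a run of digits only if it is immediately preceded by a dollar sign, and \d{3}-\d{3}-\d{4}(?!\d) uses a negative lookahead to match a US phone number only if it is not immediately followed by another digit.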

Lookahead and lookbehind are considered "zero-width assertions", meaning they do not consume any characters in the string; they simply assert whether a match is possible at the current position.
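All four kinds of lookaround can be sketched in Python's re module as follows. The sample strings are made up. Notice that the text inside the lookaround never appears in the results, which is what "zero-width" means in practice.

```python
import re

prices = "apples 30 USD, pears 45 EUR, plums 12 USD"

# Positive lookahead: digits only if followed by " USD".
usd = re.findall(r"\d+(?= USD)", prices)

# Negative lookahead: whole numbers NOT followed by " USD". The \b on
# each side stops the engine from backtracking to part of a number.
not_usd = re.findall(r"\b\d+\b(?! USD)", prices)

text = "cost $15 or 20 units"

# Positive lookbehind: digits only if preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: numbers NOT preceded by a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)

print(usd)               # ['30', '12']
print(not_usd)           # ['45']
print(after_dollar)      # ['15']
print(not_after_dollar)  # ['20']
```

One caveat that is specific to Python's re module: a lookbehind pattern has to have a fixed width, so something like (?<=\$+) is rejected. Other regex flavors have different rules.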

Bringing It All Together

Now that you know the key elements of regular expressions, let's see how they all fit together to tackle some real-world problems.

Example 1: Validating Email Addresses

Here's a regular expression that matches most valid email addresses:

/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/i 

Breaking it down:

  • [A-Z0-9._%+-]+: Match the username, allowing alphanumeric characters and some common symbols
  • @: Match the @ sign
  • [A-Z0-9.-]+: Match the domain name, allowing alphanumeric characters and hyphens/dots
  • \.: Match a literal dot for the top-level domain separator (the backslash escapes the dot so it no longer means "any character")
  • [A-Z]{2,}: Match the top-level domain, which must be at least 2 alpha characters
  • i: The "ignore case" flag makes the whole pattern case-insensitive

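It's not 100% perfect, as valid email addresses can contain other special characters, but it will match the vast majority of email addresses in use today. Also note that the pattern has no ^ and $ anchors, so to validate an entire input you need to anchor it or use a whole-string match.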
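Here is one way the email pattern might be used in Python. The helper name looks_like_email is hypothetical, the i flag becomes re.IGNORECASE, and fullmatch is used so that the entire string has to fit the pattern rather than just some part of it.

```python
import re

# The /.../i flag from the pattern above corresponds to re.IGNORECASE.
email_like = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}", re.IGNORECASE)

def looks_like_email(value):
    """Return True if the whole string has the shape of an email address."""
    return email_like.fullmatch(value) is not None

# Hypothetical inputs, using reserved example domains.
samples = [
    "jane.doe@example.com",
    "user+tag@mail.example.org",
    "no-at-sign.example.com",
    "user@localhost",
    "user@example.c",
]
results = [looks_like_email(s) for s in samples]

print(results)  # [True, True, False, False, False]
```

A check like this only says the string is shaped like an address. Whether the mailbox actually exists can only be confirmed by sending a message to it.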

Example 2: Parsing URLs

We touched on this briefly earlier, but let's flesh it out into a more robust example:

/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/ 

Here's what each piece does:

  • ^: Matches the start of the string
  • (https?:\/\/)?: Optionally matches the protocol (capturing it)
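  • ([\da-z\.-]+): Matches the subdomain and domain (capturing them)
  • \.([a-z\.]{2,6}): Matches a literal dot, then the top-level domain (capturing it, 2-6 characters long)
  • ([\/\w \.-]*)*: Optionally matches the path (capturing it) as any run of slashes, word characters, spaces, dots, and hyphens. Query strings and hash anchors are not covered, because ? and # are not in the character class. The outer * is redundant and can cause heavy backtracking on long inputs that fail to match.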
  • \/?: Optionally matches a trailing slash
  • $: Matches the end of the string

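This regex performs a rough check that a string looks like a URL and also splits out the constituent parts for further processing. You could use the captured groups to construct a canonical version of the URL, inspect the path, etc. Because the path character class contains neither ? nor #, a URL with a query string or fragment will not match, so treat the pattern as a starting point rather than a complete validator.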
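The sketch below tries the URL pattern out in Python. It makes one deliberate change, which is an assumption of this example rather than part of the original pattern: the outer * on the last group is dropped. The nested quantifier accepts the same strings but can backtrack heavily on input that does not match. The helper name split_url and the sample URLs are hypothetical.

```python
import re

# Same pattern as above, minus the redundant outer * on the path group.
url_like = re.compile(
    r"^(https?://)?([\da-z.-]+)\.([a-z.]{2,6})([/\w .-]*)/?$"
)

def split_url(value):
    """Return (protocol, domain, tld, path), or None if there is no match."""
    m = url_like.match(value)
    return m.groups() if m else None

full = split_url("https://www.example.com/docs/page.html")
bare = split_url("example.com")

# The path class has no ? or =, so a query string makes the match fail.
with_query = split_url("https://example.com/search?q=regex")

print(full)        # ('https://', 'www.example', 'com', '/docs/page.html')
print(bare)        # (None, 'example', 'com', '')
print(with_query)  # None
```

When the optional protocol group does not take part in the match, Python reports it as None rather than an empty string. For real-world URL handling, a dedicated parser such as the standard library's urllib.parse is usually a safer choice than a hand-written pattern.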

Where to Go From Here

Congratulations, you now have a solid foundation in regular expressions! But there is always more to learn. Here are some resources to continue your journey:

  • RegexOne (regexone.com) – Interactive regex tutorial
  • Regular-Expressions.info (regular-expressions.info) – Comprehensive regex reference and tutorials
  • Regex101 (regex101.com) – Online regex tester and debugger
  • Mastering Regular Expressions by Jeffrey E. F. Friedl – The definitive book on regex

The best way to solidify your understanding is through practice. Take every opportunity to use regex in your own projects. Write regexes to validate form inputs, scrape data from web pages, analyze log files, and more. Over time, you'll start to see regex patterns everywhere!

Also remember that while regex is a powerful tool, it's not always the best tool for the job. For complex parsing tasks, you may be better off using a dedicated parsing library. And always beware of the dreaded regex-based HTML parsing!

Happy coding, and may your regexes always match what you intend!
