Demystifying Regular Expressions in Ruby (2/2)

While recognised as powerful, compact, and expressive, regular expressions (or RegExps) also have a reputation for being notoriously hard for humans to parse. In fact, a great developer once said this about using regular expressions: “Now you have two problems!”

In this post we follow up from part 1 and we’ll break down a regular expression that matches Markdown image links:

```ruby
markdown = <<-END
I have a graph showing incredible statistics.
You will be amazed by the clarity it brings! Behold:
![Incredible Graph](/the_graph.png "Graph")
And yet another:
![Graph2](/other_graph.png)
END
```

The code to match the image links would look something like this:

```ruby
# Matching a pattern for Markdown images that look like:
#
#   ![<alt_text>](<url> "<optional_title>")
#
if markdown =~ /!\[.*\]\(.*?( ".*")?\)/
  puts 'Ugh, I guess a markdown image is in there?'
end
```

Breaking Down a Complex Regular Expression

If you’re unfamiliar with the characters used in regular expressions for escaping, wildcard matching, grouping and repeating, then you could be forgiven for not fully understanding the above. Even if you do, it takes a bit of squinting to see the matches and escaping; it’s certainly not super readable. Let’s break it down to better understand it.

Starting with the original pattern, wrapped in the regular expression `/` markers:

```ruby
/![<alt_text>](<url> "<optional_title>")/
-                                        -
```

First, we have to escape the meta-characters that have special meanings by prefixing each with a `\`. This applies to `[`, `]`, `(` and `)`:

```ruby
/!\[<alt_text>\]\(<url> "<optional_title>"\)/
  -           - -                         -
```
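To see why the escaping step matters, here is a minimal sketch. The sample string `![a](b)` and the constant names are made up for illustration. Left unescaped, `[a]` is read as a character class and `(b)` as a group, so the pattern matches something quite different from the literal text. `Regexp.escape`, from Ruby's core library, produces the escaped form of a literal string.

```ruby
LITERAL = '![a](b)'

# Unescaped: [a] is a character class and (b) is a group,
# so this pattern matches the text "!ab", not the literal above.
UNESCAPED = /![a](b)/

# Escaped by hand: every bracket and paren is now a plain character.
ESCAPED = /!\[a\]\(b\)/

# Regexp.escape returns the escaped pattern source as a String.
GENERATED_SOURCE = Regexp.escape(LITERAL)

puts UNESCAPED.match?(LITERAL) # false
puts UNESCAPED.match?('!ab')   # true
puts ESCAPED.match?(LITERAL)   # true
puts GENERATED_SOURCE          # !\[a\]\(b\)
```

Escaping by hand is fine for a short pattern like the one in this post; `Regexp.escape` is more useful when the literal text comes from a variable.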

Next, let’s match any character (using the wildcard `.`) zero or more times (using `*`) to make the pattern work for whatever `<alt_text>`, `<url>` and `<optional_title>` happen to be:

```ruby
/!\[.*\]\(.* ".*"\)/
    --    --  --
```

Now that’s close, but there’s a small flaw. The problem is that the title should be optional, yet with the above it isn’t!

To make it optional we wrap the title match in parens `(` and `)` to group it as a single unit, then append a `?` to match exactly zero or one times:

```ruby
/!\[.*\]\(.*( ".*")?\)/  # Unescaped parens mean "group".
            -     --
```
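The effect of the optional group can be checked against the two image links from the sample Markdown at the top of the post. This is only a sketch; the constant names are made up for illustration.

```ruby
WITH_TITLE    = '![Incredible Graph](/the_graph.png "Graph")'
WITHOUT_TITLE = '![Graph2](/other_graph.png)'

# The title is mandatory in the first pattern and optional in the second.
TITLE_REQUIRED = /!\[.*\]\(.* ".*"\)/
TITLE_OPTIONAL = /!\[.*\]\(.*( ".*")?\)/

[TITLE_REQUIRED, TITLE_OPTIONAL].each do |pattern|
  [WITH_TITLE, WITHOUT_TITLE].each do |image|
    puts "#{pattern.source}  on  #{image}: #{pattern.match?(image)}"
  end
end
```

Only the combination of the title-required pattern and the untitled image prints `false`; the other three print `true`.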

Now it’s pretty close, except for a problem of greediness.

Greediness and Laziness

The greediness or laziness factor can be hard to visualise, so let’s consider what’s going on in this particular case. What we intend:

```ruby
# The example:
#   ![<alt_text>](<url> "<optional_title>")
/!\[.*\]\(.*( ".*")?\)/
          -- -----
         /        \
 the <url>        the <optional_title>
```

The URL is meant to be matched by the `.*` pattern, but that means zero or more times, and “more” means as many as possible. Because of this, it will match all the characters including the title, all the way up to the `)`! This is known as “greedy” matching: match as much as possible while still satisfying the rest of the regular expression, which is possible here since the title group is optional. This matters a lot when trying to extract the matched text.

The solution is to stop being greedy! We can do that by appending a `?`:

```ruby
/!\[.*\]\(.*?( ".*")?\)/
            -
```

Now that we’re using `.*?` we’ve made the `*` lazy, and it will match as few repetitions as possible for the `<url>` while still matching overall.
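The difference between greedy and lazy matching shows up as soon as we try to extract the URL. In the sketch below, an extra pair of parentheses is added around the URL part purely so that it can be captured; that capture group and the constant names are assumptions for this demonstration, not part of the pattern built above.

```ruby
IMAGE = '![Incredible Graph](/the_graph.png "Graph")'

# Group 1 captures the URL, group 2 captures the optional title.
GREEDY_URL = /!\[.*\]\((.*)( ".*")?\)/
LAZY_URL   = /!\[.*\]\((.*?)( ".*")?\)/

GREEDY_MATCH = GREEDY_URL.match(IMAGE)
LAZY_MATCH   = LAZY_URL.match(IMAGE)

# Greedy: the URL swallows the title, and the title group matches zero times.
puts GREEDY_MATCH[1].inspect # "/the_graph.png \"Graph\""
puts GREEDY_MATCH[2].inspect # nil

# Lazy: the URL stops as soon as the rest of the pattern can match.
puts LAZY_MATCH[1].inspect   # "/the_graph.png"
puts LAZY_MATCH[2].inspect   # " \"Graph\""
```

Both patterns match the string as a whole, which is why the problem is easy to miss until the captured text is inspected.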

Readability

We should aim for code that’s easy to read at a glance, since developers spend a lot more time reading and understanding code than writing it.

One of the tricks Ruby gives us is the ability to break up long regular expressions and even add comments for complex expressions:

```ruby
IMAGE_REGEXP = /
  !\[.*\]     # The alt text of the Markdown.
  \(          # Open paren for image URL and optional title.
  .*?         # The image URL (the '?' makes it NOT greedy).
  (\ ".*")?   # The optional title (here the '?' means limit to 0 or 1).
  \)          # Close paren.
/x
```

Notice that trailing `x` on the last line? It’s called “free-spacing” mode and helps tremendously. In free-spacing mode whitespace in the pattern is ignored, and you can insert normal Ruby comments. Just be careful and escape your spaces (with `\`) or they will be completely ignored.
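The escaped-space warning is easy to overlook because a pattern with a forgotten escape often still matches. The sketch below compares two free-spacing patterns that differ only in that one backslash. The capture group around the URL and the constant names are assumptions added for this demonstration.

```ruby
TITLED_IMAGE = '![Incredible Graph](/the_graph.png "Graph")'

SPACE_ESCAPED = /
  !\[.*\]     # alt text
  \(          # open paren
  (.*?)       # URL, captured for this demo
  (\ ".*")?   # optional title, space escaped
  \)          # close paren
/x

SPACE_NOT_ESCAPED = /
  !\[.*\]     # alt text
  \(          # open paren
  (.*?)       # URL, captured for this demo
  ( ".*")?    # optional title, space NOT escaped so it is ignored
  \)          # close paren
/x

ESCAPED_URL     = SPACE_ESCAPED.match(TITLED_IMAGE)[1]
NOT_ESCAPED_URL = SPACE_NOT_ESCAPED.match(TITLED_IMAGE)[1]

puts ESCAPED_URL.inspect     # "/the_graph.png"
puts NOT_ESCAPED_URL.inspect # "/the_graph.png " (note the trailing space)
```

Without the backslash the title group no longer consumes the space, so the lazy URL part has to stretch over it and the captured URL ends with a stray space.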

Summary

The process used above is a useful one to go through when building up your regular expression:

1. Write the match as you require it literally.
2. Escape any meta-characters with a `\`.
3. Use wildcards and repetition meta-characters as required.
4. Think about greediness.
5. Try to stay sane!

Here’s the full list of repetition meta-characters:

* `*` - Zero or more times.
* `+` - One or more times.
* `?` - Zero or one times (optional).
* `{n}` - Exactly n times.
* `{n,}` - n or more times.
* `{,m}` - m or fewer times.
* `{n,m}` - At least n and at most m times.

The docs at ruby-doc.org include quite good explanations and references for all the meta-characters.
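The repetition meta-characters listed above can be tried out with a small table of examples. This is a sketch only: the sample strings and constant names are made up, and the `\A` and `\z` anchors (start and end of string) are added here so that each pattern has to match the whole string rather than a part of it.

```ruby
# Each entry: [pattern, strings that should match, strings that should not].
EXAMPLES = [
  [/\Aa*\z/,     ['', 'a', 'aaa'], ['b']],
  [/\Aa+\z/,     ['a', 'aaa'],     ['']],
  [/\Aa?\z/,     ['', 'a'],        ['aa']],
  [/\Aa{2}\z/,   ['aa'],           ['a', 'aaa']],
  [/\Aa{2,}\z/,  ['aa', 'aaaa'],   ['a']],
  [/\Aa{,2}\z/,  ['', 'a', 'aa'],  ['aaa']],
  [/\Aa{2,3}\z/, ['aa', 'aaa'],    ['a', 'aaaa']]
]

# For every pattern, check both lists and record whether it behaved as listed.
RESULTS = EXAMPLES.map do |pattern, matching, non_matching|
  ok = matching.all? { |s| pattern.match?(s) } &&
       non_matching.none? { |s| pattern.match?(s) }
  puts "#{pattern.source.ljust(12)} #{ok ? 'behaves as listed' : 'unexpected result'}"
  ok
end
```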