
Capturing Matches in Ruby Regular Expressions (bonus)

Adam Davies
June 21, 2016

In this post we follow up from part 1 and part 2 by looking at how to capture what was matched in Markdown image links:

markdown = <<-END
I have a graph showing incredible statistics. You will be amazed by the clarity it brings! Behold:
![Incredible Graph](/the_graph.png "Graph")
And yet another:
![Graph2](/other_graph.png)
END

Capturing the Matches

In the previous post, we saw that ? serves two purposes:

1) It can mark a greedy repetition as lazy: .*?
2) It can mark a group as optional: (.*)?
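As a quick refresher, here is a minimal sketch of both uses of ? side by side. The sample string and the variable names are made up for illustration and do not come from the earlier posts.

```ruby
text = 'say "one" and "two"'   # made-up sample string

# Greedy: .* runs on to the last closing quote it can reach.
greedy = text[/".*"/]
# Lazy: .*? stops at the first closing quote.
lazy = text[/".*?"/]

# Optional group: the trailing ? lets (\.png) appear zero or one times.
optional = /\Agraph(\.png)?\z/

puts greedy                            # "one" and "two"
puts lazy                              # "one"
puts optional.match?("graph")          # true
puts optional.match?("graph.png")      # true
puts optional.match?("graph.png.png")  # false
```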


IMAGE_REGEXP = /
  !\[.*\]      # The alt text of the Markdown.
  \(           # Open parentheses for the image URL and optional title.
    .*?        # The image URL (the '?' makes it NOT greedy).
    (\ \".*")? # The optional title (here the '?' means limit to 0 or 1).
  \)           # Close parentheses.
/x

In the complicated Markdown regular expression above, we used grouping parentheses to apply the ? to the optional title. Grouping has another very useful feature: capturing. Whatever a group matches, we can extract for each match. Furthermore, groups can be given names to make them easier to look up.

First, let’s apply grouping to allow us to capture each interesting piece of data:


IMAGE_REGEXP = /
  !\[(.*)\]       # The alt text of the Markdown.
  \(              # Open parentheses for the image URL and optional title.
    (.*?)         # The image URL (the '?' makes it NOT greedy).
    (\ \"(.*)\")? # The optional title (here the '?' means limit to 0 or 1).
  \)              # Close parentheses.
/x

Now we can scan for matches:


markdown.scan(IMAGE_REGEXP)
=> [["Incredible Graph", "/the_graph.png", " \"Graph\"", "Graph"], ["Graph2", "/other_graph.png", nil, nil]]

Oops, we have captured the title twice because of the group we added earlier to make it optional. scan returns that outer group first, and it shows up with a leading space and quotes.

We can correct this by grouping without capturing, using the ?: prefix:


IMAGE_REGEXP = /
  !\[(.*)\]         # The alt text of the Markdown.
  \(                # Open parentheses for image URL and optional title.
    (.*?)           # The image URL (the '?' makes it NOT greedy).
    (?:\ \"(.*)\")? # The optional title (the ?: prefix stops the outer group from capturing).
  \)                # Close parentheses.
/x

Now we get:


markdown.scan(IMAGE_REGEXP)
=> [["Incredible Graph", "/the_graph.png", "Graph"], ["Graph2", "/other_graph.png", nil]]
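The same effect is easier to see on a smaller pattern. This is only a sketch: the date string and both patterns are hypothetical and chosen just to isolate the difference between ( ) and (?: ).

```ruby
date = "2016-06"   # made-up sample string

# The outer group is only there to make "-06" optional, but it captures too.
capturing     = /(\d+)(-(\d+))?/
# Same pattern with the outer group marked as non-capturing.
non_capturing = /(\d+)(?:-(\d+))?/

with_outer    = capturing.match(date).captures
without_outer = non_capturing.match(date).captures

p with_outer     # ["2016", "-06", "06"]
p without_outer  # ["2016", "06"]
```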

Naming the Captures

The above is useful, but we can do one more thing: name the captures. This is done in much the same way as marking groups that don't capture. In this case, we use a prefix of ?<my_name> to name the capture:

IMAGE_REGEXP = /
  !\[(?<alt_text>.*)\]       # The alt text of the Markdown.
  \(                         # Open parentheses for image URL and optional title.
    (?<url>.*?)              # The image URL.
    (?:\ \"(?<title>.*)\")?  # The optional title.
  \)                         # Close parentheses.
/x

Now with these named captures we can get a more readable result using #match:

match = markdown.match(IMAGE_REGEXP)
=> #<MatchData alt_text:"Incredible Graph" url:"/the_graph.png" title:"Graph">
match[:url]
=> "/the_graph.png"
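A MatchData with named captures can be read in a few other ways as well. The sketch below uses a hypothetical DATE_REGEXP and a made-up input string, only to keep the example short.

```ruby
# Hypothetical pattern and input, used only for illustration.
DATE_REGEXP = /(?<year>\d{4})-(?<month>\d{2})/
match = DATE_REGEXP.match("Posted 2016-06")

p match[:year]     # "2016"  (lookup by symbol)
p match["month"]   # "06"    (lookup by string)
p match[1]         # "2016"  (numeric indexes still work)
p match.names      # ["year", "month"]
p match.captures   # ["2016", "06"]
p match.pre_match  # "Posted "
```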

You may have noticed that #match returns only the first match found, as a MatchData object that can be indexed by capture name, much like a Hash. This is in contrast to #scan, which returns every match, but as plain arrays of captures with no names attached.

We can do the following little trick to combine these:

named_captures = IMAGE_REGEXP.names
=> ["alt_text", "url", "title"]
array_of_matches = markdown.scan(IMAGE_REGEXP)
=> [["Incredible Graph", "/the_graph.png", "Graph"], ["Graph2", "/other_graph.png", nil]]
array_of_matches.map {|match| Hash[named_captures.zip(match)] }
=> [{"alt_text"=>"Incredible Graph", "url"=>"/the_graph.png", "title"=>"Graph"}, {"alt_text"=>"Graph2", "url"=>"/other_graph.png", "title"=>nil}]
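If you would rather have a real MatchData for every match, one possible alternative is to call Regexp#match repeatedly with a start position. This is only a sketch under stated assumptions: all_matches is a hypothetical helper (not part of Ruby or of this series), and the sample string and single-line pattern are made up for the example.

```ruby
# Hypothetical helper: collects a MatchData for every match in the string.
def all_matches(string, regexp)
  matches = []
  position = 0
  while (match = regexp.match(string, position))
    matches << match
    # Resume after this match; step one extra character after an empty match
    # so the loop always makes progress.
    position = match.end(0) == match.begin(0) ? match.end(0) + 1 : match.end(0)
  end
  matches
end

sample = "![A](/a.png \"First\")\n![B](/b.png)"   # made-up sample
image_regexp = /!\[(?<alt_text>.*)\]\((?<url>.*?)(?: "(?<title>.*)")?\)/

results = all_matches(sample, image_regexp)
results.each do |match|
  puts "#{match[:alt_text]} -> #{match[:url]} (title: #{match[:title].inspect})"
end
# A -> /a.png (title: "First")
# B -> /b.png (title: nil)
```

The trade-off is a little more code in exchange for keeping everything MatchData offers, such as offsets and pre_match, for each match.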

StringScanner

As a last trick, we’ll look at StringScanner. It’s a useful class that provides a more object-oriented imperative style of matching, since it maintains state. With it, we can search for matches, then continue on from the last position.

Here’s an example, iterating through the matches one at a time:

require 'strscan'

scanner = StringScanner.new(markdown)

# scan_until advances to the end of the next match and returns nil when none is left.
while scanner.scan_until(IMAGE_REGEXP)
  puts "We have #{scanner[:alt_text]}: #{scanner[:url]}"
end

Output:

We have Incredible Graph: /the_graph.png
We have Graph2: /other_graph.png
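To make the "maintains state" part more concrete, here is a small sketch that inspects the scanner between calls. The input string is made up; scan_until, matched, pos and rest are regular StringScanner methods.

```ruby
require 'strscan'

scanner = StringScanner.new("3 apples, 12 pears")   # made-up sample string

first      = scanner.scan_until(/\d+/)  # text consumed up to and including the match
first_pos  = scanner.pos                # the scanner remembers where it stopped
second     = scanner.scan_until(/\d+/)  # continues from that position
last_match = scanner.matched            # just the part the pattern matched
third      = scanner.scan_until(/\d+/)  # nil: no digits left, position unchanged
remaining  = scanner.rest               # whatever has not been consumed yet

p first       # "3"
p first_pos   # 1
p second      # " apples, 12"
p last_match  # "12"
p third       # nil
p remaining   # " pears"
```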

Final Thoughts

Regular expressions are tricky to learn, but don’t be discouraged: even the most experienced developers need to look up their syntax every now and then. I hope these few posts have helped teach you the basics, or find the answer to that bug you’ve been banging your head on your desk over. Thanks for reading!