
Demystifying Regular Expressions in Ruby (1/2)

Adam Davies
April 20, 2016

While being recognised as powerful, compact, and expressive, Regular Expressions (or RegExps) also have a reputation for being notoriously hard for humans to parse. In fact, a great developer once said this about them:

shaving yaks in a rabbit hole
sum it up: "for your sanity, don't do regexp" :D

In this series of posts, sanity is preserved as we review by example. We’ll see how RegExps are particularly effective at finding patterns in text, along with some less well-known tricks that can improve readability.

This first post focuses on finding matches in text, along with some details on how it all works in Ruby.

Kinds of Characters

There are various kinds of characters used in RegExp. Some common ones include:

  • Literals: match the same character in the target string.
  • Escaping: \ escapes a meta-character so that it is matched as a literal.
  • Wildcard: . matches any single character (except a newline, by default).
  • Character classes: a set of characters, any one of which may match at that position.
  • Repetition: + matches the preceding item one or more times, and * matches it zero or more times.

A few examples of character classes are:

```ruby
[aeiou]      # Any vowel.
\w           # Any word character.
[[:blank:]]  # Space or tab.
```
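As a rough sketch of how these pieces behave, the snippet below tries each kind of character against a made-up sample string. The strings and variable names are assumptions for illustration only. It relies on String#[], which returns the first matching substring, or nil when nothing matches.

```ruby
# Hypothetical sample text, chosen only to exercise each kind of character.
sample = "Total: 3.50 (approx)"

literal  = sample[/Total/]          # literals match themselves
escaped  = sample[/3\.50/]          # \. matches only a real dot
wildcard = "3x50"[/3.50/]           # an unescaped . also accepts the "x"
strict   = "3x50"[/3\.50/]          # the escaped version does not, so nil
vowels   = sample[/[aeiou]+/]       # first run of one or more vowels
word     = sample[/\w+/]            # first run of word characters
blank_at = sample =~ /[[:blank:]]/  # index of the first space or tab

p literal, escaped, wildcard, strict, vowels, word, blank_at
```

Comparing `wildcard` with `strict` shows why escaping matters: forgetting the backslash quietly widens what the pattern accepts.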

Searching for Whether a Pattern Exists

Given some text to match on, say, some Markdown:

```ruby
markdown = <<-END
I have a graph showing incredible statistics.
You will be amazed by the clarity it brings!
Behold: ![Incredible Graph](/the_graph.png "Graph")
And yet another: ![Graph2](/other_graph.png)
END
```

We can use the =~, #match, or === methods to detect whether a PNG image is present:

```ruby
if /\w*\.png/ =~ markdown
  puts 'Found it using a squiggly.'
end
```

```ruby
if /\w*\.png/.match(markdown)
  puts 'Found a MATCH!'
end
```

```ruby
if /\w*\.png/ === markdown
  puts 'Found it with a looong equals sign!'
end
```

The above methods all achieve the same thing in a condition: a truthy value when the match succeeds and a falsy one when it fails (nil is falsy in Ruby).

Further, they are all defined on the Regexp class, and the / character delimits a literal RegExp. It’s just as valid to do:

```ruby
match = Regexp.new('\w*\.png').match(markdown)
```

Which returns…

```ruby
# => #<MatchData "the_graph.png">
```
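Although the three forms behave the same inside an `if`, their raw return values differ. The sketch below prints them side by side; the sample string and variable names are assumptions made for illustration.

```ruby
# Hypothetical sample; any string containing a .png name would do.
text    = "See ![Chart](/chart.png) for details"
pattern = /\w*\.png/

index = pattern =~ text      # Integer index of the match, or nil
data  = pattern.match(text)  # MatchData object, or nil
equal = pattern === text     # strictly true or false

p index    # 14
p data[0]  # "chart.png"
p equal    # true
p(pattern =~ "no images here")  # nil
```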

String Methods

The above examples all used methods defined on Regexp. However, it’s very interesting to note that =~ and #match are defined on String too (=== is the exception, since String#=== only tests for equality). This means you can switch around the order:

```ruby
if markdown =~ /\w*\.png/
  puts "Yup, it's a png alright."
end
```

```ruby
if markdown.match(/\w*\.png/)
  puts 'Yup, works that way too!'
end
```

In fact, as we’ll see, many String methods accept a RegExp.
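A small sketch of what happens when the operands are swapped, using a made-up file name as the sample. The last line is the one to watch: with the String on the left, === is String#===, which compares for equality rather than matching.

```ruby
# Hypothetical sample string and pattern.
text    = "logo.png"
pattern = /\w*\.png/

p(text =~ pattern)        # 0
p text.match(pattern)[0]  # "logo.png"
p(pattern === text)       # true
p(text === pattern)       # false: a String is never equal to a Regexp
```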

So which do I use?

As a matter of style, I like to use =~ when searching for text, since it looks more like an operator. You just need to remember the equal sign goes first, then the tilde.

The #match method is useful when you want more information, as it returns what was matched, including captured substrings.
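As an illustration of that extra information, here is a minimal sketch of the MatchData object that #match returns. The sample line and the capture groups (the parentheses in the pattern) are assumptions added for this example.

```ruby
# Hypothetical one-line sample modelled on the Markdown used in this post.
line = 'Behold: ![Incredible Graph](/the_graph.png "Graph")'

# Parentheses capture parts of the match: the base name and the extension.
m = /(\w*)\.(png)/.match(line)

p m[0]          # "the_graph.png" (the whole match)
p m[1]          # "the_graph"     (first capture)
p m.captures    # ["the_graph", "png"]
p m.begin(0)    # 29 (index where the match starts)
p m.pre_match   # text before the match
p m.post_match  # text after the match
```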

Finally, the triple equals is known as “case equality” since it is what Ruby calls in case expressions, so it’s useful when you have several matches:

```ruby
what_i_found = case markdown
               when /\w*\.png/ then 'A png.'
               when /jpg/      then 'A jpg.'
               else                 "Don't know."
               end
```
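The same idea can be wrapped in a method when several strings need classifying. This is only a sketch: `image_kind` is a hypothetical helper, and its patterns are assumptions rather than anything defined earlier in this post.

```ruby
# Hypothetical helper: classify a piece of text by the image type it mentions.
def image_kind(text)
  case text
  when /\.png\b/   then :png   # evaluated as /\.png\b/ === text
  when /\.jpe?g\b/ then :jpeg
  else                  :unknown
  end
end

p image_kind("![Graph](/the_graph.png)")  # :png
p image_kind("photo.jpeg")                # :jpeg
p image_kind("notes.txt")                 # :unknown
p image_kind("a.jpg and b.png")           # :png
```

The last call shows that branches are tried from top to bottom, so the first pattern that matches wins, regardless of where in the string it matched.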

It’s worth mentioning that the Ruby Style Guide recommends using [] for simple matches. It’s an alias of #slice and returns the matching string.

I think of it like a window looking into a part of the string.

```ruby
matched_string = markdown[/\w*\.png/]
if matched_string # nil when there is no match
  puts "Found #{matched_string} with square brackets."
end
```

Returns:

```ruby
# Found the_graph.png with square brackets.
# => nil
```
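Two details of the square-bracket form are worth a quick sketch: it returns nil (not an empty string) when nothing matches, and an optional second argument picks out a capture group. The sample text below is an assumption for illustration.

```ruby
# Hypothetical sample text.
text = "See /the_graph.png and /notes.txt"

p text[/\w*\.png/]       # "the_graph.png"
p text[/\w*\.gif/]       # nil, because there is no .gif in the text
p text[/(\w*)\.png/, 1]  # "the_graph" (first capture group only)

# A fallback for the no-match case, using || because nil is falsy.
name = text[/\w*\.gif/] || "(none)"
p name                   # "(none)"
```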

Indexing

We used =~ above to detect pattern matches in a string, but to be honest, it really returns the index within the string. It works like a boolean above since it returns nil when there’s no match, and the positional index within the string otherwise.

I like to think of it as “equals-squiggle” since the RegExp can indeed be a squiggly looking mess.

As mentioned above, we’ll get the index returned, so actually:

```ruby
pos = markdown =~ /\w*\.png/
```

Returns:

```ruby
# => 120
```

…however, if we actually want the index, then it would be more intention-revealing to use String#index:

```ruby
pos = markdown.index(/\w*\.png/)
```

Also returns:

```ruby
# => 120
```

We can reverse the process to see what’s there using [] with a range starting at pos 120:

```ruby
markdown[120..132]
```

Returns:

```ruby
# => "the_graph.png"
```
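String#index also accepts a starting offset, which gives one way to find a later occurrence. The sketch below is a made-up example (the sample text and variable names are assumptions); it uses \w+ rather than \w* so that a match cannot begin at the dot itself.

```ruby
# Hypothetical sample with two PNG names.
text    = "one.png and two.png"
pattern = /\w+\.png/

first       = text.index(pattern)  # index of the first match
first_match = text.match(pattern)  # MatchData, used to find where it ends

# Resume the search just past the end of the first match.
second = text.index(pattern, first_match.end(0))

p first                   # 0
p second                  # 12
p text[second..-1]        # "two.png"
p text.index(/\w+\.gif/)  # nil when the pattern is absent
```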

Search and Replace

As well as checking for whether a pattern exists, we can easily run a search-and-replace type operation using #sub or #gsub; they stand for substitute and global-substitute respectively.

```ruby
markdown.sub(/\.png/, '.jpg')
# => "I have a graph showing incredible statistics.
#     You will be amazed by the clarity it brings!
#     Behold: ![Incredible Graph](/the_graph.jpg \"Graph\")
#     And yet another: ![Graph2](/other_graph.png)"
```

If you look carefully, you’ll see in the above that only the first .png got replaced. This is where #gsub is more useful:

```ruby
markdown.gsub(/\.png/, '.jpg')
# => "I have a graph showing incredible statistics.
#     You will be amazed by the clarity it brings!
#     Behold: ![Incredible Graph](/the_graph.jpg \"Graph\")
#     And yet another: ![Graph2](/other_graph.jpg)"
```
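A compact sketch of the difference, on a made-up two-image string (the sample and variable names are assumptions). It also shows that both methods return a new string and leave the original alone, and that #gsub can take a block to compute each replacement.

```ruby
# Hypothetical sample text.
text = "a.png and b.png"

once = text.sub(/\.png/, '.jpg')   # only the first match is replaced
all  = text.gsub(/\.png/, '.jpg')  # every match is replaced

# With a block, each match is passed in and the block's result is used.
shouted = text.gsub(/\w+\.png/) { |name| name.upcase }

p once     # "a.jpg and b.png"
p all      # "a.jpg and b.jpg"
p shouted  # "A.PNG and B.PNG"
p text     # "a.png and b.png" (unchanged)
```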

Splitting Strings Up

In the following we use a regular expression to define the delimiter to split the string on:

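```ruby
markdown.split(/\W+/)
```

Returns:

```ruby
# => ["I", "have", "a", "graph", "showing", "incredible", "statistics",
#     "You", "will", "be", "amazed", "by", "the", "clarity", "it", "brings",
#     "Behold", "Incredible", "Graph", "the_graph", "png", "Graph",
#     "And", "yet", "another", "Graph2", "other_graph", "png"]
```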

Here \W (upper-case) means any non-word character, so \W+ treats each run of non-word characters as a delimiter, effectively pulling out the words.

Of course, this can be done using the inverse logic – and #scan:

```ruby
markdown.scan(/\w+/)
```
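The two approaches usually agree, but not always. The sketch below uses two made-up strings (assumptions for illustration) to show one difference: when the text starts with a delimiter, #split keeps a leading empty string while #scan does not.

```ruby
# Hypothetical samples.
plain   = "Behold: the_graph.png!"
wrapped = "(hello) world"

p plain.split(/\W+/)   # ["Behold", "the_graph", "png"]
p plain.scan(/\w+/)    # ["Behold", "the_graph", "png"]

p wrapped.split(/\W+/) # ["", "hello", "world"]
p wrapped.scan(/\w+/)  # ["hello", "world"]
```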

Another way of splitting strings is to partition on a split pattern. Here’s an example showing the use of #partition returning the pre-matched, matched and post-matched text:

```ruby
"Can you find _emphasis_ in your text?".partition(/_.+_/)
```

Returns:

```ruby
# => ["Can you find ", "_emphasis_", " in your text?"]
```
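For completeness, a short sketch of how #partition behaves at the edges, together with its right-to-left sibling String#rpartition. The key/value sample string is an assumption chosen so that the delimiter appears twice.

```ruby
# Hypothetical sample with two "=" delimiters.
text = "key=value=more"

p text.partition(/=/)   # ["key", "=", "value=more"]   (first match)
p text.rpartition(/=/)  # ["key=value", "=", "more"]   (last match)

# When nothing matches, the whole string is returned in one slot.
p text.partition(/:/)   # ["key=value=more", "", ""]
p text.rpartition(/:/)  # ["", "", "key=value=more"]
```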

Where to Go Next?

So far we’ve seen simple patterns used in various ways, including checking for existence, looking up positions, search-and-replace, and splitting strings up. The next post in this series will look at how we can extract a more complicated pattern-match, specifically, the image details.