While being recognised as powerful, compact, and expressive, Regular Expressions (or RegExps) also have a reputation for being notoriously hard for humans to parse. In fact, a great developer once said this about them:
shaving yaks in a rabbit hole
sum it up: "for your sanity, don't do regexp" :D
In this series of posts, sanity is preserved as we review by example. We’ll see how RegExps are particularly effective at finding patterns in text, along with some less well known tricks that can improve readability.
In this post we’ll focus on how they are particularly effective at finding matches in text, along with some details on how it all works in Ruby.
Kinds of Characters
There are various kinds of characters used in RegExp. Some common ones include:
- Literals: match the character in the target string.
- Escaping: `\` escapes a meta-character so it is matched as a literal.
- Wildcard: `.` means match any character.
- Character classes: a specific set of characters to match (in any order).
- Repetition: `+` means match one or more times.
A few examples of character classes are:
ruby
[aeiou] # Any vowel.
\w # Any word character.
[[:blank:]] # Space or tab.
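To make these kinds a little more concrete, here is a minimal sketch that tries each one out. The sample strings and variable names are made up for this illustration only.

```ruby
# Sample strings and variable names are invented purely for illustration.

# Literal: matches the exact characters.
literal = 'graph.png'[/graph/]

# Escaping: \. only matches a real dot...
escaped_dot = 'graph_png' =~ /graph\.png/
# ...whereas a bare . is the wildcard and matches the underscore too.
wildcard_dot = 'graph_png' =~ /graph.png/

# Character class: the first single vowel found.
first_vowel = 'graph'[/[aeiou]/]

# Repetition: one or more word characters in a row.
first_word = 'the_graph.png'[/\w+/]

puts literal.inspect       # "graph"
puts escaped_dot.inspect   # nil
puts wildcard_dot.inspect  # 0
puts first_vowel.inspect   # "a"
puts first_word.inspect    # "the_graph"
```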
Searching for Whether a Pattern Exists
Given some text to match on, say, some Markdown:
ruby
markdown = <<-END
I have a graph showing incredible statistics.
You will be amazed by the clarity it brings!
Behold: ![Incredible Graph](/the_graph.png "Graph")
And yet another:
![Graph2](/other_graph.png)
END
We can use the `=~`, `#match`, or `===` methods to detect if a `png` image is present:
ruby
if /\w*\.png/ =~ markdown
puts 'Found it using a squiggly.'
end
ruby
if /\w*\.png/.match(markdown)
puts 'Found a MATCH!'
end
ruby
if /\w*\.png/ === markdown
puts 'Found it with a looong equals sign!'
end
The above methods all achieve the same thing: a truthy value when the match is successful, and a falsy one otherwise (`nil` also evaluates to false in Ruby).
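Although all three work in an `if`, what they actually return differs. The following is a small sketch using a shortened, single-line version of the sample text (the variable names are just for illustration):

```ruby
# A shortened version of the sample text, just for this sketch.
text = 'Behold: ![Incredible Graph](/the_graph.png "Graph")'
pattern = /\w*\.png/

squiggle = pattern =~ text      # index of the match, or nil
matched  = pattern.match(text)  # a MatchData object, or nil
equal    = pattern === text     # true or false

puts squiggle.inspect  # 29
puts matched[0]        # the_graph.png
puts equal.inspect     # true
```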
Further, they are all defined on the `Regexp` class, and the `/` character delimits a literal RegExp. It’s just as valid to do:
```ruby
match = Regexp.new('\w*\.png').match(markdown)
# Which returns:
# => #<MatchData "the_graph.png">
```
String Methods
The above examples all used methods defined on `Regexp`. However, it’s interesting to note that `=~` and `#match` are defined on `String` too. (`String#===` exists as well, but it tests plain equality rather than matching a pattern.) This means you can switch around the order:
ruby
if markdown =~ /\w*\.png/
puts "Yup, it's a png alright."
end
ruby
if markdown.match(/\w*\.png/)
puts 'Yup, works that way too!'
end
In fact, as we’ll see, RegExps are accepted by many `String` methods.
So which do I use?
As a matter of style, I like to use `=~` when searching for text, since it looks more like an operator. You just need to remember the equal sign goes first, then the tilde.
The `#match` method is useful when you want more information, as it returns what was matched, including captured substrings.
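As a quick sketch of that extra information, here is what a `MatchData` object exposes for a shortened, single-line version of the sample text. The capture group (the parentheses) is added here for illustration and is not part of the patterns used elsewhere in this post.

```ruby
# A shortened version of the sample text, just for this sketch.
text = 'Behold: ![Incredible Graph](/the_graph.png "Graph")'

# The parentheses capture the file name without its extension.
match = text.match(/(\w*)\.png/)

whole   = match[0]          # the whole match
capture = match[1]          # the first captured substring
before  = match.pre_match   # everything before the match
after   = match.post_match  # everything after the match

puts whole    # the_graph.png
puts capture  # the_graph
puts before   # Behold: ![Incredible Graph](/
puts after    #  "Graph")
```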
Finally, the triple equals is known as “case equality” since it is what Ruby calls in case expressions, so it’s useful when you have several matches:
ruby
what_i_found = case markdown
when /\w*\.png/ then 'A png.'
when /jpg/ then 'A jpg.'
else "Don't know."
end
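One way to reuse a `case` like this is to wrap it in a method. The sketch below assumes a hypothetical helper called `image_kind`; under the hood each `when` clause calls `pattern === text`.

```ruby
# image_kind is a hypothetical helper name used only for this sketch.
def image_kind(text)
  case text
  when /\.png/   then 'A png.'   # evaluated as /\.png/ === text
  when /\.jpe?g/ then 'A jpg.'   # matches both .jpg and .jpeg
  else "Don't know."
  end
end

puts image_kind('![Graph2](/other_graph.png)')  # A png.
puts image_kind('photo.jpeg')                   # A jpg.
puts image_kind('notes.txt')                    # Don't know.
```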
It’s worth mentioning that the Ruby Style Guide recommends using `[]` for simple matches. It’s an alias of `#slice` and returns the matching string.
I think of it like a window looking into a part of the string.
```ruby
matched_string = markdown[/\w*\.png/]
# [] returns nil when nothing matches, so test for truthiness.
if matched_string
  puts "Found #{matched_string} with square brackets."
end
# Returns:
# Found the_graph.png with square brackets.
# => nil
```
Indexing
We used `=~` above to detect pattern matches in a string, but to be honest, it really returns the index within the string. It works like a boolean above since it returns `nil` when there’s no match, and the positional index within the string otherwise.
I like to think of it as “equals-squiggle” since the RegExp can indeed be a squiggly looking mess.
As mentioned above, the index is what gets returned, so:
```ruby
pos = markdown =~ /\w*\.png/
# Returns:
# => 120
```
…however, if we actually want the index, then it would be more intention-revealing to use `String#index`:
```ruby
pos = markdown.index(/\w*\.png/)
# Also returns:
# => 120
```
We can reverse the process to see what’s there using `[]` with a range starting at `pos` (`120`):
```ruby
markdown[120..132]
# Returns:
# => "the_graph.png"
```
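Hard-coding the range only works if we already know where the match ends. As a rough sketch, the same slice can be derived from the match itself with `MatchData#begin` and `MatchData#end`. This uses a shortened, single-line version of the sample text, so the indexes differ from those above.

```ruby
# A shortened version of the sample text, just for this sketch.
text = 'Behold: ![Incredible Graph](/the_graph.png "Graph")'

match = text.match(/\w*\.png/)
if match
  start  = match.begin(0)  # index of the first matched character
  finish = match.end(0)    # index just past the last matched character
  slice  = text[start...finish]  # three dots: the end index is excluded
  puts slice  # the_graph.png
end
```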
Search and Replace
As well as checking for whether a pattern exists, we can easily run a search-and-replace type operation using `#sub` or `#gsub`; they stand for substitute and global-substitute respectively.
ruby
markdown.sub(/\.png/, '.jpg')
=> "I have a graph showing incredible statistics.
You will be amazed by the clarity it brings!
Behold: ![Incredible Graph](/the_graph.jpg \"Graph\")
And yet another:
![Graph2](/other_graph.png)"
If you look carefully, you’ll see in the above that only the first `.png` got replaced. This is where `#gsub` is more useful:
ruby
markdown.gsub(/\.png/, '.jpg')
=> "I have a graph showing incredible statistics.
You will be amazed by the clarity it brings!
Behold: ![Incredible Graph](/the_graph.jpg \"Graph\")
And yet another:
![Graph2](/other_graph.jpg)"
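One detail worth seeing in isolation: both methods return a new string and leave the original untouched. A minimal sketch with a made-up sample string:

```ruby
# A made-up sample string, just for this sketch.
text = 'a.png and b.png'

first_only = text.sub(/\.png/, '.jpg')   # replaces the first match only
all        = text.gsub(/\.png/, '.jpg')  # replaces every match

puts first_only  # a.jpg and b.png
puts all         # a.jpg and b.jpg
puts text        # a.png and b.png (the original is unchanged)
```

If you do want to modify the string in place, Ruby also provides the bang variants `#sub!` and `#gsub!`.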
Splitting Strings Up
In the following we use a regular expression to define the delimiter to split the string on:
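```ruby
markdown.split(/\W+/)
# Returns:
# => ["I", "have", "a", "graph", "showing", "incredible", "statistics",
#     "You", "will", "be", "amazed", "by", "the", "clarity", "it", "brings",
#     "Behold", "Incredible", "Graph", "the_graph", "png", "Graph",
#     "And", "yet", "another", "Graph2", "other_graph", "png"]
```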
Here `\W` (upper-case) means any non-word character, so along with the `+` it uses runs of these as delimiters, effectively pulling out the words.
Of course, this can also be done using the inverse logic and `#scan`:
ruby
markdown.scan(/\w+/)
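The two approaches are not always interchangeable, though. Here is a small sketch with made-up sample strings: when the text starts with a delimiter, `#split` keeps a leading empty string, while `#scan` only returns the matches.

```ruby
# Made-up sample strings, just for this sketch.
starts_with_word  = 'Behold: ![Incredible Graph](/the_graph.png)'
starts_with_punct = '![Graph2](/other_graph.png)'

same_split = starts_with_word.split(/\W+/)
same_scan  = starts_with_word.scan(/\w+/)
puts same_split == same_scan  # true

diff_split = starts_with_punct.split(/\W+/)
diff_scan  = starts_with_punct.scan(/\w+/)
puts diff_split.inspect  # ["", "Graph2", "other_graph", "png"]
puts diff_scan.inspect   # ["Graph2", "other_graph", "png"]
```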
Another way of splitting strings is to partition on a split pattern. Here’s an example showing the use of `#partition` returning the pre-matched, matched and post-matched text:
```ruby
"Can you find *emphasis* in your text?".partition(/\*.+\*/)
# Returns:
# => ["Can you find ", "*emphasis*", " in your text?"]
```
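It can help to know what `#partition` does when the pattern is not found: the whole string comes back as the first element, followed by two empty strings. A minimal sketch, using made-up strings and a digit pattern chosen only for illustration:

```ruby
# Made-up sample strings, just for this sketch.
found     = 'Order 66 shipped'.partition(/\d+/)
not_found = 'No digits here'.partition(/\d+/)

puts found.inspect      # ["Order ", "66", " shipped"]
puts not_found.inspect  # ["No digits here", "", ""]

# Destructuring keeps the three parts readable.
before, matched, after = found
```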
Where to Go Next?
So far we’ve seen simple patterns used in various ways, including checking for existence, looking up positions, search-and-replace, and splitting strings up. The next post in this series will look at how we can extract a more complicated pattern match: specifically, the image details.