Making crawling and scraping websites slightly less painful with Anemone

By Glen Crawford

If you have ever had to scrape a website to harvest data (and I sincerely hope that you haven't), then you will know the pain of writing a script to trawl a site and parse flaky, inconsistent HTML to get to the data that you need. It can be frustrating, depressing, and painful, but luckily there are tools to make it slightly less unbearable.

One of those tools is Anemone. Anemone is a Ruby gem that lets you write scripts that crawl one or more websites, parse their pages, and harvest information. It takes care of the basics (making the requests, collecting the URLs on a page, following redirects, and so on) and gives you hooks into parts of the process so you can specify which links to follow, which pages to parse, and more.

There are three main areas of Anemone to explain: the options, the "verbs", and the page object. I'll explain them as I work through an example. The following code is a quick experiment using Anemone to crawl the reInteractive blog and count how many posts each author has written (conveniently forgetting that we already have an Atom feed for that).

Configuring the crawling behaviour

Start by adding the anemone gem to your project (place it into your Gemfile, or install it yourself and require 'anemone').
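Either approach is only a line or two (a minimal sketch):

# In your Gemfile:
gem 'anemone'

# Or, if installed standalone (gem install anemone), require it at the top of your script:
require 'anemone'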

Then invoke Anemone.crawl with the starting URL for your crawler. You can also pass in options to customise Anemone's behaviour; it comes with sensible defaults, which are defined in Anemone::Core::DEFAULT_OPTS.
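To give a flavour, here are a few of the options I find most useful, passed as a hash to Anemone.crawl (the values below are illustrative; check DEFAULT_OPTS for the full set):

Anemone.crawl("http://www.example.com/",
  :verbose => true,         # log each URL as it is fetched
  :depth_limit => 3,        # don't follow links more than 3 hops from the start URL
  :delay => 1,              # pause (in seconds) between requests, to be polite
  :obey_robots_txt => true  # skip URLs disallowed by the site's robots.txt
) do |anemone|
  # ...
end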

The focus_crawl method lets you pass in a block that selects which links on each page should be followed. page.links returns the URLs (taken from the href attribute of every <a> element) found on the page. In the example below I'm testing each link against two regular expressions, so that only blog navigation links (the "Newer" and "Older" links at the bottom of the pages) and links to blog posts are followed. Many URLs will of course pop up on multiple pages, but don't worry: Anemone will only crawl each URL once, no matter how many times it is found.
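The two regular expressions I match against (BLOG_NAVIGATION_URLS and BLOG_POST_URLS, used again in the full example further down) aren't shown in that example, so here is a rough sketch of what they might look like; the exact patterns depend on the blog's URL structure:

# Hypothetical patterns; adjust them to the actual URL structure of the site being crawled.
BLOG_NAVIGATION_URLS = /\/blog(\?page=\d+)?$/  # the "Newer"/"Older" pagination links
BLOG_POST_URLS       = /\/blog\/[\w-]+$/       # individual blog post pages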

The on_every_page method allows you to perform an action on each page that the crawler visits. The block you pass in has access to the page object (Anemone::Page) that represents the page. It gives you access to the page's body, the HTTP status code, referer, etc. In my example, if the page was returned with a 200 OK and is a blog post page, then I am passing the page into a method to process the blog post.
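A few of the attributes available on the page object (the names are from Anemone's API; the comments describe what each one gives you):

anemone.on_every_page do |page|
  page.url      # the URI of the page that was fetched
  page.code     # the HTTP status code, e.g. 200
  page.body     # the raw HTML of the response
  page.doc      # the body parsed into a Nokogiri::HTML::Document (see the next section)
  page.referer  # the URI of the page that linked to this one
  page.depth    # how many links were followed from the start URL to reach this page
end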

Anemone provides various methods to customise its crawling behaviour. As well as focus_crawl and on_every_page, there are after_crawl, on_pages_like, and skip_links_like (sketched briefly after the example below), but the first two are the ones that you will use most often.

Anemone.crawl("http://www.reinteractive.net/blog", :verbose => false, :depth_limit => 5) do |anemone|
  # Only follow blog pagination ("Newer"/"Older") links and links to individual posts.
  anemone.focus_crawl do |page|
    page.links.select { |link| link.to_s.match(BLOG_NAVIGATION_URLS) || link.to_s.match(BLOG_POST_URLS) }
  end

  # Process only pages that came back 200 OK and whose URL looks like a blog post.
  anemone.on_every_page do |page|
    process_blog_post(page) if page.code == 200 && page.url.to_s.match(BLOG_POST_URLS)
  end
end
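For reference, the other hooks mentioned above work along these lines (a sketch based on Anemone's API; the patterns and block bodies are purely illustrative):

Anemone.crawl("http://www.reinteractive.net/blog") do |anemone|
  # Never follow links whose URLs match any of these patterns.
  anemone.skip_links_like(/\/careers/, /\.pdf$/)

  # Like on_every_page, but only runs for pages whose URL matches a given pattern.
  anemone.on_pages_like(/\/blog\//) do |page|
    puts page.url
  end

  # Runs once, after the crawl has finished, with the collection of crawled pages.
  anemone.after_crawl do |pages|
    puts "Crawled #{pages.size} pages"
  end
end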

Parsing the page

Now that you have the page and you know that you want to parse it, you can start picking the values that you want out of the body of the document. The page object has an attribute called doc, which returns a Nokogiri::HTML::Document representing the page's body. You can search through the document for the values that you want by using the css and xpath methods. The Nokogiri site itself has a great tutorial on searching through an HTML document using Nokogiri.
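Both search methods are available on page.doc. For example, locating the post title (as in the code further down) with either a CSS selector or a roughly equivalent XPath expression:

# Searching with a CSS selector:
page.doc.css(".blog-header h1").text

# A roughly equivalent search using an XPath expression:
page.doc.xpath("//*[contains(@class, 'blog-header')]//h1").text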

In this example, I'm using CSS selectors to locate the values that I want: the blog post's title and author. I'm then building up a hash mapping each author to their posts, e.g. {"Glen Crawford" => ["Post #1", "Post #2"]}.

@authors_and_posts = {}

def process_blog_post(page)
  # Pull the post title and author out of the parsed document (a Nokogiri::HTML::Document).
  title = page.doc.css(".blog-header h1").text
  author = page.doc.css("meta[name='author']").attr("content").value

  # Group post titles by author, avoiding duplicates.
  @authors_and_posts[author] ||= []
  @authors_and_posts[author] << title unless @authors_and_posts[author].include?(title)
end
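Once Anemone.crawl returns, tallying the hash gives the per-author counts mentioned below (a minimal sketch):

# Print a post count per author, highest first.
@authors_and_posts.sort_by { |_author, posts| -posts.size }.each do |author, posts|
  puts "#{author}: #{posts.size} posts"
end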

Conclusion

And that's all there really is to it. We configured Anemone, gave it a starting URL, told it how to decide which links to follow, what to do with each matching page that it found, and then simply parsed out the values that we needed. This makes it easy to see that Chloé and Mikel are well in front in terms of blog posts published, with 41 and 22 respectively, with Leonard catching up with 9 posts.

Obviously, scraping a website like this to harvest data isn't ideal. It takes time to run and generates a lot of HTTP requests, it can be a pain to write the regular expressions and selectors, and most importantly, the website could be modified or rebuilt at any time, changing the URLs and HTML structure of its pages. If that happens, your crawler will likely break, and you will have to fix it or rewrite it from scratch.

In most cases, it is far better to pull the data from an API of some sort. But if the website that you want the data from doesn't have one, won't implement one, or its API doesn't give you all the data that you need, then you might have to turn to crawling and scraping the site. And if you have to do that, then Anemone is a great tool for making the process a whole lot less painful than it can be.
