Rescuing an Exploding Rails App with Skylight.io
Recently, a new client approached us with a performance problem on their existing Ruby on Rails application; they were experiencing massive growth with over 50,000 new users per day signing up, and their app was receiving over requests per second (and growing).
The rapidly increasing load was leading to big problems, with their existing Rails application experiencing frequent outages and causing sleepless nights for their team. They asked reinteractive to investigate and find out how we could get the app stable as fast as possible.
As time was of the essence, our first move was to install our performance tool of choice skylight.io (referral link, with a $50 credit). Within 15 minutes we already had enough data flowing into Skylight to use its helpful (and aptly named) Agony Index to get an idea of what was going on.
It was clear that a lot of performance could be gained by digging into the
DevicesController#index action. It had a 463ms “problem” response and was typically around 144ms. The key issue was this endpoint was being hit a LOT. It was the most hit action in the system getting close to a million requests per second, so any change we made to it would have a massive impact.
When we dug deeper, we learned that the main web servers were being hammered, causing load time to increase and further degrading performance. So getting this action performant would give a lot of wins across the whole system.
Opening up this action we found the following trace: Some things jumped out immediately. There was a large spread across the response time distribution, all the way up to. This indicated that the server was taking variable amounts of time to service the request, which usually points towards a loaded down server with not enough available capacity.
Looking at the trace, we could see that there was a SQL
select query to a
settings table, as well as three separate
select queries to the
devices table. Each of these queries were taking about 12ms to complete, so they were good targets for optimising.
Looking at the code, we found that there was a
Setting model and that a
Setting record was being called on each request to configure the app. After confirmation with the client it was found that this setting data barely ever changed (about once every few months) and so implementing this in a cache would be totally fine. Now, looking at this from a Skylight trace it is completely obvious to not make a DB request million times per second to see if any application wide configuration had happened to have changed, but this loading of settings data was buried within the
application_controller and pretty much everyone had forgotten it was being called. Seeing it bright as day in the trace though got everyone’s attention, and so this could be cached out instantly.
This was soon done and not only did it immediately drop the response time to the
DevicesController#index action by 12ms, but also it removed the SQL query for every other request landing on the app, saving 12ms by about 1.5 billion requests a month. This one small change saved over hundreds of HOURS of computing time a month! Not a bad saving from what worked out to be about 3 lines of code.
Next up were the three select queries against the Devices table.
The first call was loading up the device in question by ID. As Rails already by default creates a unique index on the devices table primary key, not much could be done about this.
But the other two calls were checking for uniqueness on a validation; that is, the
Device model had unique validation requirements on two attributes that looked something like this:
What this meant was that before the Rails app could save the devices model, it had to check to make sure that no other record in the table had either of these attributes, and the only way to do that was to fire two SQL queries before any create or update action that looked like:
SELECT 1 FROM devices WHERE devices.id = "?"
AND devices.first_attribute != "?"
SELECT 1 FROM devices WHERE devices.id = "?"
AND devices.second_attribute != "?"
These two queries had to traverse the
devices table and were taking another 12ms each, or about 24ms for the pair.
As there were already unique indices on the devices table for both the
first_attribute and the
second_attribute, we were able to make a change in the Rails code to only reverify the uniqueness of the attributes if the record had changed. So we made the following change to the code:
This removed the need to check on data updates where the attributes had not changed, since if the attributes were still the same value that was retrieved from the database, they were already guaranteed to be unique.
After making the above two changes, the new trace looked like this: Which showed a number of things that all pointed to success.
Firstly, the clustering of the response times were much tighter, indicating that the server had more breathing time to handle the requests (this was also in part to us helping the client boot up some more application servers) and secondly, the three select queries were no longer in the trace, saving 30-Matenia Rossidesms per request, and also reducing the load on the database by about 900 queries per second.
The client app is now stable and we are helping them move to our OpsCare Service over the coming months. If you have a Ruby on Rails application that needs some performance tuning, please get in touch.