Leveraging Caching and Amazon EC2 to scale from 100k to 6M visits per day

25 Feb


BTR Growth

Back in early 2012 we were looking at under 100K visitors per day and struggling with our infrastructure. Our databases were straining, our web servers were railing, and our all-important ad server was maxed out. We knew that caching was the only answer. We embarked on a multi-faceted attack to support 10x growth.

We thought about three levels of caching and how to achieve each:

  1. Data caching – caching data sets, even if only for a minute.
  2. .NET object caching.
  3. Web page HTML caching.

Data Caching – The fastest database query is the one that doesn’t have to happen.

For data caching, we considered the holy grail of caching layers, where every request reads and writes to the cache first and only hits the DB second, via the middleware. However, we didn't have the resources or the time to go down that path yet. We wound up using three different data caching mechanisms, each best suited to its use case.

The first data caching layer we used was within .NET itself: .NET data caching. We decided to use this for smaller data sets that rarely change, such as configuration data and lookup tables that required expiration only when the data changed. We wanted to limit this because with .NET caching, the same data is stored in memory for every process on every server. We run three processes in a web garden on each web server, so we would be caching the same data 30+ times. For data that changed rarely and was accessed often, this made sense.
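The .NET cache API looks different, but the underlying pattern — cache-aside with an expiration — can be sketched roughly like this (shown in JavaScript for illustration; the class and method names are ours, not from the original code):

```javascript
// Minimal in-process TTL cache, illustrating the cache-aside pattern
// used for rarely-changing lookup data. All names are illustrative.
class TtlCache {
  constructor() {
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  // Return the cached value, or compute it via loader() and cache it
  // for ttlMs milliseconds. Every process holds its own copy, which is
  // why this only makes sense for small, rarely-changing data sets.
  getOrAdd(key, ttlMs, loader) {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.value;
    const value = loader(); // cache miss: hit the database once
    this.entries.set(key, { value, expiresAt: Date.now() + ttlMs });
    return value;
  }
}
```

The point is that a lookup table gets loaded from the DB once per TTL per process instead of once per request — cheap for small data, wasteful if duplicated 30+ times for large sets.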

For more set-based data caching we turned to Redis as a caching data store and PostSharp as the mechanism to instantiate it. Initially we used ServiceStack as the Redis client but were not happy with the performance; when we dug into it, we noticed that ServiceStack opened and closed a connection on every access, which wasn't very efficient. We switched to Booksleeve, and that worked well for a while, but we had trouble deserializing nested JSON objects. Finally, we turned to protobuf, which allowed for binary as well as text serialization; binary was even more performant since it made the payloads to and from Redis that much smaller.
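PostSharp weaves the caching in as a compile-time aspect; the net effect is roughly that of wrapping each data-access function, as in this sketch (JavaScript for illustration, with a plain object standing in for Redis and JSON standing in for the protobuf serializer — all names are ours):

```javascript
// Sketch of aspect-style caching: wrap a data-access function so callers
// never talk to the cache directly. `store` stands in for Redis here,
// and JSON.stringify stands in for the (smaller) binary serializer.
function withCache(store, keyPrefix, fn) {
  return function (...args) {
    const key = keyPrefix + ':' + JSON.stringify(args);
    if (key in store) {
      // Cache hit: deserialize and return without touching the DB.
      return JSON.parse(store[key]);
    }
    const result = fn(...args); // miss: run the real query
    store[key] = JSON.stringify(result);
    return result;
  };
}
```

Usage would look like wrapping a repository method once at startup, so the rest of the code calls it with no knowledge that a cache exists — which is exactly the transparency the aspect approach buys.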

We had our home-grown caching layer, but soon found another method of data caching that was super simple to implement and met a need our other mechanisms did not. We found a product called ScaleArc that sits between the application and the database as a proxy/cache layer. To use it, you just point your connection string at it and configure your existing DBs as the origins. All requests are then reported in nice analytics: you can see which queries are the most expensive and set up ScaleArc to cache the result set for that type of request. It gives great stats on DB usage and cache hit ratios.

We first used it to help scale our third-party ad server, since we didn't have access to its code. The ad server had a horrible SQL Server bottleneck; we were railing it constantly, which cost us potential revenue. We upgraded the hardware and soon railed that server too. Since it was a third-party application, we could not change the app to improve it. ScaleArc let us add a data caching layer without touching the application. This gave us a lot more room for growth and paid for itself many times over.

One of our big sources of revenue was advertising. We did a lot of keyword buying to drive traffic to the site, and in turn made money on the ads on the pages those users visited. These visitors were not the same kind of engaged users as our registered members, so we wanted to segment this traffic so it did not affect our normal users. We leveraged Amazon EC2 instances to serve it. Since our DB servers sat in our colocation space, there was a lag accessing data from EC2 back to our datacenter. We put ScaleArc in place for the Amazon instances serving our purchased traffic and implemented heavy caching. Since most marketing-driven traffic lands initially on a set of landing pages, the DB caching had a high cache hit rate and has allowed us to scale that business tremendously.

Output caching – Good for some applications.

Data caching can only take you so far. Better yet is to avoid needing any data access at all. To this end we started caching components of a page. We mainly cached some slowly changing user controls, the header, the footer, and a handful of fairly static pages.

When we cached user controls, we quickly learned that referencing them on subsequent page loads becomes problematic. Since the control is served from cache, any reference to it returns null; we could not access the control, so we had to put in some error trapping. The downside of output caching is that there is still work being done server side, even though it is greatly reduced, and it forces you to develop with specific patterns that expect controls not to be referenceable in your code.
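The resulting coding pattern — never assume a cached control reference is non-null — looks roughly like this (sketched in JavaScript; the original was C# user controls, and the function and field names are ours):

```javascript
// When a fragment is served from the output cache, the server-side
// control object is never instantiated, so references to it are null.
// Every touch point has to guard for that case instead of assuming
// the control exists.
function setWelcomeText(headerControl, userName) {
  if (headerControl == null) {
    // Cached render: the control does not exist this request,
    // so there is nothing to update. Skip silently.
    return null;
  }
  headerControl.text = 'Welcome, ' + userName;
  return headerControl.text;
}
```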

Web Page HTML caching – Edge caching for the win!

Things were getting better, but to get 10–20x growth we needed to take another caching step. The best way to scale your infrastructure is to not have traffic hit it at all. CDN caching was the next important step for us. We had been using CDNs for our audio files, JavaScript, CSS and images for a while, but now we set our sights on caching full pages.

When we went to the Velocity Conference for the first time a couple of years ago, we learned about a product called aiCache. It is a caching appliance with a lot of flexibility: it can maintain different caches based on cookie, browser, etc.

This worked well for us because most of our traffic was guests who were not logged in. Based on the cookie, the load balancer could direct traffic to aiCache if there was no authentication cookie, or directly to our origin servers if the user was logged in. This way we didn't have to change the application to show personalized information.
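The routing rule at the load balancer boils down to a single cookie check, roughly like this sketch (the cookie name `auth` is an assumption, not the real one):

```javascript
// Route a request based on the presence of an authentication cookie.
// Anonymous traffic goes to the aiCache tier; logged-in users go
// straight to origin so they see personalized pages.
// The cookie name 'auth' is a hypothetical stand-in.
function routeRequest(cookieHeader) {
  const cookies = (cookieHeader || '').split(';').map(c => c.trim());
  const loggedIn = cookies.some(c => c.startsWith('auth='));
  return loggedIn ? 'origin' : 'aicache';
}
```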

As we continued to grow we started seeing bottlenecks with logged in traffic and the aiCache infrastructure. So we wanted to add a layer of edge caching for our high traffic pages.

We had been using Limelight as our CDN for images, but the performance wasn't fantastic. After attending Velocity in California, we learned about TCP/IP overloading and looked for a CDN vendor that did it. The overloading is much faster than standard TCP/IP because it sends packets without waiting for an ACK before sending the next one. We saw times of 15ms from Cotendo vs. 100ms from Limelight for the same cached content. We went with Cotendo, which was recently bought by Akamai. (Boo.)

The key to successful CDN edge caching of our pages was twofold:

1) Proper configuration – setting TTLs, deciding whether to include the querystring, etc.

2) Client-side customizations.

Configuration required understanding the data and how often it changed. The structure of the URLs, and knowing which requests couldn't be cached, was crucial. Many Ajax requests and some pages needed to bypass the cache and go directly to the origin servers. We had to carefully set TTLs and identify when the querystring should be part of the cache key and when it shouldn't.
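Those configuration decisions amount to rules like the following sketch (all paths and rules here are made-up examples to show the shape of the decision, not our real CDN config):

```javascript
// Sketch of CDN cache-key rules: some paths bypass the cache entirely,
// and the querystring is only part of the key where it changes content.
// Every path and rule below is a hypothetical example.
const rules = [
  { prefix: '/ajax/', cache: false },                 // always go to origin
  { prefix: '/show/', cache: true, useQuery: true },  // ?page=2 is distinct content
  { prefix: '/', cache: true, useQuery: false },      // ignore tracking params
];

// Returns the cache key for a request, or null to bypass the cache.
function cacheKeyFor(path, query) {
  const rule = rules.find(r => path.startsWith(r.prefix));
  if (!rule || !rule.cache) return null;
  return rule.useQuery && query ? path + '?' + query : path;
}
```

Getting `useQuery` wrong in either direction hurts: including it needlessly fragments the cache across tracking parameters, while excluding it where it matters serves page 1 to everyone.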

Most importantly, our pages had customizations based on who was logged in: user-driven menus that changed to show the user's avatar, premium level, etc. In order to properly cache pages at the edge, we had to cache a generic version for a non-logged-in user and then apply the customizations client side. The complexity is that you don't want the user to see the generic page and then watch the customization snap in; we want the page to render cleanly for the user.

To do this, we knew it had to be done in the header with as little required code as possible. If we waited for jQuery to load, or made an Ajax call to fetch the information, it would slow down the page or cause that snap-in. So the lesser of the evils was to store the minimum needed for the header customization in a cookie and inline the small JS snippet that generates the header HTML customizations on the pages.
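A minimal sketch of that inline snippet: parse the small cookie and build the header HTML with no jQuery and no Ajax (the cookie format and field names here are assumptions for illustration):

```javascript
// Inline header personalization: read a small cookie written at login
// (format assumed here to be 'name|avatarUrl|level') and produce the
// header markup before any framework has loaded.
function parseUserCookie(cookieValue) {
  if (!cookieValue) return null; // anonymous: keep the generic header
  const [name, avatar, level] = cookieValue.split('|');
  return { name, avatar, level };
}

function headerHtml(user) {
  if (!user) return '<a href="/login">Log in</a>';
  return '<img src="' + user.avatar + '"> ' + user.name +
         ' (' + user.level + ')';
}
```

Because this runs synchronously in the head, the personalized header is in place on first paint, so the edge-cached generic page never visibly "snaps" into the customized one.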

An interesting management challenge with edge caching most pages was the paradigm shift required of our developers. We had a lot of old page-customization code that had to be pulled out of the back-end C# and moved to front-end JavaScript. User-based analytics and the display of page partials based on the user's history and level all had to move from C# to JavaScript. Developers needed to switch to thinking of pages as generic for all users. Even months after we implemented this pattern, we were still catching mistakes in code reviews where developers were doing server-side customizations rather than client-side ones.

With this in place we were free to launch CDN edge caching of HTML, and our logged-in users saw what they expected to see. All was good.

Because of the increased cookie payload, we made sure to move most of our static-file HTTP requests to cookieless domains so they wouldn't add a lot more overhead to the requests.

Adding all of this caching enabled us to grow our traffic over 10x and still have plenty of room to grow another 10x.

