Tuesday, January 14, 2014

Scaling a Web App - my experience

At a previous company I worked for we had 30 million users with over 2.5 million uniques a month and 1 million downloads a day. Not a massive scale but big enough - mid-level web scale? Anyway, it is a read-intensive website comprised of PHP, Apache, many back-end services (in PHP and Java), memcached, ActiveMQ and MySQL.This is short blog post about some of the ways in which scaling was handled.

Load-Balancing
  •           F5 load balancer splitting traffic around 30 web servers.
  •           Nothing stored in session anywhere allowing horizontal scaling at the app level.

      Caching
  •          20 memcached servers (with 32GB RAM each) split into different clusters (web, mobile etc.) caching result of anything that was slow e.g. database queries or web service calls. Generally split into clusters of 4 servers giving >100GB memory for caching.
  •           We also cached generated HTML (non-dynamic HTML like headers and footers) on file system on each web server.
  •           We cached nothing in memory on the web servers so the memory is only used by the web app and not taken up by cache.
  •           We used consistent hashing algorithm for memcached to make cache server losses or additions less impactful.
  •           Cache control headers set to allow browser caching for certain service calls and requests.

-          Database
  •       We used a Master/Slave architecture (MySQL) where Master is a single DB you write to and is then replicated to multiple Slaves for reading. Lag was normally less than 100ms. Works well for read-intensive services.
  •           We also used sharding when writes became too numerous in certain cases essentially taking that Master/Slave model and replicating it to N shards – you know which master to write to based on user_id e.g. users 0 – 5m to one server, 5m – 10m to another etc. Each shard kept a similar size dataset making queries predictable. This is a pain in MySQL 5.1.
  •           There was also use of the new MySQL NDB (or MySQL Cluster) which has auto-sharding and other big-scale goodness
  •           DBAs code reviewed all stored procedures and gave feedback RE: indexing and optimisations
Message System
  •           We used JMS on ActiveMQ for firing off events for Business Intelligence and other consumers (100s of events a second).
Static Assets
  •           These were all served from lighttpd servers and not from the web servers themselves.
  •           These were also then pushed out to a Content Delivery Network (Akamai) to make sure they are served as close to the requester as possible.
Offline Processing
  •           We tended towards crons and daemons for any processing that could be done offline and not slow down a web request e.g. daemon to pick up new stuff in a table and fire off web requests per row to auditing service rather than calling the audit service as part of the original web request.
Deployments
  •           Dev, QA, Integration, Staging and Production deployments mostly driven through Jenkins. Staging and Production were near-identical environments - firewall rules etc. to reduce surprises and rollbacks when deploying to prod
  •      Half the production cluster is taken out of rotation from the load balancer and the code is deployed and smoke tested there first, then the cluster is flipped and deployed to other half. The idea is to have no downtime.
Downloads
  • Served from dedicated download servers and not from web cluster

    Load Testing
  •       We had multiple JMeter servers for load testing the websites and services. We did this testing on lower spec environments than production and our thought process was "if it is good enough there, it will be at least as good on production". This mantra held true. 

      Overall a pretty solid architecture and we worked hand-in-hand with a brilliant Ops team (sys admins, DBAs, NOC) all the time on this. It was not without problems and the odd firefight but it was pretty good. We never properly addressed the dog-pile effect when cache was flushed and the databases were overloaded but that was a problem with the way memcached was (ab)used early on. 

1 comment: