Friday, May 9, 2014

Is TDD dead?

Was it ever alive? Who actually does/did TDD exactly as prescribed?
  • Write a test before you even write the class or method under test
  • Run the tests; the new test fails
  • Do enough, and only enough, to make the test pass
  • Run the tests; they all pass
  • Write more tests and continue to improve the design of the code (a minimal sketch of one such red-green cycle follows this list)
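
For anyone who has never seen the cycle in practice, here is a minimal sketch of one red-green iteration in Python (the slugify function and the test are mine, purely for illustration):

import unittest

# Steps 1-2: write the test first. At this point slugify() does not exist,
# so running the suite fails with a NameError - the "red" step.
class TestSlugify(unittest.TestCase):
    def test_replaces_spaces_with_hyphens(self):
        self.assertEqual(slugify("Hello World"), "hello-world")

# Steps 3-4: write just enough code to make the test pass - the "green" step.
def slugify(text):
    return text.lower().replace(" ", "-")

if __name__ == "__main__":
    unittest.main()

From there you keep alternating: another failing test, another small change, a refactor.
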
Not one developer I have ever worked with in any team in 10 years truly followed TDD as prescribed. I tried it for a few months and, while it never slowed me down too much (a common complaint from people who have never attempted it), I felt it resulted in absolutely no benefit. I also felt I was doing something wrong whenever I wrote code without a test first: I gave out to myself, deleted the code and started again. I think I write pretty good code, and I know what bad code looks like.

Don't get me wrong. I write tests and almost every developer in every team I have worked in wrote unit tests. We typically did this as we wrote the code, probably towards the end after writing some feature and verifying it through a browser manually. The unit tests would check expected output and some boundary conditions not likely to show up in a quick check in the browser. All good? Code review, check-in, verify CI build runs and all tests pass. The key is that "all tests pass" includes unit tests and integration tests. 

I never experienced the promised enlightenment of better design through TDD. Honestly, I did not. I like unit tests, and well-written tests that run all the time help me refactor later and have spotted bugs being introduced very early. But writing tests first, before the code and only in baby steps, seems (and always has seemed to me) to be overkill. It could be a useful training exercise when mentoring junior developers, but for experienced developers and teams there is likely to be more value in writing the code, adding some unit tests for sure, building out a continuous integration environment, and having integration tests - using servers, VMs, databases, browsers and whatever else - that run any time code is committed and check user-focused behaviour.

One thing I have realised is that having a Test Engineer building automated integration tests can be frustrating while development is churning and APIs might be changing. So we came up with a rule recently in a team I worked in: if a developer writes some code that breaks the automation tests (developers can run these locally before committing; it takes about 1 minute for 200+ tests), the dev fixes the integration test and sends a pull request to the Test Engineer. This frees the Test Engineer to work on new integration tests and to maintain the test and integration environment. We also have tasks that generally run no longer than 2 days, so a developer will code away for no more than 2 days, then add some tests, check the integration tests, commit and move on. Not perfect, but it works great: we ship regularly and have a very low bug count. People over process, a team figuring out what works best for the team. That is the true spirit of agile, right?

Thursday, January 16, 2014

Python **kwargs appear immutable in method

This is an interesting one.

You can pass an arbitrary number of keyword arguments to a method in Python using **kwargs.
This is a dictionary and you can iterate over items(), keys() and values() and do all the usual dict stuff.
However, if you, say, delete an item from the dictionary within the method, it does not change the dict that was passed in.

First, an example with an ordinary dict passed in directly.


# Python 2 here (print statement). Note that in Python 2 dict.items()
# returns a list, so deleting keys while looping over it is safe.
def filter_args(kwargs):
    for k, v in kwargs.items():
        if v > 10:
            del kwargs[k]

args = {"a": 12, "b": 2}
filter_args(args)  # the function receives a reference to this same dict
print args


This prints the following

{'b': 2}



Now the same example, but with the dict unpacked via ** at the call site.


def filter_args(**kwargs):
    # kwargs is a brand-new dict built from the unpacked keyword arguments
    for k, v in kwargs.items():
        if v > 10:
            del kwargs[k]

args = {"a": 12, "b": 2}
filter_args(**args)  # ** unpacks args into individual keyword arguments
print args


This prints

{'a': 12, 'b': 2}



It only changed the dict inside the function. The reason is that **args at the call site unpacks the dictionary into individual keyword arguments, and Python packs those back into a brand-new dict for kwargs, so deleting from kwargs never touches the dict the caller holds.
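
A tiny way to check that (Python 2 syntax to match the examples above; the function name is just for illustration):

def show_identity(**kwargs):
    print id(kwargs)

args = {"a": 12, "b": 2}
print id(args)         # the caller's dict
show_identity(**args)  # prints a different id - a fresh dict built from the unpacked arguments

So if you want a function to modify the caller's dict, pass the dict itself, as in the first example, rather than unpacking it.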




Tuesday, January 14, 2014

Scaling a Web App - my experience

At a previous company I worked for we had 30 million users, with over 2.5 million uniques a month and 1 million downloads a day. Not a massive scale, but big enough - mid-level web scale? Anyway, it was a read-intensive website made up of PHP, Apache, many back-end services (in PHP and Java), memcached, ActiveMQ and MySQL. This is a short blog post about some of the ways in which scaling was handled.

Load-Balancing
  • F5 load balancer splitting traffic across around 30 web servers.
  • Nothing stored in session anywhere, allowing horizontal scaling at the app level.

Caching
  • 20 memcached servers (with 32GB RAM each) split into different clusters (web, mobile etc.), caching the result of anything that was slow, e.g. database queries or web service calls (a rough cache-aside sketch follows this list). Generally split into clusters of 4 servers, giving >100GB of memory per cluster for caching.
  • We also cached generated HTML (non-dynamic HTML like headers and footers) on the file system of each web server.
  • We cached nothing in memory on the web servers themselves, so that memory was used only by the web app and not taken up by cache.
  • We used a consistent hashing algorithm for memcached to make cache server losses or additions less impactful.
  • Cache-Control headers were set to allow browser caching for certain service calls and requests.
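
As a rough sketch of that cache-aside pattern (the real code was PHP/Java; here it is in Python with the python-memcached client, and the key name, TTL and query function are made up for the example):

import memcache

# One client spread over a cluster of memcached servers. The real setup used
# consistent (ketama-style) hashing so that losing or adding a server only
# remapped a fraction of the keys.
cache = memcache.Client(["cache1:11211", "cache2:11211", "cache3:11211", "cache4:11211"])

def get_user_profile(user_id):
    key = "user_profile:%d" % user_id
    profile = cache.get(key)
    if profile is None:
        profile = run_slow_database_query(user_id)  # hypothetical expensive call
        cache.set(key, profile, time=300)           # cache the result for 5 minutes
    return profile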

Database
  • We used a Master/Slave architecture (MySQL) where the Master is a single DB you write to, which is then replicated to multiple Slaves for reading. Replication lag was normally less than 100ms. This works well for read-intensive services.
  • We also used sharding when writes became too numerous in certain cases, essentially taking that Master/Slave model and replicating it across N shards - you know which master to write to based on user_id, e.g. users 0-5m on one server, 5m-10m on another, etc. (a tiny routing sketch follows this list). Each shard kept a similar-sized dataset, making queries predictable. This is a pain to manage in MySQL 5.1.
  • There was also use of the then-new MySQL NDB (MySQL Cluster), which has auto-sharding and other big-scale goodness.
  • DBAs code reviewed all stored procedures and gave feedback on indexing and optimisations.
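
The routing sketch mentioned above: which master to write to is a pure function of user_id, so every part of the system agrees on where a user lives (the shard size matches the ranges in the list, the host names are invented):

SHARD_SIZE = 5000000  # 5 million users per shard, as in the ranges above
SHARD_MASTERS = ["db-master-0", "db-master-1", "db-master-2"]  # hypothetical hosts

def master_for_user(user_id):
    # users 0-5m go to shard 0, 5m-10m to shard 1, and so on
    return SHARD_MASTERS[user_id // SHARD_SIZE]

master_for_user(7000000)  # -> "db-master-1"
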
Message System
  • We used JMS on ActiveMQ for firing off events for Business Intelligence and other consumers (hundreds of events a second); a rough publishing sketch follows.
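
The producers were PHP and Java talking JMS, but to give a feel for what "firing off an event" looks like, here is a sketch using the stomp.py client against ActiveMQ's STOMP listener (host, port, destination and payload are all assumptions for the example):

import json
import stomp

# ActiveMQ also speaks STOMP (port 61613 by default), which is the simplest
# way to publish to it from a script.
conn = stomp.Connection([("activemq-host", 61613)])
conn.connect(wait=True)

event = {"type": "download_completed", "user_id": 12345}
conn.send(destination="/topic/bi.events", body=json.dumps(event))
conn.disconnect()
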
Static Assets
  • These were all served from lighttpd servers and not from the web servers themselves.
  • These were also then pushed out to a Content Delivery Network (Akamai) to make sure they are served as close to the requester as possible.
Offline Processing
  • We tended towards crons and daemons for any processing that could be done offline without slowing down a web request, e.g. a daemon that picks up new rows in a table and fires off a web request per row to the auditing service, rather than calling the audit service as part of the original web request (sketched below).
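
The shape of such a daemon, sketched in Python - fetch_unprocessed_rows() and mark_processed() are made-up database helpers, and the requests library stands in for whatever HTTP client the real daemon used:

import time
import requests

AUDIT_SERVICE_URL = "http://audit-service.internal/audit"  # hypothetical endpoint

def run_audit_daemon():
    # Poll for new rows, push each one to the audit service, then mark it done.
    # The web request that created the row never waits on any of this.
    while True:
        for row in fetch_unprocessed_rows():   # hypothetical DB helper
            requests.post(AUDIT_SERVICE_URL, json=row)
            mark_processed(row["id"])          # hypothetical DB helper
        time.sleep(5)
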
Deployments
  • Dev, QA, Integration, Staging and Production deployments were mostly driven through Jenkins. Staging and Production were near-identical environments - firewall rules etc. - to reduce surprises and rollbacks when deploying to prod.
  • Half the production cluster is taken out of rotation at the load balancer, the code is deployed and smoke tested there first, then the cluster is flipped and the code is deployed to the other half; the idea is to have no downtime (a rough orchestration sketch follows).
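
A rough orchestration sketch of that flip - every helper here is hypothetical (the real pipeline was a set of Jenkins jobs); it is only meant to show why there is no downtime:

def rolling_deploy(hosts, version):
    # Only one half of the cluster is ever out of the load balancer,
    # so the other half keeps serving traffic throughout.
    half_a, half_b = hosts[:len(hosts) // 2], hosts[len(hosts) // 2:]
    for half in (half_a, half_b):
        for host in half:
            remove_from_load_balancer(host)  # hypothetical LB API call
            deploy_version(host, version)    # hypothetical deploy step
        if not smoke_test(half):             # hypothetical smoke test suite
            raise RuntimeError("Smoke test failed - aborting before the flip")
        for host in half:
            add_to_load_balancer(host)
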
Downloads
  • Served from dedicated download servers and not from the web cluster.

Load Testing
  • We had multiple JMeter servers for load testing the websites and services. We did this testing on lower-spec environments than production and our thought process was "if it is good enough there, it will be at least as good on production". This mantra held true.

Overall a pretty solid architecture, and we worked hand-in-hand with a brilliant Ops team (sys admins, DBAs, NOC) all the time on this. It was not without problems and the odd firefight, but it was pretty good. We never properly addressed the dog-pile effect, where a cache flush left the databases overloaded, but that was a problem with the way memcached was (ab)used early on.