There was a great article in Availability Digest last month on whether the cost of converting to Active/Active is worth it. Well worth reading.
Active/Active is an architecture in which two nodes both process transactions and keep each other up to date. Well, that’s the theory anyway. A device or user sends a “transaction” to one of the nodes and, if it doesn’t get a response, sends the request again to the other node, which is ideally situated in a different geographical location.
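To make the resend behaviour concrete, here is a minimal sketch in Python. The names (NodeDown, send_with_failover, node_1, node_2) are made up for illustration and the "nodes" are plain functions, so treat it as a toy model of the idea rather than how any real product does it.

```python
class NodeDown(Exception):
    """Hypothetical error: the node did not answer within the timeout."""


def send_with_failover(request, nodes):
    """Try each node in order and return (name, response) from the first that answers."""
    for name, handler in nodes:
        try:
            return name, handler(request)
        except NodeDown:
            continue  # no response: resend the same request to the next node
    raise NodeDown("no node responded")


def node_1(request):
    # Assumption for the demo: NODE-1 never answers.
    raise NodeDown("NODE-1 timed out")


def node_2(request):
    return "OK: " + request


result = send_with_failover("withdraw 100", [("NODE-1", node_1), ("NODE-2", node_2)])
print(result)  # ('NODE-2', 'OK: withdraw 100')
```

Note that the sender cannot tell whether NODE-1 never saw the request or processed it and lost the reply, which is exactly where the trouble starts.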
The author has done a good job of discussing the costs associated with altering code, etc., to make it all work. The interesting phrase, of course, is “…virtually eliminating…downtime”.
If the nodes contain any form of database (e.g. an account balance in a credit card or ATM type system), then it is possible for the balance on one node to differ from the balance on the other. The challenge is then working out which one is correct. This is called “split brain syndrome”. It’s a ‘mare’ to put right.
Imagine you go to a cash machine (ATM) to withdraw some money. The request fires off to NODE-1. NODE-1 updates your account balance, then the instruction goes back to the ATM to dispense your money. On the way there is a failure. Now, the ATM knows it hasn’t got a response, so (after a suitable timeout) it sends the same request to NODE-2.
At this stage no one knows whether NODE-2 has the old or the new balance. If you originally requested £100 and you had a balance of £1000, then the new balance should be £900. Suppose NODE-2 already has the new balance: the ATM requests £100 again, so you now have a balance of £800 but you have only got £100 out! Think what happens if your balance was only £100 in the first place: the first request takes it to zero, the retry is declined, and you get no money!
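The arithmetic of that double debit can be played through in a few lines of Python. This is only a toy simulation: the Node class, atm_withdraw and run are invented for the example, and it assumes NODE-2 has already received the new balance when the retry arrives.

```python
class Node:
    """Toy node holding one account balance and a link to its peer."""

    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.peer = None

    def withdraw(self, amount, reply_lost=False):
        if amount > self.balance:
            return "DECLINED"
        self.balance -= amount
        self.peer.balance = self.balance  # assumption: replication succeeds
        if reply_lost:
            return None  # debit applied, but the reply never reaches the ATM
        return "DISPENSE"


def atm_withdraw(amount, primary, secondary, reply_lost):
    """Send the request to the primary; on no reply, resend it to the secondary."""
    reply = primary.withdraw(amount, reply_lost)
    if reply is None:
        reply = secondary.withdraw(amount)
    return reply


def run(opening_balance):
    node_1 = Node("NODE-1", opening_balance)
    node_2 = Node("NODE-2", opening_balance)
    node_1.peer, node_2.peer = node_2, node_1
    reply = atm_withdraw(100, node_1, node_2, reply_lost=True)
    return reply, node_2.balance


print(run(1000))  # ('DISPENSE', 800): £100 in hand, £200 debited
print(run(100))   # ('DECLINED', 0): balance gone, no cash
```

Real systems have checks that this sketch leaves out, but it shows why a blind resend of the same request is dangerous when the first one may already have been applied.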
The alternative scenario is that NODE-1 lets the ATM give you the money and then dies before NODE-2 is updated. If you go to the ATM again, it fires a request off to NODE-2, where the original account balance of £1000 is still there. You can then get a second lot of £100 out (so now you have £200), but the account balance has only been debited by £100. Bingo!
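The same kind of toy model shows this second case. Again, the Node class is hypothetical and the failure is forced by hand (NODE-1 is marked as dead before anything is copied to NODE-2), so this is a sketch of the scenario and nothing more.

```python
class Node:
    """Toy node with its own copy of the balance and no replication."""

    def __init__(self, balance):
        self.balance = balance
        self.alive = True

    def withdraw(self, amount):
        if not self.alive:
            raise ConnectionError("node is down")
        if amount > self.balance:
            return 0  # declined: nothing dispensed
        self.balance -= amount
        return amount  # cash dispensed


node_1 = Node(1000)
node_2 = Node(1000)

cash = node_1.withdraw(100)  # first visit: NODE-1 debits and you get £100
node_1.alive = False         # assumption: NODE-1 dies before updating NODE-2

try:
    cash += node_1.withdraw(100)  # second visit: NODE-1 no longer answers
except ConnectionError:
    cash += node_2.withdraw(100)  # NODE-2 still sees the original £1000

print(cash, node_2.balance)  # 200 900
```

You walk away with £200 while the only surviving copy of the balance says £900.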
In the example above, I concede that I am oversimplifying what happens and the checks and balances that go on between the devices, but if you probe various vendors who have architected Active/Active type solutions, there is evidence to suggest that the above can happen. My verdict?
Active/Active is great for planned downtime (shift all transactions to one of the nodes whilst the other goes into maintenance mode) and as an ‘in-place’ automatic data recovery facility. For ultra-high availability in the first instance, though, use Fault Tolerant technology!