Some time ago I attended a presentation by a former VMware employee, now at Microsoft, who claimed that Hyper-V’s lack of Live Migration (aka VMotion) is not relevant at all. According to him, the only people vigorously demanding such a feature are consultants, never customers. At the time I thought: “What a silly marketing stunt this is. Microsoft does not have it, so they tell everyone that it is not really needed until they have it.”
Today I read a news article (beware: it’s in German) citing Christofer Hoff, who essentially says that, at their current maturity level, most people do not use tools like VMotion outside of planned manual activities. His audience at RSA Conference 2009 seemed to second that: only a handful of those using VMware hypervisors also use VMotion in non-failure situations.
Clustering = Complexity
That got me thinking. I remembered a project where I advocated the use of Windows clusters to increase the availability of critical file and print servers. The customer was with me, and Windows clustering it was. Design and implementation of the new clustered servers went well. Demonstrating the move of virtual servers between nodes quickly convinced even the greatest skeptic. It’s simply cool to see a (virtual) server go offline on one node and come back seconds later on another, serving files as if nothing had happened.
By the way: I am talking about Windows failover clustering. Although the terminology is similar, clustered virtual servers are not related to virtual machines. But there are striking similarities…
After the new clustered servers went into production, at first everything was great. Until, slowly but steadily, reality crept in. And reality can be ugly.
With clustering you add a whole new layer of complexity to an already complex system. You need fibre channel, shared disks, clustering software, multipath IO and whatnot. Single servers are complex enough, but they are considered manageable; clustered servers present a far greater challenge to the administration staff. And, oh yes, the staff really needs to understand what is going on, or in case of a failure you are in really deep shit.
Put simply, we discovered that if one hard disk (LUN) accessed by one server has a complexity level of 1, a hard disk accessed by multiple clustered servers has a complexity level of at least 4.
Do not misunderstand me: clustering is a great technology, and Microsoft is putting a great deal of effort into making it simpler to handle. But still, you need a highly trained staff to manage the beast, while single servers can be administered by the average admin. The difference in pay levels should be taken into account when thinking about actually implementing clustering.
Clustering = VMotion / Live Migration?
Now let’s go back to the original topic, VMotion and Live Migration (finally to be included in Server 2008 R2). Like failover clustering, these technologies are great. But technology has no value by itself. Only if it helps us do things more easily and/or faster can we derive value from it. And as with clustering, VMotion introduces many new dependencies and layers that add to the complexity of the system, so its use is justified only in select cases.
So what the presenter cited at the beginning of this article said may well be true. Microsoft did encounter problems with the implementation of Live Migration. But does it matter? It may well be that they only worked on this feature because VMware has it, and because consultants and analysts do not consider their product to be up to par without it.