VMotion Is Hyped by Consultants, But Do Admins Even Use It?
Some time ago I attended a presentation by a former VMware, now Microsoft employee who claimed that Hyper-V’s lack of Live Migration (Microsoft’s counterpart to VMotion) is not relevant at all. According to him, the only people vigorously demanding such a feature are consultants, never customers. At the time I thought: “What a silly marketing stunt this is. Microsoft does not have it, so they tell everyone that it is not really needed until they have it.”
Today I read a news article (beware: it’s in German) citing Christofer Hoff who essentially says that most people do not use tools like VMotion outside of planned manual activities at their current maturity level. His audience at RSA Conference 2009 seemed to second that – only a handful of those using VMware hypervisors also use VMotion in non-failure situations.
Clustering = Complexity
That got me thinking. I remembered a project where I advocated the use of Windows clusters to increase the availability of critical file and print servers. The customer was with me, and Windows clustering it was. Design and implementation of the new clustered servers went well. Demonstrating the move of virtual servers between nodes is an experience that quickly convinced even the greatest skeptic. It’s simply cool to see a (virtual) server go offline on one node and come back seconds later on another, serving files as if nothing had happened.
By the way: I am talking about Windows failover clustering. Although the terminology is similar, clustered virtual servers are not related to virtual machines. But there are striking similarities…
After the new clustered servers went into production, at first everything was great. Until, slowly but steadily, reality crept in. And reality can be ugly.
With clustering you add a whole new layer of complexity to an already complex system. You need fibre channel, shared disks, clustering software, multipath IO and whatnot. While single servers are complex enough, they are still considered manageable; clustered servers present a far greater challenge to the administration staff. And, oh yes, the staff really needs to understand what is going on, or in case of a failure you are in really deep shit.
Put simply, we discovered that if one hard disk (LUN) accessed by one server has a complexity level of 1, a hard disk accessed by multiple clustered servers has a complexity level of at least 4.
Do not misunderstand me: clustering is a great technology, and Microsoft is putting a great deal of effort into making it simpler to handle. But still, you need a highly trained staff to manage the beast, while single servers can be administered by the average admin. The difference in pay levels should be taken into account when thinking about actually implementing clustering.
Clustering = VMotion / Live Migration?
Now let’s go back to the original topic, VMotion and Live Migration (finally to be included in Server 2008 R2). Like failover clustering, these technologies are great. But technology has no value by itself. Only if it helps us do things easier and/or faster can we derive value from it. And as with clustering, VMotion introduces many new dependencies and layers that add to the complexity of the system, so its use is justified only in select cases.
So what the presenter cited at the beginning of this article said may well be true. Microsoft did encounter problems with the implementation of Live Migration. But does it matter? It may well be that they only worked on this feature because VMware has it, and because consultants and analysts do not consider their product to be up to par without it.
12 Comments
We use VMware clustering at our office, not so that VMotion can be used, but so that individual ESX servers become all but irrelevant to the host users. The cluster allows me, an admin, to bring one host down for an upgrade without having to coordinate this activity with the users of the dozen or so VMs on it.
But clustering was cheap and easy, in that if I wanted to push my developers to a single portal for accessing VMs (as opposed to accessing ESX servers), I had to implement everything required to set up the clusters anyway.
If you are in the hosting business and have multiple clients, VMotion (and the future Live Migration) is invaluable. The same goes for companies that internally host virtual servers for multiple departments. To quote you: “Only if it helps us do things easier and/or faster can we derive value from it”. This is exactly what VMotion does in this case.
I’ve been working as a VMware (and MS Virtual Server) admin for a couple of years in this industry, and one of the major pains is the coordination required when the underlying host or operating system has to be patched or updated, or has to have hardware replaced. Since a physical host contains multiple virtual machines, you have to contact the owner or responsible person of each individual machine to coordinate downtime. Different clients and departments will have different SLAs or agreements. This is always a pain with single server hosts, both for ESX and MS Virtual Server/Hyper-V, but with VMotion you can move your 10-20 virtual machines off the host, during daytime, without impact and then replace hardware or patch the underlying ESX software.
I cannot stress enough how valuable it is not to have to restart the actual virtual machine. As with physical servers, a virtual machine might be only one part of a bigger server farm handling an application. If the machine is a database server, for example, you suddenly have another layer of coordination, because all application servers dependent on this database server have to be coordinated as well, and some might even have special restart procedures because the application on top might not reconnect automatically.
All that coordination overhead when maintaining the underlying host is gone with VMotion. It is an overhead that just gets bigger the more virtual machines you have on your host. In practice, without VMotion, virtualization can lead to more total downtime for your applications than if they were on standalone servers, since you now also have to take into account planned downtime for the underlying host.
Another benefit with VMotion is that you also can do Storage VMotion. For example, if the underlying SAN LUN runs out of space you can move the entire VM, and its data, to a new bigger LUN without downtime. We also utilized Storage VMotion when upgrading our SAN environment. We moved 250+ virtual servers, from the old SAN to the new SAN without a single stop or downtime required on any of the virtual servers.
From a service level fulfillment perspective, in your daily operations, VMotion is a major help and has true real world usefulness. Regarding complexity, at least on ESX level the multipath software and cluster technology is built in, so I would say that it is a tiny bit easier than a Microsoft cluster, where you have to take into account HBA firmware/drivers and compatibility with Windows Service Packs and more.
The value you get from an ESX cluster with VMotion, HA and DRS, in a hosting business environment, is incredible though. Live Migration will provide the same benefits.
I find the first part of your argument valid: MS Clustering does add complexity, most of which very few people understand. But I don’t really see how that ties into VMotion. If you are concerned that the added complexity of clustering isn’t worth MS Live Migration, that could certainly be true.
VMware’s VMotion technology by itself doesn’t add any additional complexity, so it’s technically incorrect to use VMware’s VMotion and Microsoft’s Live Migration interchangeably.
You also state that VMware’s VMotion technology isn’t used by many customers. I don’t agree with that, especially since it isn’t backed by any data. I would argue that automated VMotion technology – used for High availability, Distributed Resource Scheduling, Automated guest updates, host updates, or Power management (the list can go on) is all based off the fact that you can automatically optimize your datacenter without *any* downtime. VMotion is the basis for a self-regulated, self-correcting datacenter. I think that is certainly valuable!
I completely agree with Magnus.
Even if you are in a “traditional” business running more than ten virtual machines you will find VMotion invaluable.
Having a clustered approach with all the associated complexity is the price you have to pay when you are moving to virtualization (too many eggs in one basket…).
The good thing is that, in many cases, we can stay away from all the cluster complexity at the virtual machine layer because they all “inherit” the underlying high availability.
The secret is keeping the built in cluster technology as simple as possible and, in that, ESX really rocks.
Hi there.
Just to clarify, I never said “…that the additional complexity introduced by online VM migration techniques like VMotion is not relevant since they won’t be used anyway.”
What I said is that, given where we are with the capabilities of the platforms and the autonomics, orchestration and provisioning maturity of solutions today, most people do not use tools like VMotion outside of planned manual activities.
You can see my summary here:
http://www.rationalsurvivability.com/blog/?p=764
Thanks,
/Hoff
Hi Christofer,
Sorry for the misquote. I used the German news article I linked to as a source. Maybe during translation things got simplified a bit. Anyway, I have changed the quote in the article and hope this is OK. If not, please let me know.
So… compared to MSCS, vMotion setup is trivial (a bit of config… but really quite simple overall).
Both in my previous job as senior network/systems engineer and now as a systems engineer for a VAR, vMotion is invaluable.
vMotion is what allows me to treat multiple servers as a combined pool of resources. Dynamic rebalancing of VMs across servers as utilization changes inside the VMs saves tons of time (and just works).
I could ramble about this for a while….but for the environment I built (previous job) and many of the environments I’m consulting in (current job) vMotion is absolutely invaluable (as well as DRS layered on top).
I can see what you’re writing, but I only read gibberish.
It’s like saying that “Anti-lock brakes, ESP and seatbelts add to the complexity of driving.”
I totally agree that MS Clustering adds a thick layer of complexity, but vMotion and DRS?
I’m stumped.
The fear of doing things better is obviously out there; it’s some sort of job-protection scheme. But I don’t know one installation that doesn’t use vMotion with DRS. The only arguments I’ve heard against it are based on pure ignorance and/or a lack of understanding of the technology. It’s like listening to creationists and trying to get a grip on all the inconsistencies in their “truthology”.
We have had MS clusters on NT 4, Windows 2000 and Windows 2003. In all versions the clusters were complex and unstable (less so in modern versions).
We have had a VMware cluster for one year, and it was trivially easy to configure. In less than 15 minutes you have the cluster running. We use DRS in fully automated mode, with dozens of VMotions per day without any issue.
Everyone I know is using VMotion and DRS in production environments.
So if you are asking this question, I think you have never managed a VMware cluster.
Your arguments do not have any validation; they are simply a guess.
VMotion is used by EVERYONE, every single customer. It is such an invaluable feature that one might reasonably think your article is severely biased. Simply nonsense, to confuse misinformed people.
I think the answer is not
Clustering = Complexity, it’s MS Clustering = Complexity
VMware’s power is in making clustering simple, it’s the killer feature. If live migration does the same job, well done Microsoft for managing to copy the feature, maybe the MS innovation part is being able to give it away for free.