Is VMware Clustering / VMotion Complex Compared to Microsoft Failover Clustering?
My last post on VMware VMotion prompted several readers to protest, maybe because of its provocative title. What I did was compare VMware clustering with Microsoft failover clustering. I came to the conclusion that both significantly add to the complexity of the environment. Interestingly, most commenters said yes, Microsoft clustering is complex, but no, VMware clustering is not – yet failed to explain exactly why.
The complexity of clusters only partly lies in the software actually doing the clustering. What makes the stuff complex is the infrastructure you need for clustering. Let us compare Microsoft’s and VMware’s requirements:
Requirements for Microsoft Failover Clustering
Instead of defining hard requirements, Microsoft gives recommendations. In practice the following setup is often used:
- Shared storage on a SAN
- Dedicated private network for intra-cluster communication
- Similar hardware and software on all nodes
Requirements for VMware Clustering
In contrast to Microsoft, VMware very clearly defines hard requirements:
- Some kind of shared storage, typically on a SAN
- Dedicated private network for intra-cluster communication
- Compatible set of processors in all cluster nodes
It’s the Infrastructure!
Now, these requirements are strikingly similar. What should be noted, though, is that with VMware they are actual hard requirements, while with Microsoft they are only recommendations.
So, where does complexity come into play? With clusters, you simply have more components to worry about than with single servers. When administering a node, you always have to keep in mind that there are other nodes that should be configured similarly. Your network guys must set up and maintain an additional logical network. But most importantly, you need a SAN. Don’t tell me SANs are simple, because they are not. And since SANs really need to be highly available, you need everything redundant: HBAs/NICs, switches, I/O controllers and so on. Just understanding a diagram of such a setup is way over the head of a large portion of the typical IT staff, let alone designing and managing it.
As before, I want to point out that I am not against clusters in any way. But I happen to think that managing clusters along with all the required infrastructure is not a trivial job. In many cases, the benefits of a clustered solution will outweigh the trouble, but there may be situations where clustering is not justified.
Also, I try not to be biased. I have already written articles “pro VMware and con Microsoft”.
When comparing the complexity of VMware and Microsoft clustering I cannot find any fundamental difference. This is neither pro nor con VMware, but there seems to be a great number of VMware evangelists around who take it as an affront if any of VMware’s products is considered less than miles ahead of the competition, especially Microsoft.
Of course, I am willing to learn. If you feel that I have misjudged the situation, please tell me exactly where. Do not write about how great or powerful VMotion is – I already know that – but explain why it requires a less complex infrastructure than Microsoft failover clustering.
I think there is a bit of confusion around what the equivalent of a cluster is within these two products.
VMware’s equivalent to MS’s failover cluster is the High Availability (HA) cluster – not to be confused with the Distributed Resource Scheduler (DRS) cluster.
For the HA cluster there is only one requirement, and that is shared storage, i.e. NFS/iSCSI/SAN space. Obviously the cluster servers need to be able to ping one another, and that is enough – no private network needed.
HA works between AMD and Intel CPUs.
For the DRS cluster the requirements are:
– shared storage
– compatible CPUs
(This is not quite a hard requirement, as the EVC feature supports VMotion between Intel Penryn and AMD Opteron (Rev E/F) CPUs – and newer.)
Once again, no private network is required.
Apples vs. lemons – you decide which is which ;-)
thanks for pointing out the difference between a HA cluster and DRS cluster. However, rereading VMware’s library page I cited above it sounds to me as if the requirements for “a compatible set of processors” and a private network apply to VMotion, too. It says, for example: “VMotion requires a private Gigabit Ethernet migration network between all of the VMotion-enabled managed hosts.” What do you make of that?
A “private network” is not required, as I run HA and DRS on public networks…
Best practice would be to have private network but that is purely for performance/security reasons.
As this article was comparing HA with MS failover clustering, I thought I’d try to clear up the easily confused HA and DRS cluster types.
And the bottom line is that HA only “requires” shareable storage.
PS. Great background info available about the technical aspects can be found at: http://www.vmware.com/support/
Oops, I seem to have accidentally deleted a comment by Fabio Secchia. Sorry! I got confused with all the spam comments…
hoping not to be “antispammed” again…
I suggest you have a look at how easy it is to implement a VMware cluster using an NFS NAS (as you know, something not possible with Microsoft cluster solutions).
If you follow that path you can keep the complexity at a minimum:
– No need to play with “complex” FC SAN jargon (zoning, masking, multipathing…)
– Simple volume provisioning (on almost every NAS I know, it’s a “next, next, yeah, yeah” Microsoft-style operation)
– Simple and effective IP connectivity (bonding two Ethernet interfaces is not rocket science; you usually have to do it even in the most basic single-server scenario)
– No need to have a “Quorum” device
– No need to have a “heartbeat” primary and secondary connection
– No need to know what a “Split brain” situation is
– Very simple scaling out (just provision a new “building block” hardware device, drag it onto the vCenter cluster icon, and your cluster grows)
If this is not enough to convince you that this stuff is REALLY simple (at least compared with all the other cluster solutions I have to manage every day), you can add a simple Gigabit Ethernet (switched) connection between the nodes (and some money for the licenses, too…), and that’s all you need to enable that little piece of magic called VMotion (something never seen before, even in what some call “high-end cluster solutions”).
From my point of view, vmotion and dmotion (aka storage migration) have very little to do with clustering issues (at least from the high availability perspective).
They are just VERY CLEVER solutions that exploit a clustered infrastructure to help us work better in this VERY COMPLEX world!
thanks for posting again!
I agree with what you write. The complexity I was thinking of is mainly related to SANs. In my experience, no matter what clustering solution you use, if a SAN is involved, it gets complex. I hope you can let that one pass ;-)
Apples and oranges – the two terms are not the same. Clustering in VMware is more about load balancing based on resource utilization; it moves VMs based on a set of rules. HA is more in line with MSCS: one host dies, the other starts up the VMs…
These are not to be confused.
While working for a Florida biopharmaceutical company, I ran a MS cluster that hosted our Prd/Dev SAP system. It ran on HP servers with an HP VA SAN/Brocade switch hung off it. Yes, it had a quorum drive, a heartbeat LAN (an extra NIC, no biggie), a public 100BaseT network and a private Gigabit (backup use) network. It was more complex than a pair of HP DL servers, but it delivered SLA-beating uptime, four-plus nines. We also had a server farm of nearly 100 servers, two other SANs – one for email/file service and one for other Oracle data stores – fiber-attached backup libraries, and synchronized NDS and AD directory services.
We did have the hardware, OS, SAN and fiber expertise to run our systems, predominantly supplied by myself, along with high-level support from HP. I would think that any shop running SAP, SANs or other highly technical systems would have, and need, both expertise and high-level vendor support. SAP, and systems like it, become so critical to a company’s ability to function that support is a true requirement for implementation.
And you are correct, it was a beautiful thing to watch SAP and Oracle fail-over from one node to the other.
The VMware cluster is simpler in comparison to an MS cluster, as there is no configuration of applications, no compatibility requirements and, as others have stated, no quorum drive.
Jim is correct, as the two are totally different.
As for the SAN part, a FC SAN is relatively simple; a FC card is little more than a SCSI HBA with a different connector on the back.
Zoning is relatively simple as well: just keep ALL zones to 2 members only and you are set. This is far simpler than using iSCSI and having to keep different IP subnets all over the place. Most physical servers have 2 HBAs and the SAN array has 4 targets, so there will be 8 zones (4 per switch).
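The zone count the commenter describes can be sketched out. A minimal Python illustration of single-initiator/single-target zoning, assuming (as in the comment) one HBA per fabric and 4 array target ports visible on each switch – all port names here are made up for the example:

```python
from itertools import product

# Hypothetical port names: one server HBA per fabric (dual-switch setup),
# and the array presenting 4 target ports on each fabric.
fabrics = {
    "switch_A": ["hba0"],
    "switch_B": ["hba1"],
}
targets = ["ctrl0_p0", "ctrl0_p1", "ctrl1_p0", "ctrl1_p1"]

# Single-initiator/single-target zoning: every zone has exactly 2 members.
zones = {
    fabric: [(hba, tgt) for hba, tgt in product(hbas, targets)]
    for fabric, hbas in fabrics.items()
}

for fabric, zone_list in zones.items():
    print(fabric, len(zone_list))          # 4 zones per switch
print(sum(len(z) for z in zones.values())) # 8 zones in total
```

With 2 HBAs and 4 targets, the enumeration lands exactly on the 8 zones (4 per switch) mentioned above, and each zone stays at 2 members.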
NFS has its place in the VMware/ESX world, and that is as a backup device for storing backups of entire VMs. IP-based storage in general is not so good, as it cannot handle high utilization and tends to die badly when pushed hard.
Fault tolerance is the next major jump in this area. VMware HA is nothing more than an auto-restart daemon, which has some issues of its own.
FT has the same VM running on 2 hosts at once; when one goes down, the NIC is connected on the other and the applications are still in memory. Apart from 1–2 packets being dropped (ARP table updates on the switches), there is no interruption due to a host failure.
An apples-to-oranges comparison.
VMware cluster is the building block of a VI infrastructure. MS Clustering is to protect an individual application.
An enterprise VI infra will always have a cluster.
Would you recommend that a customer not use an ESX cluster because it is complex?
Would you tell a customer not to use VMotion to avoid complexity, since “he/she will not use it”?
David, I am going to have to disagree with your assessment of IP storage, particularly NFS. I think Oracle would also tend to disagree, as their direct NFS implementation out of their database actually outperformed fiber on the latest NetApp GD OnTap release. The world’s largest Oracle implementation (Oracle’s ASP hosting center in Austin, TX) runs NFS on NetApp.
While I agree that zoning is simple, fiber just doesn’t scale well in large VMware environments. Try setting the pathing for 150 LUNs on 20 ESX servers – believe me, it is a chore! Again, if you want to add storage you are going to need to rescan the HBAs on every ESX server in your environment. I don’t know what IP storage systems you have been exposed to, but it sounds to me like you need to look again at some of the options that are around now.
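The scaling argument above is simple arithmetic; a quick back-of-the-envelope sketch using the numbers from the comment (150 LUNs, 20 ESX hosts):

```python
# Numbers taken from the comment above; purely illustrative.
luns = 150
esx_hosts = 20

# Setting a preferred path per LUN must be repeated on every host.
path_assignments = luns * esx_hosts
print(path_assignments)  # 3000 individual path settings

# Presenting one new LUN still means an HBA rescan on each host.
print(esx_hosts)         # 20 rescans per storage addition
```

Three thousand per-host, per-LUN settings is where the “it is a chore” claim comes from: the work grows with the product of LUNs and hosts, not with either one alone.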
The main difference between Windows clustering and VMware clustering is this:
1. If the operating system crashes or fails to reboot, VM clustering is of no use. But Windows clustering will keep your application up and running on another machine.
2. RPO and RTO will be affected for some technologies if your VM is damaged – for example, a database server or an Exchange server. With Windows clustering you are still up and running.
3. You still get better performance when you use Windows clustering.
4. With Windows clustering you can still use the server for multiple virtualized applications with failover.
5. Windows clustering (application virtualization) came into this world first; VMware came later with resource/“OS” isolation.
6. The only good VMware feature is resource utilization – and if you add more VMs, performance will suffer.
From these points you can decide which are the pros and cons.
If you find better answers, shoot a reply. Thanks.
The thing to remember is that Microsoft clustering provides application-level awareness. It is a Microsoft-supported solution and is gaining new technologies for seamless client application failover, such as SMB transparent failover in SMB 3.0.
In the case of a host failure, VMware will restart the VM (a “cold VMotion”) on another server, but the application or service will be down for a brief time while the operating system starts. With Microsoft clustering, the cluster detects that the host is unavailable and fails over in seconds to another node operating in standby, ready for the workload. This is a much faster recovery time and does not require any intervention.
You can virtualize the Microsoft clustering solution, but you will lose some benefits, e.g. snapshots, vMotion, DRS, Storage DRS and Storage vMotion.