Can VMWare be Evil?

by Mike Levin SEO & Datamaster, 06/18/2006

To my knowledge, this is the first time the political aspects of virtualization have been explored on the Internet.

So, server virtualization has a dark side that is rarely spoken about: many decisions that used to be hardwired and cast in stone with your selection of hardware are suddenly detached from that hardware. It is quite possible that your mission-critical application, which one day is receiving the maximum resources allowed by your most powerful multi-processor server, is the next day on a resource-starved, should-be-retired server… because your sys admin is pissed at you. Yes, it happens.

This is one such story.

That’s right. A very powerful, flawless application can be made to appear flaky and unreliable through virtualized hardware. Availability of your application can effectively be dialed down or dialed up according to political whims–with much greater ease and less detectability than on the equivalent native hardware. Virtualization introduces so many variables to the equation that, unless you have admin-level rights to the host server, it will be very difficult for you to isolate the reasons for performance drops. You could be sharing resources with four other virtual machines, each one with a heavy server load, starving you for cycles. Your VM could have been moved to an entirely different host machine. You don’t know. You have no recourse. You look bad through no fault of your own.

Thanks, virtualization.

Taken to the extreme, an out-of-control sys admin could use this technique to plague an application developer into frustration–never quite being able to isolate performance bottlenecks and optimize the app. Every time you gain some ground, some inexplicable new slowdown hits the system, and standard diagnostic procedures and load tests are meaningless… because… you’ve… been… virtualized.

In the very worst-case scenario, this can be used as a political weapon by a sys admin who has an axe to grind with an application developer. They can easily make the case that the application is not optimized, and that the developer is a poor programmer. Without objective hardware benchmarks, the typical application developer has no recourse. They are simply trounced and ripped to shreds in an attack that is invisible to all but one person, who gets to sit back and enjoy the havoc.

I’ve seen it. It’s not pretty. The ironic part is that it’s so difficult to explain that the boss won’t get it, which brings a huge grin to the sys admin’s face.

Yes, an application developer can try to take their development system home and run it on dedicated hardware. But should the burden of proof be laid on them? And not all developers are allowed to take their code home. And should it even be an issue when the entire company agrees that the application is mission-critical, and as such should have every ounce of resources a server has to offer? Often the hardware is chosen specifically for the application, such as SQL Server, in which case virtualization is particularly harmful.

For example, the SQL Server hardware spec calls for multiple processors, because SQL Server is optimized for multi-processor multi-threading. In any version of VMWare Workstation before 5.5, the virtual PCs could only access one processor. Another attribute of the SQL Server hardware spec is LOTS of memory, for many obvious reasons. With VMWare, the host OS needs its slice of memory. You can run up to 5 instances of virtual machines–whose memory CAN’T expand and contract as needed. Meaning, you’re likely to get only about one sixth of the memory in the entire server, throttling SQL Server’s performance even further. Let’s move on to the hard drive RAID configuration. The SQL Server spec recommends 6 or 8 drives, so four can be dedicated to a RAID-10. Striped and mirrored offers the best combination of speed and fault tolerance. With 8 drives, the remaining four can be two sets of mirrors: one for the OS and app, and the other for the transaction log. This way, there is minimal read/write contention between the transaction log, the data and the app. Needless to say, virtualizing the hard drives through software blows away almost all the speed benefits of this spec, especially if the locations of the virtual drives don’t correspond to the real-hardware spec.
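To put rough numbers on the memory squeeze, here is a minimal sketch in Python. It assumes the simplest possible model, which is my own simplification and not anything from VMWare documentation: the host takes a slice the same size as each guest, and every allocation is fixed. The function name and the 12 GB server are hypothetical.

```python
def per_vm_memory_gb(total_gb: float, vm_count: int, host_shares: int = 1) -> float:
    """Split memory evenly between the host OS and a fixed number of VMs.

    Assumption for illustration only: the host takes the same size slice as
    each guest, and allocations are static (they cannot grow or shrink).
    """
    if vm_count < 1:
        raise ValueError("vm_count must be at least 1")
    return total_gb / (vm_count + host_shares)


if __name__ == "__main__":
    total = 12.0  # hypothetical server with 12 GB of RAM
    share = per_vm_memory_gb(total, vm_count=5)
    print(f"Each VM gets about {share:.1f} GB of {total:.0f} GB ({share / total:.0%})")
```

Five guests plus the host makes six slices, which is where the "one sixth" figure comes from. A real host would rarely split memory this evenly, so treat the result as a ballpark.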

I haven’t even mentioned the cycles that are stolen merely by the virtualization process. Admittedly, this is probably where the least speed loss is occurring compared to the factors outlined above. But by VMWare’s own admission, you can lose up to 20% CPU performance from virtualization. Is that acceptable in a mission-critical app that needs near-100% availability?

And a final concern is what happens after a power loss. Dedicated hardware can be configured to gracefully reboot, run through a system check, and spring back into operation. VMWare Workstation in particular is not nearly so graceful. Because each VM session is running as an application instead of a service, a user on the host server needs to log in and start each session. So downtime after a power loss is measured in hours (sometimes, days) instead of minutes. Yes, the server versions can alleviate this problem by running the sessions as services. But my sys admin inexplicably clung to the Workstation version, so I was repeatedly bitten by this factor as well–more apparent flakiness in the app that is not in the app.

During the move to VMWare, the performance-loss estimates that this sys admin gave me were as low as a 10% loss from the original speed. But he adamantly refused to do a real-hardware vs. virtualized-hardware test. It literally could have been a 5-minute TPS benchmark test–once the hardware was actually configured according to the SQL Server spec. But the six drives were incorrectly configured as three sets of mirrors (3 RAID-1’s), losing the striping speed boost of RAID-10. It was based on VMWare 5, so only one processor was accessible. And it was sharing time with other VMs. Once the servers were liberated, I was able to see that the performance drop under his scheme would have been roughly 60 to 75%. In other words, my app would have appeared to be running at as little as 25% of its true speed.
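For anyone who wants to run that kind of side-by-side test themselves, here is a rough Python sketch of a transactions-per-second (TPS) measurement. It is not the benchmark I used. The function names are hypothetical, and the workload is an in-memory SQLite insert standing in for whatever transaction your real application issues against SQL Server. The idea is to run the same script on the native box and inside the VM, then compare.

```python
import sqlite3
import time


def measure_tps(run_transaction, duration_seconds: float) -> float:
    """Call run_transaction repeatedly for duration_seconds; return transactions/second."""
    count = 0
    start = time.perf_counter()
    deadline = start + duration_seconds
    while time.perf_counter() < deadline:
        run_transaction()
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed


def performance_drop(native_tps: float, virtual_tps: float) -> float:
    """Fraction of native throughput lost when virtualized (0.75 means 75% lost)."""
    if native_tps <= 0:
        raise ValueError("native_tps must be positive")
    return 1.0 - virtual_tps / native_tps


def make_sqlite_transaction(conn):
    """Stand-in workload: one INSERT per committed transaction."""
    conn.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, payload TEXT)")

    def run():
        with conn:  # the connection context manager commits on success
            conn.execute("INSERT INTO t (payload) VALUES (?)", ("x" * 100,))

    return run


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    tps = measure_tps(make_sqlite_transaction(conn), duration_seconds=0.5)
    print(f"{tps:.0f} transactions/second on this machine")
```

With both numbers in hand, `performance_drop(native_tps, virtual_tps)` gives the loss as a fraction. A result of 0.75 means the virtualized setup runs at 25% of native speed, the worst case described above.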

We would have NEVER known this had I not liberated the servers. And everyone would have cast suspicious glances at me for the flakiness.

This was particularly eerie considering the sys admin in question repeatedly criticized me for my programming style. What I later realized was that he simply had never been exposed to an agile programming framework before. I have been vindicated on that front as well with the proliferation of systems that make intensive use of ad hoc SQL, such as Ruby on Rails, and the Hot Spot optimization and execution plan predication/retention that’s constantly improving in SQL Server. Not everything needs to be a stored procedure anymore in order to be called optimized. It helps, but look at your SQL compilations per second and CPU usage EVEN WHEN using ad hoc SQL queries. You will be surprised. “Some” sys admins will blindly parrot the opinions of old-school database admins without knowing the state of technology. So, this entire VMWare mis-adventure that I was subjected to was, in my opinion, a sys admin’s exercise in making his own false assertions true by whatever means necessary.

And VMWare was unfortunately those means. Needless to say, this experience soured me on the company, and made me more suspicious of the motives behind virtualization in some cases.

Virtualized hardware, and VMWare in particular, was used as a political weapon in this instance. A sys admin who decided to make an application developer look bad had a lot of power to do so. Dial up performance, dial down performance. Some might argue sys admins had this ability before, but that doesn’t hold up when you start using diagnostic tools and benchmark performance tests. It’s all about eliminating variables that could be adversely affecting performance. And with virtualization, you are isolated from these variables, just as you are isolated from the hardware–sometimes, even from knowing which VMWare host you’re actually running on! In some of the most extreme conditions, you could be giving up as much as 75% of the performance you would get just by running on the native hardware.
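One thing you can do from inside a VM, even without rights on the host, is time an identical workload over and over and watch how much the timings wander. The Python sketch below is only an illustration under my own assumptions. The workload is an arbitrary CPU-bound loop, the function names are hypothetical, and a large spread doesn't tell you why the timings vary, only that something outside your code is changing.

```python
import statistics
import time


def fixed_cpu_workload(iterations: int = 200_000) -> int:
    """A deterministic, CPU-bound loop; the work itself is arbitrary."""
    total = 0
    for i in range(iterations):
        total += i * i % 7
    return total


def time_runs(workload, runs: int = 5) -> list:
    """Time the same workload several times; return the durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)
    return durations


def relative_spread(durations) -> float:
    """Population standard deviation divided by the mean (0.0 means perfectly steady)."""
    return statistics.pstdev(durations) / statistics.mean(durations)


if __name__ == "__main__":
    samples = time_runs(fixed_cpu_workload, runs=5)
    print(f"mean {statistics.mean(samples):.4f}s, relative spread {relative_spread(samples):.1%}")
```

Logged at regular intervals over days, these numbers give you the kind of objective record that is otherwise so hard to produce when someone claims the slowdown is in your code.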

While VMWare is very nice for sandboxes and development systems, give careful consideration to whether you use it for mission-critical apps, where availability is a concern. In fact, I encourage app developers to respond to attempts to virtualize mission-critical apps by questioning whether it wouldn’t be wiser to explore the other direction: clustering. One server failing can take down a whole enterprise with virtualization. In contrast, dedicating multiple servers to a single app moves closer to fail-safe reliability AND higher availability.

I’m sure virtualization and clustering will come together beautifully some day in a perfect synthesis. Hypervisors and hardware-level support for virtualization are taking us closer to that vision. But it’s not all about CPU. It’s also about hard drives and priorities. Do you have to go through needless new I/O bottleneck layers? Which virtualized sessions have 100% availability requirements? Can you add load balancing and clustering techniques to bring the performance of a virtualized server up to that of native hardware that has been fully tweaked-out for performance, such as the SQL Server hardware spec? As an application developer, THAT is the day that I will hop on the virtualization bandwagon for production servers.