Snapshots in VMware and how to survive them.

vmwareWe survived a rather scary day with one of our main file servers on Sunday.  This server was one that we had only virtualized several weeks before and contains a large amount of critical data. I’m a huge fan of VMware and their products have transformed the way we do IT at Watermark, but yesterday was not fun.

Sunday morning I received a call that the server wasn’t responding, and on further review noticed that the server’s data store was completely out of space.  The server would start for a few minutes, but then error with “There is no more space in the redo log for servername-00002. You may be able to continue by freeing disk space on the relevant partition.”  This was the beginning of our lesson on VMware snapshots.  ***side note, we have Gold Support for VMware, which you think would be good, but no… if you want support outside the hours of 6am-6pm M-F, you need platinum support… Nice***  But I digress.

Last Saturday we had taken a snapshot which we had subsequently forgotten about.  When you take a snap in VMware, the system puts the original VMDK (virtual disk file) into a “holding pattern” and begins to write changes to a new virtual disk file, in our case the servername-00002 file.  The best practice of course, is to do a snapshot, make your changes, and then immediately delete the snapshot; at which point all of the changes will be written back into the original “holding pattern” VMDK and all is well.  Unfortunately, the system doesn’t do anything to remind you that the snap still exists if you forget to do this.  At the time of our discovery, the new 00002 file had grown to the size of 21 gigabytes and had filled up all of the available disk space.  This to me seems like something VMware should implement, a reminder that snaps are growing like crazy and about to take you out at the knees.

So our immediate course of action was to keep the server stopped (it wouldn’t run for more than a few minutes anyway before falling over), and get a complete copy of our file system from the SAN as a backup.  After copying nearly 60 gig from the SAN to a different location, it was time to attempt removal of the snapshot.

We went into Virtual Center, under Snapshots, and snapshot manager and saw the snapshot from last Saturday that we wanted to remove, and promptly removed it.  The task started and then hung at 95% for about fifteen minutes, at which we received a message that the “Operation Timed Out.”

Now, here is the kicker. You would think that a message like that would be a prompting to try again, but after lots of research it appears that the process has not really timed out at all.  Because the “tracking changes” VMDK is so large, it has lots of data to roll back into the original.  So in reality, the process is still running in the background and you just need to give it time to finish.  In fact, many people have said that reissuing the “remove snapshot” command will in-fact kill your data.  Not good VMware.

Fortunately, we found this information out before trying to remove the snap again.  Surely enough, two hours later, the process finished running and we were back to our original VMDK file.  The server started up with plenty of storage available again and all is well.

Like I said earlier, I absolutely love VMware, but I will not be using the snapshot capability in the future.  I think I will stick with EqualLogic snapshots, which seem to be faster AND safer.

I’d love to hear your comments on what we did right/wrong and how VMware has worked for you.