EqualLogic User Conference – Day Two Recap

I could have titled this post “Why my brain is full”, or “My experiences drinking from a firehose”. That’s how I feel after yesterday. Day two of the EqualLogic user conference was wall-to-wall information overload.
As I have said before, I am usually underwhelmed by conferences. All smoke and no fire. But this was NOT the case here.
We started the day hearing from Dell’s leader in enterprise strategy, followed by sessions on networking, VMware integration, Dell’s HIT kit, ASM for Windows, ASM/VMware edition, MPIO, solid state drives, and many others.
The only complaint I heard from attendees was that the sessions were so back-to-back that we didn’t have time for a deep breath (or brain reboot) before running to the next topic. Crazy good stuff.
A highlight for me was my final session covering the top 10 questions that come up in EqualLogic support and how to solve them, led by Vernon Miller. He’s a great teacher in my opinion and it reminded me of sitting in a college classroom. We covered a lot of ground on what might seem like basic topics, but things we’re all likely to encounter at some point.
The only area for improvement could have been the lunch session led by an outside storage consulting firm. Really good, brilliant guys, but the topic of iSCSI trends was too heavy for a lunch session and was telling most of us what we already know to be true… that “iSCSI can be fraught with pitfalls if you make bad choices.” But most of us haven’t experienced that BECAUSE we are already on EqualLogic.
To conclude, day two was a big success. Dell has done a great job and continues to confirm what we already know to be true: that we made the right choice on storage. So now my brain is still full and we still have another day to go. Time to get off this DART train and get educated. Wish me luck.

Snapshots in VMware and how to survive them.

We survived a rather scary day with one of our main file servers on Sunday.  This server was one that we had only virtualized several weeks before and contains a large amount of critical data. I’m a huge fan of VMware and their products have transformed the way we do IT at Watermark, but yesterday was not fun.

Sunday morning I received a call that the server wasn’t responding, and on further review noticed that the server’s data store was completely out of space.  The server would start for a few minutes, but then error out with “There is no more space in the redo log for servername-00002. You may be able to continue by freeing disk space on the relevant partition.”  This was the beginning of our lesson on VMware snapshots.  ***side note, we have Gold Support for VMware, which you would think would be good, but no… if you want support outside the hours of 6am-6pm M-F, you need platinum support… Nice***  But I digress.

Last Saturday we had taken a snapshot which we had subsequently forgotten about.  When you take a snap in VMware, the system puts the original VMDK (virtual disk file) into a “holding pattern” and begins to write changes to a new virtual disk file, in our case the servername-00002 file.  The best practice, of course, is to take a snapshot, make your changes, and then immediately delete the snapshot, at which point all of the changes will be written back into the original “holding pattern” VMDK and all is well.  Unfortunately, the system doesn’t do anything to remind you that the snap still exists if you forget to do this.  By the time we discovered it, the new 00002 file had grown to 21 gigabytes and had filled up all of the available disk space.  A reminder that snaps are growing like crazy and about to take you out at the knees seems to me like something VMware should implement.
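Until something like that exists, a small script run on a schedule could do the nagging. What follows is only a rough sketch and not anything VMware ships: it assumes the datastore is reachable as an ordinary directory from wherever the script runs, and that snapshot redo files are named along the lines of servername-00002.vmdk or servername-000002-delta.vmdk. The function name, the filename pattern, and the 5 GB threshold are all made up for illustration.

```python
import re
import sys
from pathlib import Path

# Assumption: snapshot redo files look like "servername-00002.vmdk" or
# "servername-000002-delta.vmdk". Adjust the pattern to match your datastore.
SNAPSHOT_PATTERN = re.compile(r"-\d{5,6}(-delta)?\.vmdk$")


def find_large_snapshot_files(datastore_dir, max_bytes):
    """Return (path, size) pairs for snapshot files larger than max_bytes."""
    oversized = []
    for path in Path(datastore_dir).rglob("*.vmdk"):
        if SNAPSHOT_PATTERN.search(path.name):
            size = path.stat().st_size
            if size > max_bytes:
                oversized.append((path, size))
    # Largest first, so the worst offender is at the top of the report.
    return sorted(oversized, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    limit = 5 * 1024 ** 3  # hypothetical 5 GB warning threshold
    for path, size in find_large_snapshot_files(directory, limit):
        print(f"WARNING: {path} is {size / 1024 ** 3:.1f} GB")
```

Pointed at the folder holding the virtual machines, this would have flagged our 00002 file long before it reached 21 gigabytes. Hooking the output up to email is left to taste.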

So our immediate course of action was to keep the server stopped (it wouldn’t run for more than a few minutes anyway before falling over), and get a complete copy of our file system from the SAN as a backup.  After copying nearly 60 gig from the SAN to a different location, it was time to attempt removal of the snapshot.

We went into Virtual Center, opened the snapshot manager under Snapshots, saw the snapshot from last Saturday that we wanted to remove, and promptly removed it.  The task started and then hung at 95% for about fifteen minutes, at which point we received a message that the “Operation Timed Out.”

Now, here is the kicker. You would think that a message like that would be a prompt to try again, but after lots of research it appears that the process has not really timed out at all.  Because the “tracking changes” VMDK is so large, it has lots of data to roll back into the original.  So in reality, the process is still running in the background and you just need to give it time to finish.  In fact, many people have said that reissuing the “remove snapshot” command will in fact kill your data.  Not good, VMware.
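If you would rather have a script do the waiting than sit there refreshing a folder listing, something like the following could work. Again, this is a sketch under assumptions: it treats the disappearance of the redo file as the sign that the rollback has finished (which is what we saw in our case), it assumes the VM's folder is visible as a normal directory, and the function name and default timings are invented for the example. It only watches; it never reissues the remove command.

```python
import time
from pathlib import Path


def wait_for_consolidation(vm_dir, redo_name, poll_seconds=60,
                           timeout_seconds=4 * 3600, sleep=time.sleep):
    """Poll until the redo file is gone.

    Returns True once the file no longer exists, or False if it is still
    there after roughly timeout_seconds of waiting.
    """
    redo_path = Path(vm_dir) / redo_name
    waited = 0
    while redo_path.exists():
        if waited >= timeout_seconds:
            return False
        # The sleep function is a parameter only so it can be swapped out
        # when trying the function without really waiting.
        sleep(poll_seconds)
        waited += poll_seconds
    return True
```

A call such as wait_for_consolidation("/path/to/vm/folder", "servername-00002.vmdk") would check once a minute and give up after four hours, where both the path and the timings are placeholders to adjust.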

Fortunately, we found this information out before trying to remove the snap again.  Sure enough, two hours later, the process finished running and we were back to our original VMDK file.  The server started up with plenty of storage available again and all is well.

Like I said earlier, I absolutely love VMware, but I will not be using the snapshot capability in the future.  I think I will stick with EqualLogic snapshots, which seem to be faster AND safer.

I’d love to hear your comments on what we did right/wrong and how VMware has worked for you.