EqualLogic User Conference – Day Two Recap

I could have titled this post “Why my brain is full”, or “My experiences drinking from a firehose”. That’s how I feel after yesterday. Day two of the EqualLogic user conference was wall-to-wall information overload.
As I have said before, I am usually underwhelmed by conferences. All smoke and no fire. But this was NOT the case here.
We started the day hearing from Dell’s leader in enterprise strategy, followed by sessions on networking, VMware integration, Dell’s HIT kit, ASM for Windows, ASM/VMware edition, MPIO, solid state drives, and many others.
The only complaint I heard from attendees was that the sessions were so back-to-back that we didn’t have time for a deep breath (or brain reboot) before running to the next topic. Crazy good stuff.
A highlight for me was my final session, led by Vernon Miller, covering the top 10 questions that come up in EqualLogic support and how to solve them. In my opinion he’s a great teacher, and the session reminded me of sitting in a college classroom. We covered a lot of ground on topics that might seem basic but that we’re all likely to encounter at some point.
The only area for improvement was the lunch session led by an outside storage consulting firm. Really good, brilliant guys, but the topic of iSCSI trends was too heavy for a lunch session, and it mostly told us what we already knew to be true… that “iSCSI can be fraught with pitfalls if you make bad choices.” But most of us haven’t experienced that BECAUSE we are already on EqualLogic.
To conclude, day two was a big success. Dell has done a great job and continues to confirm what we already knew to be true: that we made the right choice on storage. My brain is still full, and we still have another day to go. Time to get off this DART train and get educated. Wish me luck.

Why I Love our EqualLogic SAN

Last night the IT team at Watermark completed the addition of a new EqualLogic array into our existing SAN solution.  There are many things that I love about EqualLogic, but first and foremost is the simplicity.  Here at Watermark, we really aren’t an IT shop.  We have two full time IT people (myself included) and a part time helpdesk person.  We are currently supporting a ton of users, a large campus, Windows servers, Mac OS, Web Development, VMware ESX, etc.  The last thing I need to worry about is allocating hours of time to support our storage solution.  That’s where EqualLogic has been a huge win for us.


We purchased our first EqualLogic array one year ago from VR6 Systems (whom I HIGHLY recommend you talk to if you are looking to purchase). It is a sixteen-drive, 8-terabyte solution meant to provide the storage for our VMware ESX implementation project. Since then, we have moved our Exchange 2007, SharePoint, file, print, and SQL servers, a domain controller, and more over to VMware, with the datastores on the SAN. We’ve experienced great performance with very few issues. A perfect result for a small-to-midsize IT team.

We have also seen our data needs grow tremendously over the past year. We began to look at cloud storage with Amazon S3 and others for some of these growing needs, but at the end of the analysis, the time spent uploading these files, storing them, and downloading them again for reuse was going to cost more over three years than a new EQL storage solution, even before considering the speed benefits of local storage.
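The three-year comparison above is, at heart, simple arithmetic. Here is a minimal sketch of how such a comparison could be structured. It is hypothetical and not the analysis we actually ran: every rate is a parameter you would fill in from real quotes, and the cost categories (per-GB storage, per-GB transfer, staff time spent shuttling files, array purchase plus support) are assumptions about what that kind of analysis typically includes.

```python
from dataclasses import dataclass


@dataclass
class CloudCosts:
    # All rates are hypothetical inputs; plug in current quotes.
    storage_per_gb_month: float   # $ per GB stored per month
    upload_per_gb: float          # $ per GB transferred in
    download_per_gb: float        # $ per GB transferred out
    staff_hours_per_month: float  # time spent moving files up and down
    hourly_rate: float            # loaded cost of that staff time


def cloud_total(c: CloudCosts, stored_gb: float, uploaded_gb_month: float,
                downloaded_gb_month: float, months: int = 36) -> float:
    """Total cloud cost over `months`, assuming flat monthly volumes."""
    monthly = (c.storage_per_gb_month * stored_gb
               + c.upload_per_gb * uploaded_gb_month
               + c.download_per_gb * downloaded_gb_month
               + c.staff_hours_per_month * c.hourly_rate)
    return monthly * months


def local_total(purchase_price: float, support_per_year: float,
                years: int = 3) -> float:
    """Up-front array purchase plus a yearly support contract."""
    return purchase_price + support_per_year * years
```

Because this sketch holds monthly volumes flat, it understates the cloud side for a dataset that keeps growing the way ours has.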

Last night we upgraded the firmware on both the new array and the existing one to the latest version, 4.1.4. There were no major issues on either, and it took very little time. At that point we ran the wizard provided by EqualLogic, which found the new uninitialized array. Maybe four questions later, the new array was added to the existing group. Just… Like… That… It really is that easy. We added the new 16TB of storage to the existing pool, and that space was available right away. Within minutes, the (now one large) array was moving data over to the new spindles. Over the next week, it will automatically spread the data across the right number of disks to give us the best performance. We also noticed last night that one of the iSCSI connections from VMware had automatically started using the NICs on the new controller, so once again we are starting to see the benefit of adding more gigabit connections to our data.

I love that we get to enjoy working with such great technology, especially when doing it for ministry.  It’s a blessing.

Snapshots in VMware and how to survive them.

We survived a rather scary day with one of our main file servers on Sunday. This server was one that we had virtualized only several weeks before, and it contains a large amount of critical data. I’m a huge fan of VMware and their products have transformed the way we do IT at Watermark, but yesterday was not fun.

Sunday morning I received a call that the server wasn’t responding, and on further review I noticed that the server’s datastore was completely out of space. The server would start for a few minutes, but then fail with “There is no more space in the redo log for servername-00002. You may be able to continue by freeing disk space on the relevant partition.” This was the beginning of our lesson on VMware snapshots. ***Side note: we have Gold Support for VMware, which you would think would be good, but no… if you want support outside the hours of 6am-6pm M-F, you need Platinum support… Nice*** But I digress.

Last Saturday we had taken a snapshot, which we subsequently forgot about. When you take a snapshot in VMware, the system puts the original VMDK (virtual disk file) into a “holding pattern” and begins writing changes to a new virtual disk file, in our case the servername-00002 file. The best practice, of course, is to take a snapshot, make your changes, and then promptly delete the snapshot, at which point all of the changes are written back into the original “holding pattern” VMDK and all is well. Unfortunately, the system does nothing to remind you that the snapshot still exists if you forget to do this. By the time we discovered it, the 00002 file had grown to 21 gigabytes and had filled all of the available disk space. This seems to me like something VMware should implement: a reminder that snapshots are growing like crazy and are about to take you out at the knees.
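Until VMware adds that reminder, a scheduled script can act as one. Below is a minimal sketch, assuming the datastore is reachable as a mounted directory and that snapshot change files are named like `<vm>-000001-delta.vmdk` (our error message shows a shortened `servername-00002`, so check the naming on your own datastore and adjust the pattern). It lists every delta file, largest first, and flags any that exceed a size threshold. The `warn_gb` default is an arbitrary illustration.

```python
import os
import re

# Assumption: snapshot change files sit next to the base disk and are named
# "<vm>-NNNNNN-delta.vmdk". Adjust this pattern to match your datastore.
DELTA_PATTERN = re.compile(r"-\d{6}-delta\.vmdk$")


def find_snapshot_deltas(root):
    """Return (path, size_in_bytes) for every snapshot delta file under root,
    sorted largest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if DELTA_PATTERN.search(name):
                full = os.path.join(dirpath, name)
                found.append((full, os.path.getsize(full)))
    return sorted(found, key=lambda item: item[1], reverse=True)


def report(root, warn_gb=5.0):
    """Print every delta file found; flag the ones larger than warn_gb."""
    for path, size in find_snapshot_deltas(root):
        size_gb = size / 1024 ** 3
        flag = "  <-- large snapshot, delete or commit it" if size_gb > warn_gb else ""
        print(f"{size_gb:8.2f} GB  {path}{flag}")
```

Size is used instead of file age on purpose: a delta file's modification time updates with every write, so a forgotten but busy snapshot would always look "new".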

So our immediate course of action was to keep the server stopped (it wouldn’t run for more than a few minutes anyway before falling over), and get a complete copy of our file system from the SAN as a backup.  After copying nearly 60 gig from the SAN to a different location, it was time to attempt removal of the snapshot.

We went into VirtualCenter, opened Snapshot Manager under Snapshots, found the snapshot from last Saturday that we wanted to remove, and removed it. The task started and then hung at 95% for about fifteen minutes, at which point we received the message “Operation Timed Out.”

Now, here is the kicker. You would think a message like that is a prompt to try again, but after lots of research it appears that the process has not really timed out at all. Because the “tracking changes” VMDK is so large, there is a lot of data to roll back into the original. In reality, the process is still running in the background, and you just need to give it time to finish. In fact, many people have said that reissuing the “remove snapshot” command can kill your data. Not good, VMware.
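The takeaway here is to keep watching the background task rather than reissue the command. Below is a generic sketch of that pattern. `get_state` is a hypothetical callable standing in for however you read the task's status (the task list in the client, or an API), and the state names "running", "success" and "error" are assumptions rather than VMware's actual values.

```python
import time


def wait_for_task(get_state, poll_seconds=30.0, max_wait_seconds=4 * 3600,
                  sleep=time.sleep):
    """Poll a long-running task until it finishes instead of re-submitting it.

    get_state is a hypothetical callable returning "running", "success" or
    "error". A client-side timeout only means *we* stopped waiting; the task
    itself may still be merging the snapshot, so this never re-issues it.
    """
    waited = 0.0
    while True:
        state = get_state()
        if state in ("success", "error"):
            return state
        if waited >= max_wait_seconds:
            # Stop watching, but do NOT retry the removal.
            return "still-running"
        sleep(poll_seconds)
        waited += poll_seconds
```

The `sleep` parameter exists only so the loop can be tested without real waiting. The important design choice is that running out of patience returns "still-running" instead of triggering a second removal.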

Fortunately, we found this out before trying to remove the snapshot again. Sure enough, two hours later, the process finished and we were back to our original VMDK file. The server started up with plenty of storage available again, and all is well.

Like I said earlier, I absolutely love VMware, but I will not be using the snapshot capability in the future.  I think I will stick with EqualLogic snapshots, which seem to be faster AND safer.

I’d love to hear your comments on what we did right/wrong and how VMware has worked for you.