Thursday
Apr 24, 2008

DELL: It's not a bug, it's a feature...

Sigh. This kind of stuff is really annoying. I'm in the process of building up a storage system using some of the latest kit from DELL and just ran into some very interesting and annoying problems.

The setup is two DELL R900s coupled to MD1000s with the latest PERC6/E SAS controllers. Our initial benchmarks on the system are really quite impressive. Now I've moved on to the acceptance tests to validate the way the system reacts to various types of failure and how it recovers.

I'm using SANMelody on the systems to form a high availability SAN and I have a set of standard failure tests that I've run on various similar setups using HP and IBM equipment, mostly MSA30 and MSA50 disk bays and various other JBODs. One test is a brutal crash of the server that's acting as the primary storage. The reaction is just as expected: all of my servers fail over gracefully to the second server, even under extremely high IO load from multiple ESX Servers with VMs running IOMeter. The machine comes back up and the two servers agree that they can't trust the data on the crashed system, so the mirrors are cleaned and automatically resynchronised. It takes a while, but even with the IOMeter load hammering the backup server it puts everything back and goes merrily along its way.

My other standard tests (gracefully stopping the SANMelody service, cutting the replication link, etc.) all work as expected. Very smooth and everyone is happy.

Then I get to the power failure of a disk bay and my day goes to hell. Cut the power to the bay, let it sit for a while and watch SANMelody reroute IO to the other server, still no interruption of service. So far so good. I power the MD1000 back up and watch the SANMelody console waiting for the volumes to come online. Wait. Wait. Wait. That's not good. Every other disk system I've tested this on brings the disks back online automatically.

Rescan the disks - still nothing from the disk bay. Open up the Dell Open Manage web console and see that it has identified the disks as being a foreign configuration. That's bad. I reimport the disk configuration from the disks using the Open Manage console and the volumes start coming back online. A couple of things are really not good here. I shouldn't need a manual intervention to bring my disks back. On top of that, Open Manage has created a phantom hot spare in slot 22 of a 15-disk bay. I have no idea what happens to my bay if it tries to rebuild a RAID from this imaginary disk and I don't think that it will be pretty.

Going back to the classic troubleshooting techniques, it's time to reboot and see if things get better. They don't. First off, the controller really isn't convinced that it can use the disk configuration since it hasn't rescanned all of the RAID volumes, so the boot sequence stops and waits for a confirmation to import the foreign configuration again. The system is now busy doing a background initialisation. Well, not really, but the UI really needs to clarify the difference between a background initialisation and a background validation. And my phantom disk is still visible.

Hello? Dell support? (after waiting 25 minutes to get to talk to someone). Explain situation. Response: that's the way it's supposed to work. If the controller sees the disks go offline, it refuses to bring them back online without a manual intervention to import the "foreign" configuration. I still have the ticket open on the phantom disk.

Now perhaps I'm being stupid here, but if it thinks that it's a foreign configuration, wouldn't that mean that it doesn't match with what's in the PERC? Since nothing has changed on the configuration, how could it be different? At the very least it should be able to read the disks' configuration, compare it to the last known configuration of the controller and decide that it can remount the volumes. Older controllers from Dell used to let me set a switch to tell the system how to react, so I could specify to always use the disk configuration, the card's configuration or wait for user input. I'd really, really like that option back.
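To make that wish concrete, here is a minimal sketch of the decision logic I have in mind. It is purely illustrative: the `ImportPolicy` values, the `VolumeConfig` fields and the `decide` function are all names invented for this example and have nothing to do with the actual PERC firmware or Open Manage.

```python
from dataclasses import dataclass
from enum import Enum


class ImportPolicy(Enum):
    """Hypothetical switch, modelled on the option older controllers offered."""
    USE_DISK_CONFIG = "disk"
    USE_CONTROLLER_CONFIG = "controller"
    WAIT_FOR_USER = "ask"


@dataclass(frozen=True)
class VolumeConfig:
    """Simplified stand-in for a RAID volume description (assumed fields)."""
    raid_level: int
    member_slots: tuple
    stripe_kb: int


def decide(on_disks, last_known, policy):
    """Return the action to take when a bay comes back online."""
    # Nothing changed while the bay was off: just remount the volumes.
    if on_disks == last_known:
        return "remount"
    # The configurations really differ, so fall back to the admin's policy.
    if policy is ImportPolicy.USE_DISK_CONFIG:
        return "import-from-disks"
    if policy is ImportPolicy.USE_CONTROLLER_CONFIG:
        return "keep-controller-config"
    return "wait-for-user"
```

With something like this, a simple power cut on the bay would fall into the first branch and the volumes would come back without anyone touching a console; the manual import would only be needed when the two configurations genuinely disagree.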

Now I wouldn't be that upset overall since I can manually import the foreign configuration without restarting the server (so even if it's 3AM, I can VPN in and access the console), but it then requires a background initialisation before it changes the state of the disks from foreign to online. And that hammers the disks and degrades my IO on the bay. On an MD1000 with 15 750GB SATA drives, I'm good for a few days of validation.

I'm beginning to think that there are some serious problems with the current generation of Dell SAS controllers since I have another client that's getting some grief with random loss of the RAID configurations on their ESX boot volumes. It's the standard RAID 1 internal SAS setup (PERC6i) that you see everywhere, and for no apparent reason some of the machines will lose the configuration and stop accessing the drives while the server is running. This plays royal hell with everything since it's a malfunction that does not trigger the ESX HA function: the OS is still alive (albeit on life support) but you can't ask the server to do anything.

Anyone else seeing odd behaviour from DELL SAS controllers?

Note: Yes, everything is using redundant power supplies, connected to separate electric feeds in a battery/generator-backed data center, but sh*t happens so you have to be prepared and know how things are going to react.

Thursday
Apr 24, 2008

iriver's W7 portable media player gets reviewed

Wow - doesn't that look like a Newton MessagePad 2000?

iriver's W7 portable media player gets reviewed

(Via Engadget.)

Tuesday
Apr 22, 2008

Groupware Bad

Groupware Bad: "If you want to do something that's going to change the world, build software that people want to use instead of software that managers want to buy."

Nicely written little article on the perils of developing solutions that nobody wants to use. I think that this is exactly why Apple is seeing a resurgence these days. They're not targeting the buzzword-laden feature lists demanded by IT managers, but are designing applications that appeal to real people.

Monday
Apr 14, 2008

iTunes FUD

I can't believe the sheer amount of misinformation floating around out there concerning iTunes. Yet again I'm showing off OS X to someone who is curious about how I use it and how it works. When the conversation turns to the iPhone, there's an immediate negative reaction regarding the necessity of using iTunes.

There are still people out there who believe that iTunes encodes everything it touches into some proprietary iTunes-only format and doesn't read MP3s. Sigh.

Official announcement: iTunes can play back MP3 files.

iTunes depends on QuickTime for encoding and decoding of audio and video files. There's built-in support for a whole slew of standard formats. For a list of natively supported formats and when they were integrated into QuickTime, see http://www.simnet.is/klipklap/quicktime/. On top of this list you can install Perian, which adds a series of codecs to QuickTime to give iTunes support for the following additional formats:

  • AVI, FLV, and MKV file formats
  • MS-MPEG4 v1 & v2, DivX, 3ivX, H.264, FLV1, FSV1, VP6, H263I, VP3, HuffYUV, FFVHuff, MPEG1 & MPEG2 Video, Fraps, Windows Media Audio v1 & v2, Flash ADPCM, Xiph Vorbis (in Matroska), MPEG Layer II Audio
  • AVI support for: AAC, AC3 Audio, H.264, MPEG4, and VBR MP3

If you're using iTunes to encode or rip your CD collection, you have the following choices:

  • AAC
  • AIFF
  • Apple Lossless
  • MP3
  • WAV

The only incompatibilities that you might run into are when you are dealing with DRM-protected files purchased from the iTunes Store. It's worth noting that a portion of the music sold on the iTunes Store is available without DRM and as such is completely portable. But even with iTunes DRM-protected music, you have the right to burn a copy to CD which you can play in any regular CD player, or re-import into your iTunes (or other) library in the format that you prefer.

Thursday
Apr 10, 2008

ESXi 3.5

Just in the middle of a VMware presentation where someone has finally explained clearly where ESXi fits in the grand scheme of things.

I understood that in larger environments, the ability to drop in new hardware resources without requiring your technicians to deploy the OS on local disks (although that is pretty easy to automate) was supposed to be the primary appeal. Going against it is the lack of a Service Console, which matters for larger enterprises that have fairly well evolved supervision and management toolkits that install in the Service Console.

Until today, nobody was able to explain exactly what you actually got with ESXi. I was completely lost with the questions of price and licensing and now I think I get it. The option for having ESXi 3.5 included with your new server will come either free or for a nominal cost. Start thinking about ESXi as the free replacement for VMware Server or Virtual Server. You get rid of the OS and for free (or close to it) you can do server consolidation on local storage. It remains to be seen if there are technical locks that block you from using remote storage or if that's done on the honor system. You manage the server directly using the VIClient, which is a big step up from the web interfaces for the current generation of free servers.

Each ESXi server in this mode is an autonomous server, but you can take advantage of all the really cool bits of the VI3 toolkit by buying Virtual Center and the associated ESX licences based on what you plan to use it for. So you get a real free hypervisor (which is intended to push back the idea of Xen as a free solution) and a clearly defined upgrade path.

Side note - I like the new naming conventions. It's a lot clearer that Server denotes applications installed on a host OS, ESX includes the Service Console and ESXi is embedded. Although wouldn't "ESXe" have made more sense?