
Friday
Aug 12, 2016

Thunderbolt: a fast and cheap SAN

The lack of uptake of Thunderbolt as a storage transport by the major vendors has baffled me for a long time. For large systems there are a number of hard limitations that make it less than ideal, but in the smaller SAN/NAS space it seems like it would be a perfect replacement for SAS connected disk trays.

This may be changing now that we are seeing cluster interconnects come to market that leverage the fact that Thunderbolt is basically a PCIe bus extension, and PCIe switches are starting to appear from players like IDT, Avago and Broadcom.

But getting back to practical applications: I started having a need for a portable SAN/NAS box to help with some client projects involving datacenter or storage migrations where they needed a little extra swing space. With the current state of the art there are quite a few Thunderbolt based storage systems that are well adapted to what I had in mind. The first issue I ran into is that while Thunderbolt 3 on USB-C connectors is starting to appear in newer laptops, NUCs and MicroPCs, they are almost always single port setups, and I wanted to be able to dedicate one port to storage and another port to the network interconnect. Which led me back to what appears to be the only small form factor dual Thunderbolt machine on the market: the Mac Mini. This means that I’m “stuck” with Thunderbolt 2 over the Mini DisplayPort connections, but hey, it’s only 20 Gbps, so for a mobile NAS this ought to be OK.

What pushed this over the line from “something I should do some day” to an actual project was the discovery of a relatively new line of disk bays from Akitio that have four 2.5” slots, accept the thicker high density drives and are daisy chainable with two Thunderbolt 2 ports.

So with this in mind, my bill of materials looks like this:

And I completed this with a set of four 1 Tb drives that I had kicking around.

With all of this in hand, I started off with FreeNAS, but for some reason I couldn’t get it to install with a native EFI boot on the Mac Mini, so I ended up tweaking the configuration using the rEFInd Boot Manager to get FreeNAS running.

The basic install and configuration worked just fine, but for some reason I could never track down, the system would freeze once I started putting some load on it, whether from local stress tests or from copying data from another system. About this time I noticed that the latest Ubuntu 16.04 release includes ZFS natively, so I went to give that a spin.

First off, Ubuntu installs natively and boots via EFI without a hitch, which simplified the setup a little bit. Then it was just a matter of installing the ZFS utilities (apt-get install zfsutils-linux) and setting up an SSD pool and a couple of disk pools.
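
For reference, a minimal sketch of what that looks like on a stock Ubuntu 16.04 box; the pool names and device paths below are placeholders for illustration rather than my actual layout, and for anything permanent you’d want to use the /dev/disk/by-id/ paths:

    # ZFS tools, straight from the Ubuntu 16.04 repositories
    sudo apt-get install zfsutils-linux

    # One mirrored pool on the SSDs, one on the spinning disks
    # (example devices only; check yours with: ls -l /dev/disk/by-id/)
    sudo zpool create ssdpool mirror /dev/disk/by-id/ata-SSD_A /dev/disk/by-id/ata-SSD_B
    sudo zpool create tank mirror /dev/disk/by-id/ata-HDD_A /dev/disk/by-id/ata-HDD_B
    sudo zpool status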

On my local network I am sadly behind the times and am still living in GbE-land, but my initial load tests of transferring some ZFS file systems (4 Tb of data) using zfs send/recv over netcat worked flawlessly and saturated the GbE link using either the built-in network port or the Sonnet.
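
For the curious, the send/recv over netcat pattern is nothing more exotic than the following; the hostnames, pool and snapshot names are made up for the example, and note that the netcat listen syntax varies between the BSD and traditional variants:

    # On the receiving box: listen on an arbitrary port and feed the stream into zfs recv
    nc -l 3333 | zfs recv -F tank/clientdata

    # On the sending box: take a recursive snapshot, then stream it over the wire
    zfs snapshot -r tank/clientdata@migrate1
    zfs send -R tank/clientdata@migrate1 | nc -w 20 nas-portable 3333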

Physical assembly

Similarly to the Mobile Lab I just built, I wanted to simplify the power cabling and management and limit the box to a single power plug. I did look at including a power bar and using the included adaptors, but that actually takes a lot of space and adds a significant amount of weight. Happily the Akitio boxes take 12V in, so it was just a matter of soldering some connectors onto the 12V lines out of the PSU and running one direct line from the plug over to the Mac Mini.

Then it was off to my metalworking shop to build a case to hold all of this which resulted in the following design:

 

Real life

I’ve got a project going where I’m working with a team that is consolidating a number of data centers. At each site we deploy a local staging area equipped with 2 ESXi servers and a 24Tb ZFS based NAS. From there I need to move the data to another data center, and we’re leveraging the ability of ZFS to sync across different systems using snapshots (as discussed here in my auto-replicate scripts).

Given the volume of data involved, I do the initial replication manually using netcat instead of the script that uses ssh, since ssh is CPU bound on a single core, which limits the potential throughput. Using this method I was getting sustained network throughput of 500MB/sec. Yes, that’s megabytes, not bits. Peaks were hitting 750MB/sec. All of this through a Mac Mini…

Mobile NAS next to its big brother:

Miscellany

I usually try to design systems to be as quiet as possible, and while there are fans on the Akitio boxes, they are low RPM and make hardly any noise. Using the included power adapters it’s actually very, very quiet. In the final mobile configuration, the only thing that makes any noise is the fan on the PSU. So this setup could very well be used as a home NAS without having to hide it in the garage. If I were using this as a design spec for a home NAS, though, I’d probably start with an Intel NUC or Gigabyte BRIX with Thunderbolt 3 running FreeNAS, for the simplicity of management and easy access to protocols other than NFS.

While it’s certainly easiest to do all of this over Ethernet, I can also extend the setup to be able to handle Fibre Channel with something like the Promise SANLink2 and the Linux FC Target software.
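
I haven’t built the FC variant yet, so treat this as a rough sketch of what the LIO side could look like with targetcli, assuming a zvol as the backing store and an HBA that the qla2xxx fabric module actually supports (something that remains to be verified for the SANLink2); all the WWPNs are placeholders:

    # Carve out a zvol and expose it as a block backstore
    zfs create -V 500G tank/fc-lun0
    targetcli /backstores/block create name=fc-lun0 dev=/dev/zvol/tank/fc-lun0

    # Create the FC target on the local HBA port, map the LUN and allow the initiator
    # (both WWPNs below are placeholders)
    targetcli /qla2xxx create naa.2100001b32aaaaaa
    targetcli /qla2xxx/naa.2100001b32aaaaaa/luns create /backstores/block/fc-lun0
    targetcli /qla2xxx/naa.2100001b32aaaaaa/acls create naa.2100001b32bbbbbb
    targetcli saveconfig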

Thunderbolt 2 on the Mac Mini is extensible to up to six external devices on each chain, so I could theoretically add two more Akitio boxes on the storage chain, and five more if I wanted to share the Thunderbolt connection I’m using for the network.

Friday
Aug 12, 2016

In a pinch...

A fun story for the sysadmins out there about an ugly situation that got fixed relatively easily. Recently, I ran into a client datacenter running a FreeNAS system where the USB key holding the OS had died. All of the important file services, notably the NFS service to a couple of ESXi servers, were still running, but anything that touched the OS was dead. So no web console, and no SSH connections.

In my usual carry bag, I have my MacBook Pro, a Thunderbolt to GbE adaptor and a Samsung 1Tb T2 USB 3 flash drive, formatted in ZFS. And of course some spare USB keys.

So first up, using VMware Fusion, I installed the latest version of FreeNAS on a spare key in case the original was a complete loss. How to do this? Well, you can’t boot a BIOS based VM off a USB key, but you can boot from an ISO and then connect the USB key as a destination for the install. So now I have something to run the server on later.

Then the question is, how to swap this out without taking down the production machines that are running on the ESXi servers? For this I created a new Ubuntu VM and installed ZFS on Linux plus the NFS kernel server. Now that I had an environment with native USB 3 support and automatic NFS publishing via the ZFS “sharenfs” attribute, I connected the Samsung T2 to the VM and imported the zpool. I couldn’t use FreeNAS in this case since its support for USB 3 is not great.

Then there was a quick space calculation to see if I could squeeze the running production machines into the free space. I had to blow away some temporary test machines and some older ISO images to be sure I was OK. Then it was a matter of creating a new file system with the ever so simple “zfs create t2ssd/panic” followed by “zfs set sharenfs=on” and opening up all of the rights on the new filesystem. Oh, and “zfs set compression=lz4” wasn’t necessary since it was already on by default on the pool.
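
Condensed, the Linux side of the rescue boils down to a handful of commands. This is a from-memory sketch assuming the pool on the T2 is named t2ssd as above:

    # ZFS on Linux plus the NFS server that the sharenfs attribute relies on
    sudo apt-get install zfsutils-linux nfs-kernel-server

    # Pull in the pool from the USB SSD and see what's on it
    sudo zpool import t2ssd
    sudo zfs list

    # New filesystem, published over NFS, rights wide open for the ESXi hosts
    sudo zfs create t2ssd/panic
    sudo zfs set sharenfs=on t2ssd/panic
    sudo chmod 777 /t2ssd/panic

    # Compression was already inherited from the pool, so nothing to do here
    sudo zfs get compression t2ssd/panic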

Then it was just a matter of mounting the NFS share on the ESXi servers and launching a pile of svMotion operations to move them over to the VM on my portable computer on a USB drive. Despite the completely non-enterprisey nature of this kludge, I was saturating the GbE link (the production system runs on 10GbE - thank god for 10GBase-T and Ethernet backwards compatibility).
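
Mounting the share on the hosts can be done through the vSphere client or in one line per host from the ESXi shell; the IP address and datastore name here are invented for the example:

    # Mount the emergency NFS export as a datastore named "panic"
    esxcli storage nfs add --host=192.168.1.50 --share=/t2ssd/panic --volume-name=panic

    # Confirm that it's mounted
    esxcli storage nfs list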

Copying took a while, but after a few hours I had all of the running production machines transferred over and running happily on a VM on my portable computer on a USB Drive.

Then it was just a matter of rebooting the server off of the new USB key, importing the pool and setting up the appropriate IP addresses and sharing out the volumes. Once the shares came back online, they were immediately visible to the ESXi servers.

Then I left the MacBook in the rack overnight while the svMotion operations copied all the VMs back to the points of origin.

Best part: nobody noticed.

Tuesday
Mar 29, 2016

My new mobile lab

Mobile lab

I’ve always tried to have a decent home lab setup and have been happily using various mixes of HP Microservers, Intel NUCs, Mac Minis and various white box systems. Of course, each of these systems has its particularities and limitations, notably in terms of memory, number of NICs, size, power, etc.

But a few things keep happening: quite often the home lab ends up running bits and pieces of my home “production” network, so I’m finally going all in on a new lab setup. I spend a lot of time working with companies on high availability designs, including multi-site setups, and I’m looking more and more into the new generation of scale-out storage systems, so I needed to go a little larger than the usual 3-4 NUCs.

I’ve also been running into issues when giving courses at various engineering schools, where I’d really like to have my own mobile infrastructure where I control the entire stack and can mock up a fully configured infrastructure. This is of particular importance with schools where the equipment can be older or constrained in frustrating ways.

I’ve been tempted for a while by the beautifully designed solutions from Tranquil PC, especially the original Orange Box, which is sadly hitting end of life, to be replaced by the new V4N Cluster. They’re both lovely, but you pay for that quality engineering, and they didn’t quite fit a few of my more exotic requirements.

A recent article by Steve Atwood tipped me off to a new sales channel for various types of Mini PCs that come much closer to my ideal systems than any of the previous options, with the added bonus that they’re very inexpensive for what you get. With this new source in hand, I’m stepping out into building a complete mobile lab setup with 8 lab servers & one deployment box. My first planned lab is setting up a simulated two site environment with 3 bare metal ScaleIO nodes per site, feeding two ESXi servers. This pretty much defined the minimum requirements in terms of the number of machines. This design also drove the choice to go with two separate switches, so I can do mean things like shut off all connectivity to a site and see the results, or just cut the inter site connection and so on. This should give me much more insight into the various potential failure modes.

Shopping list

So the basic shopping list is for:

Add ons:

Equipment arrived

Mini PC Configuration

The basic configuration is 16Gb of RAM (still waiting for Skylake to push this to 32, but for lab purposes it’s fine, and less expensive than 16Gb DIMMs), a 128Gb mSATA SSD and a 2.5” 7200 RPM 500Gb spinning disk. There is still an available mSATA slot for another short card if I need it later. The 2.5” drives didn’t come with the PCs as I ordered them, so I installed those myself.

I went with the i5 5200U since it had the fastest baseline frequency of the available options from the particular vendor I chose. I won’t be needing much in the way of multithreading or powerful graphics in the lab (sorry to my colleagues who want to borrow it for VDI testing).

The model I selected has dual Ethernet NICs, so I can properly set up redundant connections and load balancing as required. Unfortunately they are Realtek cards so I’m going to have to do some tweaking to the ESXi image I use since they are unsupported by VMware.

Preparation

The idea of this project is to have a lab that can quickly be repurposed for different types of environments, so the first steps involve working out the details of the master auto deployment server, tweaking the images, doing some basic burn-in on the servers and getting all of the BIOS settings just right.

So the theory (that I have, which is mine) is that one server will be running ESXi, with the necessary VMs to make this all work. First up will be a pfSense instance to manage the networking, so I can plug into the local network and give the various boxes internet access for stuff like downloading packages and so on.

Burn in and stuff

Here’s where I ran into my first set of roadblocks. I had imagined at the beginning that I’d be doing PXE installs to USB keys and also building some custom USB keys by hand for various configurations. I had not accounted for the strangeness of the BIOS settings around UEFI and the tweaks necessary to get this going reliably.

I started by booting a USB key with the installer for Ubuntu 15, which booted just fine. I installed onto a SanDisk USB key and the install process went just fine. At this point, however, no combination of configuration options would let me boot from this key. Thinking ahead, I realized that this was going to be a right royal PITA if I had to find workarounds for every install configuration.

After a quick tour of the market, it turns out that there are 32Gb mSATA SSDs available for not much more than a 32Gb USB key, and there are two slots in the boxes, so I ordered a batch of Transcend 32Gb mSATA SSDs to give me a stable environment that will be visible as /dev/sda or C: for my OS installs.

Why would I want to do this? The main reason is that there are a number of products that I want to test out that will leverage an SSD device for data tiering or caching, but generally they want to use a dedicated physical device for this role (can you say VSAN?), so I wanted a dedicated boot volume, a reasonably sized dedicated SSD and a spinning drive for the bulk storage back-end.

In the meantime, while waiting for that order to arrive, I went ahead and started installing Ubuntu onto the existing internal SSDs and started firing up mprime to push the boxes and ensure that they are all in good shape and to get a real world idea of just how much power they will be drawing under maximum stress.
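
The burn-in itself is nothing fancy. Roughly, on each box it looks like the lines below, using the Linux mprime tarball from mersenne.org/download:

    # Torture test: pegs every core until you stop it
    ./mprime -t

    # In another terminal, keep an eye on the clocks to spot thermal throttling
    watch -n 5 "grep MHz /proc/cpuinfo"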

My general impressions of the machines are that they are larger than I had imagined, as I had for some reason pictured them being closer in size to a Mac Mini. But they are solid and well built and with the passive cooling they are pretty heavy.

Results

I’m using the Elgato Eve HomeKit plugs for measuring power consumption during these tests and for the machines that don’t yet have the internal hard drives installed, they were peaking briefly at 23W until the thermal protection kicked in and pulled it back down to 22W.

From a performance standpoint, the processors were able to maintain a slight Turbo effect and were running continually at about 2.4 GHz instead of their rated 2.2 GHz. This worked fine with the servers mounted vertically, which is close to the planned configuration. For fun, I did one run of 8 hours with the servers stacked on top of each other. In this setup, the speed dropped significantly (down to 1.8 GHz) as the bottom box just couldn’t get rid of heat fast enough, but the thermal regulation kicked in as expected and despite the slowdown, nothing ever crashed or misbehaved.

The other useful thing I noted at this stage is that the power supplies output 12V. This means that I can clean up the physical installation a lot by getting a single ATX PSU and wiring the DC adaptor cables out of that instead. This will make the whole thing a lot cleaner since the included power bricks are fairly large and come with heavy cables.

Observations

One interesting thing that turns out to work well for me is that unlike many systems, the front facing USB3 ports do allow booting. In my experience, many systems will only boot from USB2 ports. This has allowed me to revisit my initial assumptions about the physical installation in the box: given that I can boot from the front facing USB ports as required, I can run all of the cables underneath and still have easy access to the power button (something that was troubling me) and the USB3 ports. Since I’m going to be using the 32Gb internal SSD, I won’t be needing these ports very often anyway, but it’s nice to know the option is there if I keep a bag of preconfigured USB keys nearby.

The other thing I was able to test: after having some issues with a SanDisk USB key that I was using for the ESXi installation on the master node (configuration changes were not getting saved), I swapped it out for an SD card, which is working much better and is bootable. I may go back and revisit the install configurations on the other nodes using SD cards once I get the whole system up and running, although I still think that for most stuff, an internal 32Gb SSD will be more reliable and perform better.

But I also like the SD card for quick backups. I plug it into my MacBook and a quick:

dd if=/dev/rdisk6 of=masteresxi.backup bs=1m

gives me an image if I need to reflash it or if the SD card dies.
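
And going the other way, if the card dies or I want to clone it onto a spare, it’s the same command reversed, assuming the replacement card also shows up as disk6 (check with diskutil list):

    diskutil unmountDisk /dev/disk6
    sudo dd if=masteresxi.backup of=/dev/rdisk6 bs=1m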

Physical installation

This was a bit of a tough one. I started with the following constraints in mind:

  • I wanted a Pelican case since they are tough and have a better overall weight/size to content ratio than a traditional mobile rack solution.
  • I also wanted a model that was closer to a carry-on bag than a big suitcase.

My workshop is much more oriented towards woodworking than plastic or metalwork, but I’ve got a hacksaw and a cheap drill press, so I figured I had the basics necessary to find ways to mount all of this stuff into a case. Between Amazon, a few specialty stores and the local hardware store I figured I could build something fairly solid.

I spent a lot of time fiddling in 123D Design to see just how I could arrange all of the various components in the smallest reasonably sized Pelican case. My love for symmetry was stymied by the dimensions of the various components and the desire to stick to a rolling case that was closer to carry-on size than a full-on mobile rack. So after many different attempts, I ended up with this as the basic layout:

I also realized that I’ll be voiding the Pelican waterproof warranty since I’ll need to drill a few holes for some of the supports where glue won’t do the job. But the amount of time this box will be out in inclement weather should be relatively short and I’ll try and put appropriate washers on the exposed spots.

After checking out the local hardware stores, I ended up in one of the shops that has a proper metalworking shop attached and does custom work, so I dragged over the box, a few servers and the switches and asked for a design quote.

For a reasonable price, they designed and built a complete setup that is also removable, so if I need to pull it all out and rewire things, I’ll have access to the backs of the servers and cabling should be much easier. Their idea was to build a flat plate supported by rubber-isolated feet to give me space for the cabling, with cutouts for passing cables underneath and bent plates that attach to the VESA mounts.

I goofed on the original design description and he heard that there were 8 machines in total rather than the 8 plus the one master on the side, so I had to go back and get the plate redone, but since he’d already done the bulk of the design work, it was just configuring the laser cutter for another run and adding the screw mounts back on. Here’s the naked original version of the baseplate:

From the top:

Power issues

My original design estimations were way too optimistic on the amount of space that the power plugs and cables were going to take, so I was very happy to discover that the servers all run off 12V (some PCs in this space use 19V input, notably many of the Intel NUCs). With this in mind, I figured that I should be able to convert an ATX PSU to feed all of the servers. I ended up ordering a Corsair VS650 (http://www.corsair.com/en-us/vs-seriestm-vs650-650-watt-power-supply) from Amazon. It’s complete overkill at 600W on the 12V rail (the servers need roughly 28W x 9 = 252W), but it had the following things going for it:

  • it’s not ugly :-)
  • it’s reasonably efficient
  • it’s quiet
  • and most importantly, uses a single 12V rail so cutting out the cables will be a little easier and I don’t have to keep track of which circuit they’re on

A quick tour of the web turned up lots of tutorials on how to take a PSU and rework it for use as a generic 12V power supply, so it was back to the hardware store to order a soldering iron (my plumbing oriented torch is a little overkill for this kind of work). The upshot is that you need to short two wires (the PS_ON pin to ground) so the PSU thinks there’s a motherboard connected and switched on, and since the network switches are always on, there’s a minimum draw as soon as it powers up.

Then I spent ages scouring the web to try and find the right sized DC connectors that fit into the back of the machines.

As it turns out there are two pretty standard designs that are 5.5mm outside and either 2.1 or 2.5mm for the internal post. The Netgear switches take the 5.5/2.1 sized ones that are also used widely for POE cameras so that was no problem. The initial batch of cables I got for the PCs were also of this lineage which turned out to be an issue since the cables were far too thin to support the draw from the PCs. This resulted in my first power-on test being followed by gently smoking plastic about 10 minutes later.

So I looked around further and found a store that supplied just the barrel connectors themselves, and I went back and soldered these directly onto the wires coming out of the PSU.

Note: soldering this kind of barrel connector is a right royal PITA, I highly recommend getting them preinstalled on wires if you can find them. It’s a lot easier.

So after reworking all of the power cabling, I fired it up again and this time, no smoke. With all the machines running but not doing much, the power draw of the entire system is about 130W. With the machines in the box, they are warm, but not hot to the touch. I haven’t yet dared to fire up a full mprime run on all of them simultaneously to see how hot it gets. Since all of the machines are passively cooled, the only noise they make is the spinning hard drives, which are pretty well damped by the heavy chassis. The PSU fan is not a noiseless model, but for practical purposes it’s very quiet.

The only power issue that is still bothering me is that the HDMI switch requires 5.3V and it’s pretty picky about it. I tried driving it from the 5V rail on the PSU, but that didn’t work, so for the moment, I still have the power adaptor for that one hanging around. If someone wants to point me to an electronics kit that can take 12V in and output regulated 5.3V, I’m all ears since it’s the only thing in the case that’s not powered off the PSU.

HDMI Switch

A poor man’s KVM: coupled with a cheap Logitech wireless keyboard and mouse attached to the USB port, I can get into individual machines as required. Generally speaking, I will only be using this when I’m tearing down and rebuilding the environment and need to force-select PXE boot in the BIOS.

Networking

I’m using the same basic structure as I do for many small independent sites, with pfSense as the router and firewall in a VM.

So inside the master ESXi host, I have a single vSwitch with two uplinks to the two switches, using VLAN tagging for all of the declared VLANs except the default VLAN, since these switches require that each port have a primary untagged VLAN. VLAN 100 is reserved for the pfSense WAN interface and is set as the native VLAN on port 14 of each switch for connection to a local network. This lets the internal VMs talk to the outside world and have internet access for downloading images, packages, etc.
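
For the record, that portgroup layout can also be scripted from the ESXi shell; apart from VLAN 100, the names and VLAN IDs below are just examples of the pattern, not my actual numbering:

    # Two uplinks into the same standard vSwitch
    esxcli network vswitch standard uplink add --uplink-name=vmnic0 --vswitch-name=vSwitch0
    esxcli network vswitch standard uplink add --uplink-name=vmnic1 --vswitch-name=vSwitch0

    # One tagged portgroup per declared VLAN (example: an infrastructure VLAN 10)
    esxcli network vswitch standard portgroup add --portgroup-name=Infra --vswitch-name=vSwitch0
    esxcli network vswitch standard portgroup set --portgroup-name=Infra --vlan-id=10

    # The pfSense WAN portgroup is tagged with VLAN 100 on the uplinks; the switches
    # carry it untagged (native) on port 14 towards the local network
    esxcli network vswitch standard portgroup add --portgroup-name=WAN --vswitch-name=vSwitch0
    esxcli network vswitch standard portgroup set --portgroup-name=WAN --vlan-id=100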

All of the declared VLANs are connected to the pfSense VM as local networks to enable routing between them, with the exception of the vMotion VLAN. I would have liked to isolate the storage networks completely as well, but in order to have access to the administrative interfaces I decided to leave them as routed subnets. I might start exploring some of the new USB GbE adaptors for adding administrative interfaces when doing bare metal storage clusters, since there are still a few free ports on the switches.

For user access, I can connect directly to ports 10-13, which carry the internal infrastructure VLAN untagged. This might be an issue for some environments, like when I have 10 students that need to connect at once, so I tried to connect the internal wifi card to the pfSense VM, but the cards are a Broadcom model not supported by FreeBSD, so for the moment that solution won’t work. In the meantime, digging around in the parts bin, I found a Ralink USB key which was being used with an ITUS Wifi Shield (may the project RIP). Attaching that key to the pfSense box gives me the ability to create a Wifi network so that I can connect and manage the environment over Wifi and, most importantly in school or training setups, just let people connect directly via this interface. It’s not great since it’s a tiny antenna and only does 802.11b/g.

I ordered a Mini PCIe Atheros card that comes with connectors for the external antennae and managed to get it installed and mapped to the pfSense machine with VMDirectPath I/O as a second wifi access point. That worked much better in terms of coverage than the little Ralink, but once in the box, surrounded by cables, not so much. I’ve ordered a set of coax extension cables so that I can put the antennae on the top of the case and get them out of their electromagnetic cage.

Switch configuration

Depending on the environment I’m testing, I’m going to need different configurations on the ports assigned to the servers. Netgear does include an option to save switch configurations, so I’ll be setting up the various configurations and storing them on the NAS to be able to quickly swap them in as required.

Final pics

Cables

Before going in the Pelican

Final configuration

So that’s it for part one. Next up will be all of the details around the software and design for managing the lab itself.

Friday
Mar 20, 2015

Can't register vSphere Replication appliance

I ran into an interesting problem the other day when deploying vSphere Replication where the Appliance couldn’t register the service with vCenter. It turns out to be a combination of factors about the network configuration that can produce this problem. The problem is most likely to occur if you are using the vCSA.

As far as I can tell, the sequence of events for registering with vCenter is the following:

  • use the address or IP currently in use for the active Web Client session to contact vCenter
  • request the value of the Runtime settings vCenter Server name
  • contact the vExtension service based on the name returned in the previous step

And that is where the problem comes from. By default, when you install the vCSA, the value stored in the Runtime settings is the short name of the server, not the FQDN. At least this is the case on the 5.x versions; I haven’t yet tested the 6.0 vCSA.

The net result depends on how your network is configured and whether you are using DHCP or not. I was running into the problem and able to reproduce it with the following sequence of actions:

  • Configure DNS correctly with proper forward and reverse entries for the vCSA and the Replication Appliance
  • On a subnet with no DHCP services, deploy the vCSA with a fixed IP address
  • On the same subnet, deploy the vSphere Replication appliance with a fixed IP address

This will fail: when you configure the vSphere Replication appliance with a fixed IP, there’s no place to enter DNS search domains, so name resolution will never work for the short name returned by the vCSA. If you are deploying using DHCP, you will probably be sending search domains to the client, so the resolution will work properly.
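
You can see the failure mode for yourself from the replication appliance’s console; the names here are hypothetical stand-ins for your vCSA:

    # Short name, as returned by the vCSA Runtime settings: fails without a search domain
    ping -c 1 vcsa01

    # The FQDN resolves fine
    ping -c 1 vcsa01.lab.example.com

    # No "search" line here means short names will never resolve
    cat /etc/resolv.conf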

When you go to the VAMI console of the replication appliance and try to manually connect to the vCenter server, you will get the following somewhat misleading error message:

“Unable to obtain SSL certificate: Bad server response; is a LookupService listening on the given address?”

It would have been nice if the message mentioned the address that it was trying to contact which would have highlighted the fact it was looking at a short name.

The workaround is simply to update the Runtime settings vCenter name to the FQDN. It’s also probably a good idea to verify that the FQDN in Advanced settings has the correct value as well.

So if you ever see an appliance that has to register an extension to the vCenter web UI and it isn’t working, checking the value of the Runtime settings vCenter name might be the solution.

Thursday
Aug 14, 2014

Understanding the impact of scale-out storage

Scale-out has the ability to change everything

In the software-only space, solutions like Datacore and Nexenta are really quite good (I have used and deployed both) and I still recommend them for customers that need some of their unique features, but they share a fundamental limitation: they are based on a traditional scale-up architecture model. The result is that there is still a fair bit of manual housekeeping involved in maintaining, migrating and growing the overall environment. Adding and removing underlying storage remains a relatively manual task, and the front end head units remain potential choke points. This is becoming more and more of an issue with the arrival of high performance flash, especially when installed directly on the PCIe bus. The hiccup is that a single PCIe flash card can generate enough IO to saturate a 10GbE uplink and a physical processor, which means you need bigger and bigger head units with more and more processing power.

So the ideal solution is to match the network, processor and storage requirements in individual units that spread the load around instead of all transiting through central potential choke points. We’re seeing a number of true scale-out solutions hitting the market right now that have eliminated many of the technical issues that plagued earlier attempts at scale-out storage.

The secondary issue is that scale-out changes the way you purchase storage over time. The “over time” part is a key factor that keeps getting missed in most analyses of ROI and TCO, since most enterprises evaluating new storage systems do so in the context of their current purchasing and implementation methodology: they have an aging system that needs replacing, so they evaluate the solution as a full-on replacement without truly understanding the long term implications of a modern scale-out system.

So why is this approach different? There are two key factors that come into play:

  • You buy incremental bricks of capacity and performance as you need them
  • Failure and retirement of bricks are perceived identically by the software

To the first point, technological progress makes it clear that if you can put off a purchase you will get a better price/capacity and price/performance ratio than you have today. Traditionally, many storage systems are purchased with enough headroom for the next 3 years, which means you’re buying tomorrow’s storage at today’s prices.

So this gives us the following purchase model:

This is a simplified model based on the cost/Gb of storage, but it applies to all axes involved in storage purchase decisions such as IOPS, rack density, power consumption, storage network connections and so on. Also remember that you might end up with bricks that still cost $x, but have 50% more capacity in the same space. A key feature of properly done scale-out storage is the possibility of heterogeneous bricks, where the software handles optimal placement and distribution for you automatically. For “cold” storage, we’re seeing 3Tb drives down under the $100 mark, while 6Tb drives are now available to the general public. If you filled up your rack with 3Tb drives today, you’d need twice the space and consume twice the power compared to putting off the purchase until the 6Tb drives come down in price. For SSDs, Moore’s Law is working just fine as we see die-shrinks increase storage density and performance on a regular cycle.

In some organisations this can be a problem since they have optimized their IT purchasing processes around big monolithic capital investments, like going to RFP for all capital expenditures, which means that the internal overhead incurred can be counterproductive. But these are often the same organisations that are pushing to outsource everything to cloud services so that storage becomes OpEx; this type of infrastructure investment lives somewhere between the two and needs to be treated as such. Moving straight to the cloud can be a lot more expensive, even when internal soft costs are factored in. Don’t forget that your cloud provider is using the exact same disks and SSDs as you are and needs to charge for their internal management plus a margin.

And on to the upgrade cycle…

The other critical component of scale-out shared-nothing storage is that failure and retirement are perceived as identical situations from a data availability perspective (although they are different from a management perspective). Properly designed scale-out systems like Coho Data, ScaleIO, VSAN, Nutanix, SimpliVity and others guarantee availability of data by balancing and distributing copies of blocks across failure domains. At the simplest level a policy is applied that each block or object must have at least two copies in two separate failure domains, which for general purposes means a brick or a node. You can also be paranoid with some solutions and specify more than two copies.

But back to the retirement issue. Monolithic storage systems basically have to be replaced at least every 5 years since otherwise your support costs will skyrocket. Understandably so, since the vendor has to keep warehouses full of obsolete equipment to replace your aging components. And you’ll be faced with all the work of migrating your data onto a new storage system. Granted, things like Storage vMotion make this considerably less painful than it used to be, but it’s still a big task and other issues tend to crop up: do you have space in your datacenter for two huge storage systems during the migration? Enough power? Are the floors built to take the weight? Enough ports on the storage network?

The key here is that a brick failure in a scale-out system is detected and treated as a violation of the redundancy policy, so all of the remaining bricks will redistribute and rebalance copies of the data to ensure that the 2 or 3 copy policy is respected, without any administrative intervention. When a brick hits the end of its maintainable life, it just gets flagged for retirement, unplugged, unracked and recycled, and the overall storage service just keeps running. This is a nice two-for-one benefit that comes natively as a function of the architecture.

To further simplify things you are dealing with reasonably-sized server shaped bricks that fit into standard server racks, not monolithic full-rack assemblies.

Illustrated, this gives us this:

Again, this is a rather simplistic model, but with constantly growing storage density and performance, you are enabling the storage to scale with the business requirements. If there’s an unexpected new demand, a couple more bricks can be injected into the process. If the demand is static, then you’re only worried about the bricks coming out of maintenance. It starts looking a lot more like OpEx than CapEx.

This approach also ensures that the bricks you are buying use components that are sized together correctly. If you are buying more and faster capacity on high performance PCIe SSDs, you want to ensure that you are buying them with current processors capable of handling the load, and that you can handle the transition from GbE to 10GbE to 40GbE, …

So back to the software question again. Right now, I think that Coho Data and ScaleIO are two of the best standalone scale-out storage products out there (more on hyperconvergence later), but they are coming at this from different business models. ScaleIO is, strangely, the software-only solution from the hardware giant, while Coho Data is the software-bundled-with-hardware solution from part of the team that built the Xen hypervisor. Andy Warfield, Coho Data’s CTO, has stated in many interviews that the original plan was to sell the software, but that they had a really hard time selling this into enterprise storage teams that want a packaged solution.

I love the elegance of the zero configuration Coho Data approach, but wish that I wasn’t buying the software all over again when I replace a unit that hits EOL. This could be mitigated with some kind of trade-in program.

On the other hand, I also love the tunability and BYOHW aspects of ScaleIO, but find it missing the plug and play simplicity and the efficient auto-tiering of Coho Data. But that will come with product maturity.

It’s time to start thinking differently about storage and to reexamine the fundamental questions of how we buy and manage storage.