Entries in storage (10)


An interesting dual-site ScaleIO Configuration (probably unsupported)

ScaleIO is a member of the new class of scale-out storage systems that lets you scale out your storage by adding nodes, either in a hyperconverged configuration with VMs installed in your hypervisors or as a bare-metal storage cluster.

I have been a fan of this type of architecture since it gets rid of many of the limitations of the traditional scale-up SANs and offers (potentially) a new degree of portability and finally the end of the fork-lift upgrade cycle.

However, with the latest version of ScaleIO there are some odd design choices that can be problematic in smaller and mid-sized environments. Specifically, it now enforces a minimum of three fault sets (should you decide to use them). A fault set is a group of nodes that are likely to fail together because of some common dependency, generally power to a rack. For data protection reasons, whenever a block is written, a second copy of the block is written to another node in the cluster. Adding fault sets to the mix forces this second copy to go to a node outside the fault set where the original block was written, to ensure availability.
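To make the placement rule concrete, here is a minimal Python sketch. It is not ScaleIO's actual placement algorithm (which I have not seen); the function name, the random selection and the node names are assumptions made purely for illustration. The only property it demonstrates is the one described above: the two copies of a block never land in the same fault set.

```python
import random

def place_block(nodes_by_fault_set, rng=random):
    """Pick (primary, mirror) nodes so the two copies sit in different fault sets."""
    if len(nodes_by_fault_set) < 2:
        raise ValueError("need at least two fault sets to separate the copies")
    # Choose the fault set and node that receive the first copy.
    primary_fs = rng.choice(sorted(nodes_by_fault_set))
    primary = rng.choice(nodes_by_fault_set[primary_fs])
    # The second copy may go to any node outside the primary's fault set.
    candidates = [
        node
        for fs, members in nodes_by_fault_set.items()
        if fs != primary_fs
        for node in members
    ]
    mirror = rng.choice(candidates)
    return primary, mirror

# Hypothetical layout: two fault sets of three nodes plus a single-node third set.
cluster = {
    "fs1": ["sds1", "sds2", "sds3"],
    "fs2": ["sds4", "sds5", "sds6"],
    "fs3": ["sds7"],
}

print(place_block(cluster))
```

With only two fault sets in the dictionary, the mirror always lands in "the other room", which is exactly the dual-site behaviour I am after.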

The problem with ScaleIO’s new enforcement of the three fault set model is that you can no longer easily build a dual-room configuration for availability, which is pretty much the standard design for highly available setups in small and medium-sized environments (and even in quite a number of large ones). With this limitation in mind, and knowing a bit about the way the data paths and metadata are placed in ScaleIO, I decided to see if this really was a hard limitation or if there was a way to work around it to build a more traditional dual-site configuration with the 2.0 release.

Cluster configuration

In order to ensure a minimum level of viability when one site is offline, I set up a test bed with a cluster of two fault sets of three nodes each. The nodes used here all have three 100 GB disks (yes, these are virtual machines). There is also a third fault set configured with a single node with the minimum of 100 GB of storage assigned to it.

There is a shared L2 network across the entire cluster for storage services so this would be similar to having a stretched VLAN across two rooms.

On the MDM side of things, I used the 5 node cluster configuration with the primary MDM in one fault set and the standby in the second fault set.

These are attached to a three-node vSphere cluster to generate load and test connectivity with a half-dozen Linux VMs.


Once all of the ScaleIO nodes are online, I can use the CLI or the vSphere plugin to create and map volumes from the cluster to the SDCs on the ESXi hosts. Here there is no problem. ScaleIO does raise an alert that the fault sets are not balanced, but the only consequence is that data is distributed across the fault sets by percentage used rather than in equal volumes. Otherwise, the cluster is fully operational. At this stage I have all of the VMs running nicely and am running bonnie++ to generate a read and write load across the cluster.

At this point I take the single node of the third fault set offline politely using the delete_service.sh command in /opt/emc/scaleio/sds/bin.

This has the expected result of activating a rebuild operation to properly protect the blocks that were stored on the 100 GB of the third fault set. Since there is a relatively small amount of data involved, this goes fairly quickly.

At this point, the storage is still available and operational to the SDCs and everything is running. However there is one limitation at this point: I cannot modify the structure of the cluster without the third fault set online. That’s to say I can’t create or delete volumes to present to the SDCs. In a steady state operation this is not a big deal since I don’t modify the volumes on a daily basis.

Once the rebalance has finished, I have my desired state: a dual-site setup with data being written across the two fault sets that are online. Now for the “disaster” test. Here I brutally power off all three of the nodes in one of the remaining fault sets and observe the results. The storage is still available to the SDCs and the VMs are still running and generating read/write traffic. So we have a reasonable DR test for a single site failure.

Now for the fail-back: I bring the nodes in the failed fault set back online and the expected rebuild operation kicks off, reestablishing the two fault-set cluster with blocks distributed across the two fault sets.


ScaleIO is an impressively robust and resilient system that allows for things that the designers probably didn’t have in mind. That said, a simple dual-room setup based on two fault sets with a minimum number of nodes per fault set should be part of the standard configuration options given the ubiquity of this type of configuration and to put them on level competitive ground with all of the dual-site HA offerings available from HP, Huawei, Datacore, etc.

And to finish, I would also recommend separating the MDM roles from the SDS on completely different systems, perhaps in VMs pinned to local storage on site, for a clear separation of responsibility. For those getting started with ScaleIO, the fact that the two roles can cohabit on the same servers can lead to some confusion when you’re not yet clear on the dependencies.


Why the reticence to trying scale out storage?

I’ve been running into a few projects where I’ve been working with different companies that are in the process of doing a storage refresh and for some reason I’m seeing some fairly strong push back against considering some of the newer scale-out storage solutions.

I find this rather surprising, since the advantages of a scale-out solution, notably a software layer that can outlive the hardware when it reaches end of life, are so much more attractive than continuing with a traditional storage architecture and its attendant fork-lift upgrades.

In some environments there is a significant sunk cost issue with an existing mature Fibre Channel environment that has to be taken into account. But this can be mitigated by playing to the scale-out architecture’s advantages: start small and grow over time, assuming you’re not going to get killed on maintenance charges on your existing storage systems. The other missing piece that comes into play at some sites is physical servers that are only FC attached and are not natively compatible with a scale-out system, in which case you need some kind of gateway into the storage cluster.

Moving to scale-out means moving to Ethernet, and for storage systems this generally means 10GbE, so there is a non-trivial cost in switch investments. But again, this opens the door to many other potential optimizations where your servers are now simply dual-attached 10GbE and you separate the networks via VLAN, reducing and simplifying the long-term datacenter architecture requirements.

For those already using iSCSI or NFS as their primary storage protocols, the Ethernet storage network is already in place and well segmented so there shouldn’t be any serious issues on that front.

At the end of the day, in a worst case scenario if you’re really not happy with the system, you’ll replace it in 5 years just like you did every other storage system you’ve ever bought. The next replacement may also be scale-out storage using a different software stack in which case you can leverage your commodity servers that are supplying the storage as long as they’re still maintainable. Or you can move back to iSCSI, NFS or SMB.

From this perspective I can only see upside in looking at scale-out solutions.

Try it, you might like it!


The one major pushback point that I find to be pertinent is the question of which vendors amongst the startups will actually be around in 5-10 years. This is definitely a tough question. I really like a lot of the innovative solutions out there (Hedvig, Coho Data, Kaminario etc.) but we don’t yet know if they will be able to survive in the cutthroat storage market in the long term. The usual exit for this kind of technology, being bought by one of the bigger players, is looking less and less likely given that they have pretty much all made a choice in this arena, with the exception of HP, which currently only has their aging LeftHand scale-out solution.

But this still leaves us with the choices from the historic players if you prefer to stick to an existing brand, with solutions like ScaleIO (EMC) and SolidFire (NetApp).

And of course there are the more tightly coupled solutions like VSAN, and the hyperconverged players like SimpliVity, Nutanix and Scale Computing.

So there’s something for everyone in this market if you look around a bit.


Back to backups (yet again)

In the world of information technology, nothing is static or lasts forever, especially best practices. I’ve been pointing out to clients for a while now that backups need to be rethought in terms of the “jobs to be done” philosophy and no longer thought of as “the thing that happens overnight when files are copied to tapes”.

Historically, backups served two purposes:

  • Being able to go back in time and retrieve data that is no longer available
  • Serving as the basis for disaster recovery

Fundamentally, backups should really only serve the first point. We have better tools and mechanisms for handling disaster recovery and business continuity. Which brings me to snapshots. I have always told people that snapshots are not backups even though they respond to the criteria of being able to go back in time.

The hiccup is that snapshots that are dependent on your primary storage system should be considered fragile, in the sense that if your primary storage goes away (disaster), you no longer have access to the data or the snapshots. However, just about every storage system worth its salt today includes the ability to replicate data to another system based on or including the snapshots themselves. This is a core feature of ZFS and one I rely on regularly. Many of the modern scale-out systems also include this type of functionality, some even more advanced than ZFS, like the SimpliVity implementation.

When are snapshots backups?

They become backups once you have replicated them to another independent storage system. This satisfies the two basic criteria of being able to go back in time and being on a separate physical system, so the loss of the primary does not preclude access to the data. They become part of your disaster recovery plan when the second system is physically distant from the primary.

Disk to Disk to Tape

We’ve already seen the traditional backup tools adopt this model to respond to the performance issues around coping with the ever-growing volume of file data: data trickles over to a centralized disk store which is directly connected to tape drives, so that the drives can be fed at full speed. Exploiting snapshot-based replication permits the same structure, but assigns the responsibility for the disk-to-disk portion to the storage system rather than the backup software.

The question I ask in most cases here is whether the volume of data involved justifies the inclusion of tape as a backup medium. According to the LTO consortium, LTO6 storage is as low as 1.3 cents per GB, but this only takes into account the media cost. The most bare-bones LTO drive runs around $2,200, which bumps up the overall cost per GB rather dramatically.

Assuming a configuration where we store 72 TB of data on tape (12 tapes), at the $80 cost per tape cited by the LTO Consortium plus the cost of the drive, this works out to about 4.4 cents/GB ($3,160 for 72,000 GB). At current street prices, the 6 TB WD Red drives run about $270, which converts to 4.5 cents per GB, not taking into account the additional flexibility of disks that permit compression and deduplication. Note that the 6 TB cited for the LTO numbers already includes compression, whereas the 6 TB disk figure is raw capacity before compression and deduplication.
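For anyone who wants to plug in their own prices, the arithmetic above can be reproduced with a few lines of Python. The prices and capacities are the ones quoted in the text; the only assumption I add is decimal units (1 TB = 1,000 GB). The tape result comes out within a tenth of a cent of the figure quoted above.

```python
def cents_per_gb(total_dollars, capacity_gb):
    """Return the cost in US cents per gigabyte."""
    return 100.0 * total_dollars / capacity_gb

# Figures quoted in the text (decimal units assumed: 1 TB = 1000 GB).
TAPE_COUNT = 12
TAPE_PRICE = 80           # dollars per LTO6 cartridge
TAPE_CAPACITY_GB = 6000   # per cartridge, compressed
DRIVE_PRICE = 2200        # bare-bones LTO drive
DISK_PRICE = 270          # 6 TB WD Red
DISK_CAPACITY_GB = 6000   # raw, before compression or deduplication

# Media only, then media plus the drive spread over 12 tapes, then plain disk.
media_only = cents_per_gb(TAPE_PRICE, TAPE_CAPACITY_GB)
tape = cents_per_gb(TAPE_COUNT * TAPE_PRICE + DRIVE_PRICE,
                    TAPE_COUNT * TAPE_CAPACITY_GB)
disk = cents_per_gb(DISK_PRICE, DISK_CAPACITY_GB)

print(f"tape media only: {media_only:.2f} cents/GB")
print(f"tape with drive: {tape:.2f} cents/GB")
print(f"disk:            {disk:.2f} cents/GB")
```

Raising TAPE_COUNT shows how many cartridges you need before the drive cost is amortized enough for tape to pull clearly ahead of disk.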

Tape does have some inherent advantages in certain use cases, particularly long term offline storage, and does cost less to operate on a $/watt basis, but for many small to medium sized environments, the constraints for using it as a primary backup medium (especially when it is also the primary restore medium) are far outweighed by the flexibility, performance and convenience of a disk based system for daily operations.

Operational convenience of disk over tape

Tape is a great medium for dumping a full copy of a dataset, but when compared with the flexibility of a modern disk-based system it falls far behind. A good example that I use is the ability to prune snapshots from a data set to reorganize the space utilisation. In many systems, I use hourly snapshots in order to give users a decent amount of granularity to handle errors and issues during the day. This also means that the unit of replication on a given filesystem is relatively small, permitting me to recover from intersite communications failures without having to resend huge data sets that might have been interrupted. Then on the primary system I prune out the hourly snapshots after a week, leaving one daily instance to be retained for 2 weeks. A similar process is applied to weekly and monthly snapshots.

Where this gets interesting is that I do not have to apply the same policy on the primary and backup storage systems. My backup storage system is designed for capacity and will retain a month of daily snapshots, 8 weekly snapshots and 12 monthly snapshots. Pruning data from a set is impossible to do effectively using tape technology, so tape is used for an archival copy that needs to be retained beyond the yearly cycle.
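As an illustration of this kind of tiered pruning, here is a minimal Python sketch. It is not the script I run in production: the function name and parameters are hypothetical, and keeping the newest snapshot in each day, week or month is an assumption of the sketch. It simply decides which snapshot timestamps a given policy retains; everything else would be destroyed.

```python
from datetime import datetime, timedelta

def snapshots_to_keep(snapshots, now, hourly_days, daily_days,
                      weekly_weeks, monthly_months):
    """Return the set of snapshot timestamps retained by a tiered policy."""
    keep = set()
    seen_days, seen_weeks, seen_months = set(), set(), set()
    # Walk newest-first so the newest snapshot of each bucket is the one kept.
    for ts in sorted(snapshots, reverse=True):
        age = now - ts
        day = ts.date()
        week = tuple(ts.isocalendar()[:2])   # (ISO year, ISO week number)
        month = (ts.year, ts.month)
        if age <= timedelta(days=hourly_days):
            keep.add(ts)                     # recent: keep every snapshot
        if age <= timedelta(days=daily_days) and day not in seen_days:
            keep.add(ts)                     # one per day
        if age <= timedelta(weeks=weekly_weeks) and week not in seen_weeks:
            keep.add(ts)                     # one per week
        # A month is approximated as 31 days for the age cut-off.
        if age <= timedelta(days=31 * monthly_months) and month not in seen_months:
            keep.add(ts)                     # one per month
        seen_days.add(day)
        seen_weeks.add(week)
        seen_months.add(month)
    return keep
```

Calling it with hourly_days=7 and daily_days=14 approximates the primary policy described above, while daily_days=30, weekly_weeks=8 and monthly_months=12 approximates the backup system. The same list of snapshots can be run through both policies independently, which is the whole point.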

Files vs virtual machines

The above-noted approach works equally well for file servers and storage systems hosting virtual machines, especially if we are using a file-based protocol for hosting the VMs rather than a pure block protocol like FC or iSCSI. In the world of virtual machines, backup tools are considerably more intelligent about the initial analysis of the data to be backed up. Traditional file server backup is based on a two-phase process of scanning the contents of the source, matching this against an index of data known to be backed up, and then copying the missing bits. This presents a number of practical issues:

  • the time to scan continues to grow with the number of files
  • copying many individual files is a slower process with more overhead than block-based differentials

By applying the snapshot and replication technique, we can drastically reduce the backup window, since only the blocks modified between two moments in time need to be copied. In fact there is no longer a backup window since these operations are continuous in the background of the file server.
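To show why a block-level differential is so much cheaper than a file scan, here is a small Python sketch. Real storage systems such as ZFS know which blocks changed from their own metadata and never need to compare anything; the hashing below is only a stand-in for that knowledge, and the block size and function names are assumptions for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size in bytes

def block_digests(image, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of a snapshot image (a bytes object)."""
    return [
        hashlib.sha256(image[i:i + block_size]).digest()
        for i in range(0, len(image), block_size)
    ]

def changed_blocks(old, new, block_size=BLOCK_SIZE):
    """Return {block_index: data} for the blocks of `new` that differ from `old`."""
    old_d = block_digests(old, block_size)
    new_d = block_digests(new, block_size)
    delta = {}
    for idx, digest in enumerate(new_d):
        if idx >= len(old_d) or old_d[idx] != digest:
            start = idx * block_size
            delta[idx] = new[start:start + block_size]
    return delta

def apply_delta(replica, delta, new_length, block_size=BLOCK_SIZE):
    """Patch the replica with the changed blocks, resizing it to new_length."""
    buf = bytearray(replica[:new_length].ljust(new_length, b"\0"))
    for idx, data in delta.items():
        start = idx * block_size
        buf[start:start + len(data)] = data
    return bytes(buf)
```

Only the contents of the delta dictionary have to cross the wire between the two sites, no matter how many files live on the filesystem.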

Virtual machines in the VMware world maintain tracking journals of modified blocks (Changed Block Tracking, or CBT), which enables the backup software to ignore the filesystem representation of the data and just ask for the blocks modified since the last backup transaction. But again, if we are transmitting snapshots from the underlying storage system, we don’t even need to do this. It is, however, useful to issue VSS snapshots inside of Windows virtual machines to ensure that any in-flight data in caches is flushed to disk before creating the storage layer snapshot.

The biggest issue with backing up virtual machines is the granularity of the restore operation. With only a simple replication, the result is a virtual machine with no visibility into the contents of its internal file systems. This is where the backup tools show their value in being able to backup a virtual machine at the block level, and yet still permit file level restores by peeking inside the envelope to look at the contents of the file systems therein.

The last mile

There are still issues with certain types of restore operations that require a high level of integration with the applications. If you want to restore a single email out of a backed up Exchange or Notes datastore, you need a more sophisticated level of integration than simply having a copy of the virtual machine.

But for the majority of general-purpose systems, and particularly file services, the replicated snapshot approach is simpler and more effective, both from a cost and an operational perspective.


Understanding the impact of scale-out storage

Scale-out has the ability to change everything

In the software-only space solutions like Datacore and Nexenta are really quite good (I have used and deployed both) and I still recommend them for customers that need some of their unique features, but they share a fundamental limitation in that they are based on a traditional scale-up architecture model. The result is that there is still a fair bit of manual housekeeping involved in maintaining, migrating and growing the overall environment. Adding and removing underlying storage remains a relatively manual task and the front end head units remain potential choke points. This is becoming more and more of an issue with the arrival of high performance flash, especially when installed directly on the PCIe bus. The hiccup is that you can end up in situations where a single PCIe Flash card can generate enough IO to saturate a 10GbE uplink and a physical processor which means you need bigger and bigger head units with more and more processing power.

So the ideal solution is to match the network, processor and storage requirements in individual units that spread the load around instead of all transiting through central potential choke points. We’re seeing a number of true scale-out solutions hitting the market right now that have eliminated many of the technical issues that plagued earlier attempts at scale-out storage.

The second issue is that scale-out changes the way you purchase storage over time. The “over time” part is a key factor that keeps getting missed in most analyses of ROI and TCO, since most enterprises that are evaluating new storage systems are doing so in the context of their current purchasing and implementation methodology: they have an aging system that needs replacing, so they evaluate the solution as a full-on replacement without truly understanding the long-term implications of a modern scale-out system.

So why is this approach different? There are two key factors that come into play:

  • You buy incremental bricks of capacity and performance as you need them
  • Failure and retirement of bricks are perceived identically by the software

To the first point, technological progress makes it clear that if you can put off a purchase you will get a better price/capacity and price/performance ratio than you have today. Traditionally, many storage systems are purchased with enough headroom for the next 3 years, which means you’re buying tomorrow’s storage at today’s prices.

So this gives us the following purchase model:

This is a simplified model based on the cost/GB of storage, but it applies to all axes involved in storage purchase decisions, such as IOPS, rack density, power consumption, storage network connections and so on. Also remember that you might end up with bricks that still cost $x, but have 50% more capacity in the same space. A key feature of properly done scale-out storage is the possibility of heterogeneous bricks where the software handles optimal placement and distribution for you automatically. For “cold” storage, we’re seeing 3 TB drives down under the $100 mark, but 6 TB drives are now available to the general public. If you filled up your rack with 3 TB drives today, you’d need twice the space and consume twice the power compared to putting off the purchase until the 6 TB drives come down in price. For SSDs, Moore’s Law is working just fine as we see die-shrinks increase the storage density and performance on a regular cycle.
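A rough way to put numbers on the incremental purchase argument is sketched below in Python. The capacity requirement, the starting price and the yearly price decline are hypothetical inputs chosen for the example, not market data, and the constant percentage decline is a simplifying assumption.

```python
def upfront_cost(yearly_need_gb, price_per_gb):
    """Buy the capacity for the whole period on day one at today's price."""
    return sum(yearly_need_gb) * price_per_gb

def incremental_cost(yearly_need_gb, price_per_gb, yearly_price_drop):
    """Buy each year's capacity at that year's (lower) price."""
    total = 0.0
    for year, need in enumerate(yearly_need_gb):
        # Price after `year` successive declines of `yearly_price_drop`.
        total += need * price_per_gb * (1 - yearly_price_drop) ** year
    return total

# Hypothetical inputs: 20 TB of new capacity per year for 3 years,
# $0.05/GB today, prices assumed to fall 20% per year.
need = [20_000, 20_000, 20_000]
print("upfront:    ", upfront_cost(need, 0.05))
print("incremental:", incremental_cost(need, 0.05, 0.20))
```

The same two functions work for any other axis (IOPS, rack units, watts) as long as you can express a unit price and a rate of improvement.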

In some organisations this can be a problem, since they have optimized their IT purchasing processes around big monolithic capital investments (going to RFP for every capital investment, for example), so the internal overhead incurred by many small purchases can be counterproductive. These are often the same organisations that are pushing to outsource everything to cloud services so that storage becomes OpEx. This type of infrastructure investment lives somewhere between the two and needs to be treated as such. Moving straight to the cloud can be a lot more expensive, even when internal soft costs are factored in. Don’t forget that your cloud provider is using the exact same disks and SSDs as you are and needs to charge for their internal management plus a margin.

And on to the upgrade cycle…

The other critical component of scale-out shared-nothing storage is that failure and retirement are perceived as identical situations from a data availability perspective (although they are different from a management perspective). Properly designed scale-out systems like Coho Data, ScaleIO, VSAN, Nutanix, SimpliVity and others guarantee availability of data by balancing and distributing copies of blocks across failure domains. At the simplest level a policy is applied that each block or object must have at least two copies in two separate failure domains, which for general purposes means a brick or a node. You can also be paranoid with some solutions and specify more than two copies.

But back to the retirement issue. Monolithic storage systems basically have to be replaced at least every 5 years, since otherwise your support costs will skyrocket. Understandably so, since the vendor has to keep warehouses full of obsolete equipment to replace your aging components. And you’ll be faced with all the work of migrating your data onto a new storage system. Granted, things like Storage vMotion make this considerably less painful than it used to be, but it’s still a big task and other issues tend to crop up, like do you have space in your datacenter for two huge storage systems during the migration? Enough power? Are the floors built to take the weight? Enough ports on the storage network?

The key here is that a brick failure in a scale-out system is detected and treated as a violation of the redundancy policy. So all of the remaining bricks will redistribute/rebalance copies of the data to ensure that the 2 or 3 copy policy is respected without any administrative intervention. When a brick hits the end of its maintainable life, it just gets flagged for retirement, unplugged, unracked and recycled, and the overall storage service just keeps running. This is a nice two-for-one benefit that comes natively as a function of the architecture.
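The failure-equals-retirement idea can be sketched in a few lines of Python. This is a toy model of the behaviour described above, not any vendor's rebuild algorithm: the data structures, the function name and the least-loaded target selection are all assumptions made for illustration.

```python
def rebuild(placement, domain_of, lost_domain, copies=2):
    """Re-protect blocks after a failure domain fails or is retired.

    placement: {block_id: set of node names holding a copy} (modified in place)
    domain_of: {node name: failure domain the node belongs to}
    """
    survivors = sorted(n for n, d in domain_of.items() if d != lost_domain)
    load = {n: 0 for n in survivors}
    # Drop the copies that disappeared and count what each survivor holds.
    for nodes in placement.values():
        nodes.intersection_update(survivors)
        for n in nodes:
            load[n] += 1
    # Any block below the copy policy gets a new copy in an unused domain.
    for block, nodes in placement.items():
        if not nodes:
            raise RuntimeError(f"block {block} has no surviving copy")
        while len(nodes) < copies:
            used = {domain_of[n] for n in nodes}
            candidates = [n for n in survivors if domain_of[n] not in used]
            if not candidates:
                break  # not enough failure domains left to satisfy the policy
            target = min(candidates, key=load.get)  # least-loaded survivor
            nodes.add(target)
            load[target] += 1
    return placement
```

Note that the function does not care why the domain went away. A crashed brick and a brick unplugged for recycling take exactly the same code path, which is the two-for-one benefit mentioned above.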

To further simplify things you are dealing with reasonably-sized server shaped bricks that fit into standard server racks, not monolithic full-rack assemblies.

Illustrated, this gives us this:

Again, this is a rather simplistic model, but with constantly growing storage density and performance, you are enabling the storage to scale with the business requirements. If there’s an unexpected new demand, a couple more bricks can be injected into the process. If the demand is static, then you’re only worried about the bricks coming out of maintenance. It starts looking a lot more like OpEx than CapEx.

This approach also ensures that the bricks you are buying use components that are sized together correctly. If you are buying faster and larger high-performance PCIe SSDs, you want to ensure that you are buying them with current processors capable of handling the load and that you can handle the transition from GbE to 10GbE to 40GbE, …

So back to the software question again. Right now, I think that Coho Data and ScaleIO are two of the best standalone scale-out storage products out there (more on hyperconvergence later), but they are coming at this from different business models. ScaleIO is, strangely, the software-only solution from the hardware giant, while Coho Data is the software-bundled-with-hardware solution from part of the team that built the Xen hypervisor. Andy Warfield, Coho Data’s CTO, has stated in many interviews that the original plan was to sell the software, but that they had a really hard time selling this to enterprise storage teams that want a packaged solution.

I love the elegance of the zero configuration Coho Data approach, but wish that I wasn’t buying the software all over again when I replace a unit when it hits EOL. This could be regulated with some kind of trade-in program.

On the other hand, I also love the tunability and BYOHW aspects of ScaleIO, but find it missing the plug and play simplicity and the efficient auto-tiering of Coho Data. But that will come with product maturity.

It’s time to start thinking differently about storage and reexamining the fundamental questions and how we buy and manage storage.


Understanding the value of software in storage

It’s all about the software

In today’s storage world, the reality is that the actual storage components and the surrounding hardware are all commodity based (with a few exceptions). A storage system is composed of disks, disk cases, communications links, processors, memory and networking.

Fundamentally, the disks are the same ones you can buy from Amazon, NewEgg et al. The only major observable difference is that enterprise storage drives tend to be equipped with SAS or NL-SAS interfaces, which offer a more advanced command set and a more robust architecture permitting dual-path connections, as compared to SATA. NL-SAS drives are SATA drives with a smarter controller interface, but the mechanics are identical.

The disk cases | drawers | enclosures (pick your name) are all based on a standard structure with a SAS backplane that drives slot into and most of them are OEM’d from a very short list of vendors. Historically, these were often connected using Fibre Channel but pretty much everyone has come to terms with the fact that FC is unsustainably expensive for this and even the latest top of the line VMAX has gone over to SAS as the connection to the disk enclosures.

Internally, most proprietary interconnects (think RapidIO) have given way to 40GbE and InfiniBand which, while expensive, are commercially available standard components.

On the processing front, with the exception of HP 3PAR’s custom ASICs, nearly everything else on the market uses standard Intel motherboards with standard Intel processors.

So why are storage systems so expensive? It’s all about the software that adds value to this collection of off the shelf parts in order to make them all work together in a coherent fashion and give you the features over and above just putting bits to disk and maintaining a certain amount of local redundancy.

How much am I paying for this software?

At the simplest level, go over to DELL or Supermicro and spec out a barebones DAS storage system per your requirements, then add in a couple of servers with the number of 10GbE, FC & SAS ports you need. That’s your storage cost. Then get a quote from your storage provider. Ignore the costs assigned by part or by disk; at the end of the day it’s the negotiated package price that matters. The publicly quoted prices are fantasies designed for impressing the purchasing department with huge rebates. I’ve even seen cases where the exact same part number has different list prices depending on which model of storage controller you’re buying. So the only price that matters is the whole package with rebates.

The difference between the two is the software cost that you can now compare to a software-only solution like Nexenta or Datacore.

Now imagine that you are putting that money in the trash after the planned life-cycle of the storage investment, generally 3-5 years. You’ll be buying that software all over again with your next storage acquisition.

The key takeaway here is that the value in storage systems has moved from the actual storage hardware itself to the software. All of the storage components are commodity. IBM, EMC, NetApp et al. do not actually make any of the actual storage components. The disks are bought from Seagate, Western Digital and Toshiba, SSDs from SanDisk, Intel and Samsung, RAID controllers from LSI, Ethernet from Broadcom & Intel, FC from QLogic and Brocade, motherboards from Intel.

You get integration and the software.

Is there a better way?

The optimal approach would be to buy the commodity hardware and run your own software on it. This is the standard approach for companies like Nexenta and Datacore which bring all of the value add features one expects from enterprise storage like replication, snapshots, and so on, although granted through very different internal mechanisms.

Your software is a one-time cost with maintenance over time, but since it’s just software, the maintenance cost doesn’t skyrocket after 5 years. You replace the hardware as it becomes obsolete or your needs change inside the cost effective 5 year maintenance window, leveraging the software’s tools to make the migrations invisible to the servers consuming the storage. Your storage costs are reasonable since you’re only paying for the most basic of components without the markup that accompanies the software integrated into the system.

DELL Compellent has started thinking this way with their new licensing model that applies once you get to a certain size where you can replace the controllers for the cost of the hardware but migrate your existing software licences over, which puts it closer to Nexenta and Datacore from a business model standpoint.

But a lot of IT shops are leery of buying software to take this approach, for a variety of reasons ranging from sales pressure from incumbent vendors (you should see the discounts when they feel threatened) to IT management’s desire to have “one throat to choke” in case anything ever goes wrong.

The other aspect is that while going the software route gives you the ability to choose exactly what you want, this can also be a burden for IT shops that no longer have the in-house expertise to do basic server and storage design. The freedom of choice also brings the responsibility of making the right choices.

So when evaluating storage solutions, try and figure out exactly what you are paying for and understand how much of your investment is tied to the way that you buy it.