

An interesting dual-site ScaleIO Configuration (probably unsupported)

ScaleIO is a member of the new class of scale-out storage systems that let you grow your storage by adding nodes, either in a hyperconverged configuration with VMs installed on your hypervisors or as a bare-metal storage cluster.

I have been a fan of this type of architecture since it gets rid of many of the limitations of traditional scale-up SANs and offers (potentially) a new degree of portability and, finally, the end of the forklift upgrade cycle.

However, the latest version of ScaleIO makes some odd design choices that can be problematic in small and mid-sized environments. Specifically, it now enforces a minimum of three fault sets (should you decide to use them). A fault set is a group of nodes that are likely to fail together due to some common dependency, generally power to a rack. For data protection, whenever a block is written, a second copy of the block is written to another node in the cluster. Adding fault sets to the mix forces this second copy to go to a node outside the fault set where the original block was written, which ensures availability if an entire fault set fails.
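The placement rule can be illustrated with a small Python sketch. This is a toy model and not ScaleIO's actual algorithm: the node names, the `FAULT_SETS` layout, and the `pick_mirror_node` helper are all hypothetical. The only thing it shows is that the second copy is always chosen from a different fault set than the first.

```python
import random

# Hypothetical layout: fault set name -> nodes that share a failure domain.
FAULT_SETS = {
    "rack-a": ["sds-a1", "sds-a2", "sds-a3"],
    "rack-b": ["sds-b1", "sds-b2", "sds-b3"],
    "rack-c": ["sds-c1"],
}


def fault_set_of(node):
    """Return the name of the fault set that contains the given node."""
    for name, nodes in FAULT_SETS.items():
        if node in nodes:
            return name
    raise KeyError(f"unknown node: {node}")


def pick_mirror_node(primary, rng=random):
    """Pick a node for the second copy, outside the primary's fault set."""
    primary_set = fault_set_of(primary)
    candidates = [
        node
        for name, nodes in FAULT_SETS.items()
        if name != primary_set
        for node in nodes
    ]
    return rng.choice(candidates)


if __name__ == "__main__":
    primary = "sds-a2"
    mirror = pick_mirror_node(primary)
    print(primary, "->", mirror, "in", fault_set_of(mirror))
```

With only two fault sets in this model, every mirror copy would necessarily land in the other one, which is exactly the behaviour a dual-room setup wants.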

The problem with ScaleIO’s new enforcement of the three fault set model is that you can no longer easily build a dual-room configuration for availability, which is pretty much the standard highly available design in small and medium-sized environments (and in quite a number of large ones). With this limitation in mind, and knowing a bit about the way the data paths and metadata are placed in ScaleIO, I decided to see whether this really was a hard limitation or whether there was a way to work around it and build a more traditional dual-site configuration with the 2.0 release.

Cluster configuration

In order to ensure a minimum level of viability when one site is offline, I set up a test bed with a cluster of two fault sets of three nodes each. Each node has three 100 GB disks (yes, these are virtual machines). There is also a third fault set consisting of a single node with the minimum of 100 GB of storage assigned to it.

There is a shared L2 network across the entire cluster for storage services, so this is similar to having a stretched VLAN across two rooms.

On the MDM side of things, I used the 5-node cluster configuration, with the primary MDM in one fault set and the standby in the second fault set.

These are attached to a three-node vSphere cluster, with a half-dozen Linux VMs to generate load and test connectivity.


Once all of the ScaleIO nodes are online, I can use the CLI or the vSphere plugin to create volumes and map them from the cluster to the SDCs on the ESXi hosts. Here there is no problem. ScaleIO raises an alert that the fault sets are not balanced, but the only consequence is that data is distributed across the fault sets by percentage used rather than equally by volume. Otherwise, the cluster is fully operational. At this stage I have all of the VMs running nicely and am running bonnie++ to generate a read and write load across the cluster.
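To make the “by percentage used” point concrete, here is a rough Python calculation. It assumes the raw capacities of this test bed (two fault sets of 3 nodes × 3 disks × 100 GB, plus one 100 GB fault set) and a simplified model in which every fault set is filled to the same percentage. The function name and the 380 GB figure are purely illustrative, and spare capacity is ignored.

```python
# Raw capacity per fault set in GB, taken from the test bed described above.
CAPACITY_GB = {"fs1": 900, "fs2": 900, "fs3": 100}


def distribute_by_percentage(stored_gb, capacity_gb):
    """Split stored data so every fault set is filled to the same percentage."""
    total = sum(capacity_gb.values())
    return {name: stored_gb * cap / total for name, cap in capacity_gb.items()}


if __name__ == "__main__":
    stored = 380  # GB of raw data, both copies counted (illustrative figure)
    for name, used in distribute_by_percentage(stored, CAPACITY_GB).items():
        pct = 100 * used / CAPACITY_GB[name]
        print(f"{name}: {used:.0f} GB used ({pct:.0f}% full)")
```

Under these assumptions each large fault set holds 180 GB and the small one 20 GB, so all three sit at 20% full even though the absolute volumes differ widely.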

At this point I take the single node of the third fault set offline politely using the delete_service.sh command in /opt/emc/scaleio/sds/bin.

This has the expected result of triggering a rebuild operation to properly protect the blocks that were stored on the 100 GB of the third fault set. Since there is a relatively small amount of data involved, this goes fairly quickly.

At this point, the storage is still available and operational to the SDCs and everything is running. However, there is one limitation: I cannot modify the structure of the cluster without the third fault set online. That is to say, I can’t create or delete volumes to present to the SDCs. In steady-state operation this is not a big deal since I don’t modify the volumes on a daily basis.

Once the rebalance has finished, I have my desired state: a dual-site setup with data being written across the two fault sets that are online. Now for the “disaster” test. Here I brutally power off all three of the nodes in one of the remaining fault sets and observe the results. The storage is still available to the SDCs, and the VMs are still running and generating read/write traffic. So we have a reasonable DR test for a single site failure.

Now for the fail-back: I bring the nodes in the failed fault set back online and the expected rebuild operation kicks off, reestablishing the two fault-set cluster with blocks distributed across the two fault sets.


ScaleIO is an impressively robust and resilient system that allows for things that the designers probably didn’t have in mind. That said, a simple dual-room setup based on two fault sets with a minimum number of nodes per fault set should be part of the standard configuration options, both because this type of configuration is so common and to put ScaleIO on a level competitive footing with the dual-site HA offerings available from HP, Huawei, DataCore, etc.

And to finish, I would also recommend separating the MDM roles from the SDS on completely different systems, perhaps in VMs pinned to local storage on site, for a clear separation of responsibility. The fact that the two roles can cohabit the same servers can lead to some confusion when you’re just getting started with ScaleIO and not yet clear on the dependencies.


Can't register vSphere Replication appliance

I ran into an interesting problem the other day when deploying vSphere Replication: the appliance couldn’t register the service with vCenter. It turns out that a combination of factors in the network configuration can produce this problem. It is most likely to occur if you are using the vCSA.

As far as I can tell, the sequence of events for registering with vCenter is the following:

  • use the address or IP currently in use for the active Web Client session to contact vCenter
  • request the value of the Runtime settings vCenter Server name
  • contact the vExtension service based on the name returned in the previous step

And this is where the problem comes from. By default, when you install the vCSA, the value stored in the Runtime settings is the short name of the server, not the FQDN. At least this is the case on the v5.x versions. I haven’t yet tested the 6.0 vCSA.

The net result depends on how your network is configured and whether you are using DHCP or not. I was running into the problem and able to reproduce it with the following sequence of actions:

  • Configure DNS correctly with proper forward and reverse entries for the vCSA and the Replication Appliance
  • On a subnet with no DHCP services, deploy the vCSA with a fixed IP address
  • On the same subnet, deploy the vSphere Replication appliance with a fixed IP address

This will fail because, when you configure the vSphere Replication appliance with a fixed IP, there’s no place to enter DNS search domains, so there’s no way name resolution will work for a short name returned by the vCSA. If you are deploying using DHCP, you will probably be sending search domains to the client, so the resolution will work properly.
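A quick way to check for this condition from any machine that shares the appliance’s resolver settings is to compare resolution of the short name and the FQDN. The Python sketch below is only a diagnostic aid and not part of either product. The host names are placeholders, and the `resolver` parameter exists only so the logic can be exercised without real DNS.

```python
import socket


def can_resolve(name, resolver=socket.gethostbyname):
    """Return True if the name resolves, False on a lookup failure."""
    try:
        resolver(name)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def diagnose(short_name, fqdn, resolver=socket.gethostbyname):
    """Describe whether a short-name lookup would break registration."""
    short_ok = can_resolve(short_name, resolver)
    fqdn_ok = can_resolve(fqdn, resolver)
    if fqdn_ok and not short_ok:
        return "short name fails: no usable DNS search domain"
    if fqdn_ok and short_ok:
        return "both names resolve"
    return "FQDN does not resolve: check the DNS records first"


if __name__ == "__main__":
    # Placeholder names; substitute your own vCenter short name and FQDN.
    print(diagnose("vcsa01", "vcsa01.example.com"))
```

If the FQDN resolves but the short name does not, you are most likely looking at the situation described above.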

When you go to the VAMI console of the replication appliance and try to connect to the vCenter server manually, you get the following somewhat misleading error message:

“Unable to obtain SSL certificate: Bad server response; is a LookupService listening on the given address?”

It would have been nice if the message mentioned the address it was trying to contact, which would have highlighted the fact that it was looking up a short name.

The workaround is simply to update the Runtime settings vCenter Server name to the FQDN. It’s also probably a good idea to verify that the FQDN in Advanced settings has the correct value as well.

So if you ever see an appliance that has to register an extension to the vCenter web UI and it isn’t working, checking the value of the Runtime settings vCenter name might be the solution.