
Friday, Aug 12, 2016

Thunderbolt: a fast and cheap SAN

The lack of uptake by major vendors of Thunderbolt as a transport for storage systems has baffled me for a long time. For large systems there are a number of hard limitations that make it less than ideal, but in the smaller SAN/NAS space it seems like it would be a perfect replacement for SAS-connected disk trays.

This may be changing now that cluster interconnects are arriving that leverage the fact that Thunderbolt is basically a PCIe bus extension, and PCIe switches are coming on the market from players like IDT, Avago and Broadcom.

But getting back to practical applications, I started having a need for a portable SAN/NAS box to help with some client projects involving datacenter migrations or storage migrations where they needed a little extra swing space. With the current state of the art there are quite a few Thunderbolt-based storage systems that are well adapted to what I had in mind. The first issue I ran into is that while Thunderbolt 3 on USB-C connectors is starting to appear in newer laptops, NUCs and micro PCs, they are almost always single-port setups, and I wanted to dedicate one port to storage and another to the network interconnect. That led me back to what appears to be the only small form factor machine on the market with dual Thunderbolt ports: the Mac Mini. This means I’m “stuck” with Thunderbolt 2 over the Mini DisplayPort connectors, but at 20 Gbps that ought to be fine for a mobile NAS.

What pushed this over the line from something on my list of things I should do someday to an actual project was the discovery of a relatively new line of disk bays from Akitio that have four 2.5” slots, accept the thicker high-density drives, and are daisy-chainable with two Thunderbolt 2 ports.

So with this in mind, my bill of materials looks like this:

And I rounded this out with a set of four 1 TB drives that I had kicking around.

With all of this in hand, I started off with FreeNAS, but for some reason I couldn’t get it to install with a native EFI boot on the Mac Mini, so I ended up tweaking the configuration using the rEFInd Boot Manager to get FreeNAS running.

The basic install and configuration worked just fine, but for some reason I could never track down, the system would freeze once I started putting load on it, whether from local stress tests or from copying data from another system. About that time I noticed that the latest Ubuntu release, 16.04, includes ZFS natively, so I went to give that a spin.

For starters, Ubuntu installed natively and booted via EFI without a hitch, which simplified the setup a little. Then it was just a matter of installing the ZFS utilities (apt-get install zfsutils-linux) and setting up an SSD pool and a couple of disk pools.
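
For reference, the whole software setup boils down to a handful of commands; the pool names, layout and device paths below are placeholders rather than my exact configuration:

sudo apt-get install zfsutils-linux

# SSD pool on a single external SSD (device path is a placeholder)
sudo zpool create ssdpool /dev/disk/by-id/ata-SSD_EXAMPLE

# Disk pool striped across two mirrored pairs of the 1 TB drives in the Akitio bay
sudo zpool create datapool \
    mirror /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 \
    mirror /dev/disk/by-id/ata-DISK3 /dev/disk/by-id/ata-DISK4

# A filesystem to hold the data being migrated
sudo zfs create datapool/swingspace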

On my local network I am sadly behind the times and still living in GbE-land, but my initial load tests, transferring some ZFS file systems (4 TB of data) using zfs send/recv over netcat, worked flawlessly and saturated the GbE link using either the built-in network port or the Sonnet adapter.
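
The mechanics of that transfer are simple enough; something along these lines, where the host, pool and snapshot names are placeholders (and some netcat builds want -l -p instead of -l):

# On the destination: listen on a TCP port and pipe the stream into zfs receive
nc -l 3333 | zfs receive -F datapool/media

# On the source: snapshot the filesystem and stream it to the destination
zfs snapshot srcpool/media@xfer1
zfs send srcpool/media@xfer1 | nc mobile-nas 3333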

Physical assembly

As with the Mobile Lab I just built, I wanted to simplify the power cabling and management and limit the box to a single power plug. I did look at including a power bar and using the supplied adaptors, but that takes up a lot of space and adds a significant amount of weight. Happily, the Akitio boxes take 12 V in, so it was just a matter of soldering some connectors onto the 12 V lines out of the PSU and running one direct line from the plug over to the Mac Mini.

Then it was off to my metalworking shop to build a case to hold all of this which resulted in the following design:

 

Real life

I’ve got a project going where I’m working with a team that is consolidating a number of data centers, where we deploy a local staging area equipped with two ESXi servers and a 24 TB ZFS-based NAS. From there I need to move the data to another data center, and we’re leveraging ZFS’s ability to sync across different systems using snapshots (as discussed in my auto-replicate scripts).

Given the volume of data involved, I do the initial replication manually using netcat instead of the script that uses ssh, since ssh is CPU-bound on a single core, which limits the potential throughput. Using this method I was getting sustained network throughput of 500 MB/sec. Yes, that’s megabytes, not bits. Peaks were hitting 750 MB/sec. All of this through a Mac Mini…
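
If you want to watch the rate while a transfer runs, dropping pv into the sending pipeline works nicely; pv isn’t part of the workflow described above, just a convenient meter, and the names below are placeholders:

# Recursive replication stream of the staging data, with a live throughput readout
zfs snapshot -r tank/staging@seed
zfs send -R tank/staging@seed | pv | nc destination-nas 3333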

Mobile NAS next to its big brother:

Miscellany

I usually try to design systems to be as quiet as possible, and while there are fans on the Akitio boxes, they are low RPM and make hardly any noise. Using the included power adapters it’s actually very, very quiet. In the final mobile configuration, the only thing that makes any noise is the fan on the PSU, so this setup could very well be used as a home NAS without having to hide it in the garage. If I were using this as a design spec for a home NAS, though, I’d probably start with an Intel NUC or Gigabyte BRIX with Thunderbolt 3, running FreeNAS for the simplicity of management and easy access to protocols other than NFS.

While it’s certainly easiest to do all of this over Ethernet, I can also extend the setup to be able to handle Fibre Channel with something like the Promise SANLink2 and the Linux FC Target software.

Thunderbolt 2 on the Mac Mini supports up to six external devices on each chain, so I could theoretically add two more Akitio boxes on the storage chain, and another five if I wanted to share the Thunderbolt connection I’m using for the network.

Thursday, Jun 13, 2013

Mac Pro 2013 Storage

There’s been a lot of talk about the new Mac Pro just announced at WWDC 2013 and I’m really liking what I see even if I have no real use for anything with that kind of horsepower.

But as usual, when Apple giveth, Apple taketh away. One big thing that’s currently missing from the newest iteration of the Mac Pro is internal storage expansion. Much noise has been made about the simplest types of solutions involving direct Thunderbolt connections to external drives (individually or multiple drive cases) and the resulting problems concerning cable mess, noise issues and the like.

I’m curious to see whether Apple will launch the Mac Pro with a suite of associated Thunderbolt peripherals, since there is currently a dearth of products in this space and a super-quiet complementary multi-disk storage system seems like an obvious product. In the meantime, we still have to cope with the fact that Thunderbolt over copper is limited to about 3 m, which can be problematic if you want (potentially noisy) expandable storage that’s not right beside you.

But even before the machine is released, we can imagine some useful and powerful solutions to these issues. In the enterprise world we do an awful lot with high-end NAS and SAN boxes, and there are ways to profit from these technologies on a reasonable budget. Well, reasonable to someone ready to drop a few grand on a Mac Pro or two…

In any case, those of us with Mac Minis have already gone through this process of outgrowing storage that is handled by individual drives, and are often in spots that are inconvenient for hooking up external storage like home media servers.

So how do we get there from here? The idea is to build an external storage box, using connectivity options that let you place it away from the office space where spinning disks and fans disturb the ambiance, without penalizing performance. I’ve been using this approach for quite some time now, but contenting myself with standard Gigabit Ethernet since my needs are limited to video and music streaming, plus some basic virtual machines for testing.

The big news that has surprised me is the appearance (finally) of 10GBase-T copper cards and switches. Yes, that’s 10GbE, so more than fast enough to handle just about anything that a set of SATA drives can spit out, like what we see in the last generation of Mac Pros.

Sticking with standard Ethernet CAT6 we can maintain 10GBase-T over a 55 meter cable. So we can easily put our storage a fair distance away from the office.

In order to use a standard PCIe expansion card we need a means of plugging it in. For this we have options like the Sonnet Echo Express SE which is a box with an 8x PCIe slot that you connect via Thunderbolt to the Mac Pro (or any Thunderbolt equipped machine for that matter).

There are a number of different 10GBase-T cards out there and one thing that remains to be determined is the driver availability for OS X. Sonnet proposes the Myricom Myri 10-G, but they don’t currently offer a 10GBase-T version. They are available with SFP+ Fiber (expensive) and CX4 (short cables). I did find a few cards available on Amazon like the Intel X540T1 at $353 or the HP G2 Dual Port card at $300 so there are options out there and I’m hoping that someone with deeper pockets than me will test the waters here.

I have a soft spot for using ZFS as my preferred storage technology for a number of reasons, including reliability and flexibility, but any number of server solutions are possible as long as they can publish a protocol OS X can talk to. With a build your own approach, you can find all sorts of boxes optimized for small, medium and massive storage options.

If your needs are relatively small, you could go the route of something like the HP N40L Microserver. Currently my setup with these machines is capable of saturating a standard GbE link (sequential IO) with four low-RPM SATA drives, so there’s some headroom left going to 10GbE with speedier disks or even SSDs.

Accessing the storage server

For a NAS-protocol approach I prefer NFS (though I’m waiting with great interest to see what 10.9’s SMB2 implementation will bring), but if you prefer working with storage that the OS sees as disks, you can use iSCSI while sticking with Ethernet and TCP/IP as the transport. Fortunately, the GlobalSAN iSCSI Initiator will do the trick.
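
On the ZFS side, the two approaches look something like this; the dataset names and size are placeholders, and the iSCSI target configuration itself depends on the server OS (COMSTAR on Solaris derivatives, LIO on Linux, and so on):

# NAS approach: have ZFS publish the filesystem over NFS
zfs set sharenfs=on tank/projects

# SAN approach: carve out a block device (zvol) to present to the GlobalSAN initiator as an iSCSI LUN
zfs create -V 500G tank/macpro-lun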

If you are dedicating the storage to a single machine, you can simply connect everything directly, but if you want to share this kind of setup you’ll need a switch. Currently the best deal that I’ve seen out there for reasonably priced 10GbE switches is the Netgear 8-port 10GBase-T switch (~$900).

It’s true that all of this is considerably more complicated than simply popping off the side of the machine and connecting a new disk, or buying a little Thunderbolt external array, but if you need serious performance and serious capacity, even the old Mac Pro would reach its limits pretty quickly. Moving to a dedicated storage system permits better performance, sharing across multiple machines and many more options for growing the system over time.

I suspect that the majority of folks doing massive video work are already using some kind of SAN, whether Fibre Channel or iSCSI so the impact to them is mostly buying the Sonnet expansion box. It’s all the people in the middle who are currently making do with 3-4 disks who have to start asking a lot of questions about how to plan for storage management.

Monday, Oct 29, 2012

Nexenta 3.1.3 on a DELL R720

I’ve been a happy Nexenta user for quite some time, but there are a few use cases where the default tool kit has some issues. The biggest issue is that the core customer base is people that are building their own storage servers with COTS equipment (lots of SuperMicro) rather than branded stuff from the major vendors like DELL & HP.

As such you can run into problems where the HBAs use custom firmware designed by the vendor rather than the stock LSI firmware, which tends to get driver support very quickly. Unfortunately, in my current environment, the purchasing process precludes the white-box approach and I’m limited to the branded solutions.

I’ve been fighting with getting the 3.1.3 release installed because the standard internal SAS card supplied by DELL (PERC H710) isn’t recognized. There is an updated mr_sas driver available, but the problem I ran into was just how to get the card recognized by the installer. There is a document referenced during the install process, but it’s awfully long and involved when the actual steps required are fairly simple.

Prerequisites

  • updated mr_sas driver on a usb key

Then it’s just a matter of running the following commands (modified for the name of your key):

cp /media/LEXAR4G/mr_sas /
hwdisco -d '"pciex1000,5d" "pciex1000,5b"' /mr_sas /mr_sas

This will inject the new version of the driver into the install process but it’s still not available to the currently running environment so the disk won’t be visible yet. To get around this I did the following:

cp /media/LEXAR4G/mr_sas /kernel/drv/amd64/
update_drv -f mr_sas

This will reload the driver and should attach the card, so whatever virtual disks you have created will now be visible to the installer. Press F1 to get back to the installer and you’re good to go.

Side topic (the US is not the world)

The default installer, like many, makes the very annoying assumption that the entire world uses QWERTY keyboards. There is no option to select your keyboard during the boot process, and the international layouts aren’t even installed on the installer CD.

But it turns out that you can force load your local keyboard map manually and fortunately the Nexenta Installer CD does automount USB devices.

So I took a spare USB key, connected it to a test Solaris VM, and did a quick rsync of /usr/share/lib/keytables over to the key.

Once you’ve booted the Nexenta installer and connected the key, hitting F2 will put you at a console. From there, the command:

loadkeys /media/LEXAR4G/keytables/type_6/france

will load up the standard fr keymap. Depending on your keyboard you might need to use the type_101 keymaps.

Note that once the install is complete, Nexenta does offer the possibility of choosing the console keyboard layout; it’s only during the install phase that this is an issue.

Wednesday, Sep 26, 2012

auto-replicate update

I just updated the auto-replicate script to use the zfs hold feature, so that snapshots that have been used to replicate a file system have a hold placed on them and can’t be deleted.

It manages the holds so that once you’ve sent a new replication stream to a destination, it removes the old hold and adds one to the latest snapshot.

This should obviate the need to deal with missing source snapshots.
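
Done by hand, the bookkeeping the script now automates looks roughly like this; the hold tag and snapshot names are placeholders:

# Hold the snapshot that was just used as the replication source
zfs hold auto-repl tank/data@2012-09-26-1300

# Once the next replication has succeeded, release the hold on the previous snapshot
zfs release auto-repl tank/data@2012-09-26-1200

# Check which holds exist on a snapshot
zfs holds tank/data@2012-09-26-1300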

Note: you can still break the system by taking snapshots on the destination volume or manually deleting snapshots on the destination. But you shouldn’t be doing that kind of thing anyway…

Wednesday, May 9, 2012

Back to backups

It’s been a while since I documented the current backup architecture at the house, which has changed a little with the inclusion of two little HP Microservers running Solaris 11. I’m a big fan of the HP Microservers as they offer the reliability and flexibility of a true server with the energy consumption and silence of a small NAS, at an unbeatable price.

Overview

The core of the backup and operations is based on ZFS and its ability to take snapshots and replicate them asynchronously. In addition, the two servers use RAIDZ over four 2 TB drives for resiliency.
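
Pool creation on each Microserver amounts to a single command; the pool name and Solaris device names below are illustrative:

# RAIDZ across the four 2 TB drives: one disk of parity, roughly three disks of usable space
zpool create tank raidz c7t0d0 c7t1d0 c7t2d0 c7t3d0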

Starting at the point furthest away from the offsite disaster recovery copies, the core of the day-to-day action is on a Mac Mini and a ZFS Microserver in the living room. The Mac Mini is connected to the TV and the stereo, and its primary role is as the media center. The iTunes library is far too large to fit conveniently on a single disk, so the Mini contains only the iTunes application and the library database. The actual contents of the iTunes library are stored on the ZFS server via an NFS mount, which ensures that the path is consistent and auto-mounted, even before a user session is opened. AFP mounts are user-dependent, open with the session, and in case of conflicts will append a “-1” to the name listed in /Volumes, which can cause all sorts of problems.
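
On the Mini this is just a static NFS entry handled by the automounter; a sketch along these lines, with the server name and paths as placeholders:

# Add a static NFS mount to /etc/fstab; the "/- -static" map in /etc/auto_master
# mounts it at boot, before any user session opens:
#   zfsbox:/tank/media  /Users/Shared/iTunesMedia  nfs  resvport,rw,nosuid  0  0
# Then flush the automounter so it picks up the new entry
sudo automount -vc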

The Media ZFS filesystem is snapshotted and replicated every hour to the second server in the office. The snapshot retention is set to 4 days (96 hourly snapshots). So in the case of data corruption, I can easily roll back to any snapshot state in the last few days, or I can manually restore any files deleted by accident by browsing the snapshots. A key point here is that the ZFS filesystem architecture follows a block level changelog so that replication activity contains only the modified blocks and can be calculated on the fly during the replication operation. This means that there are no long evaluation cycles like those in traditional backup approaches using Time Machine or rsync.
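
Stripped of the error handling and retention management that the auto-replicate script takes care of, one hourly cycle amounts to something like this; pool, host and snapshot names are placeholders:

# Most recent existing snapshot of the media filesystem
PREV=$(zfs list -H -t snapshot -o name -s creation -d 1 tank/media | tail -1 | cut -d@ -f2)

# Take the new hourly snapshot and send only the blocks changed since the previous one
NOW=$(date +%Y-%m-%d-%H00)
zfs snapshot tank/media@$NOW
zfs send -i tank/media@$PREV tank/media@$NOW | ssh office-server zfs receive -F tank/media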

iPhoto libraries, due to their size, are also stored on the server on a separate user volume and copied using the same methods.

Then there’s the question of the Mini’s backups. In order to minimize RPO and RTO, I have two approaches. First, I use SuperDuper to clone the internal Mini SSD to an external 2.5” drive once per day. This permits an RTO of practically zero: if the internal drive dies, I can immediately reboot from the external drive with at most 24 hours of lag in the contents. To address the risk of data loss, the Mini is also backed up every hour via Time Machine to the local ZFS server. I’m using the napp-it tool on the ZFS box to handle the installation and configuration of the Netatalk package to publish ZFS filesystems over AFP. Again, the backup volume is replicated hourly to the second server in the office.
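
For reference, the moving parts are a netatalk 2.x volume definition with the Time Machine option (napp-it writes this for me) and, depending on the OS X release, a defaults switch on the Mac so Time Machine will accept the network volume; the share path and name below are placeholders:

# AppleVolumes.default entry along these lines; the "tm" option flags the share as a Time Machine target:
#   /tank/backups/minitm  "Mini TimeMachine"  options:tm,usedots,upriv
# On the Mac, older OS X releases may also need unsupported network volumes enabled:
defaults write com.apple.systempreferences TMShowUnsupportedNetworkVolumes 1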

RTO iTunes

Another advantage of this structure is that if the living room server dies, the only thing I need to do is to change the NFS mount point on the Mini to point to the server in the office and everything is back online. The catch is that because the house is very very old and I haven’t yet found an effective, discreet method for pulling GbE between the office and the living room, this connection is over Wifi, so there is a definite performance hit. But for music and up to 720p video it works just fine.

iOS

All of the iOS devices in the house are linked to the iTunes library on the Mini, including backups so they get a free ride on the backups of the Mini.

Portables

All of the MacBooks in the house are also backed up via Time Machine to volumes on the living room server, with hourly replication to the office so there are always two copies available at any moment.

The office

In the office I have the second ZFS server plus an older Mac Mini running OS X Server. The same strategy applies to this Mac Mini as well: an external drive, duplicated via SuperDuper for a quick return to service. But I’ve had issues with the sheer number of files on the server causing problems with Time Machine, so I also use SuperDuper to clone the server to a disk image on the ZFS server.

I have a number of virtual machines for lab work in various formats (VirtualBox, ESX, Fusion, Xen, …) on dedicated volumes on the ZFS server in the office, accessed via NFS. I’ve played with iSCSI on this system and it works well, but NFS is considerably more flexible and any performance difference is negligible. Currently the virtualisation host is an old white-box machine, but I’m dreaming of building a proper ESX High Availability cluster using two Mac Minis, based on the news that I can install ESXi 5 on the latest generation and virtualize OS X instances as well as my Linux and Windows VMs.

Offsite

No serious backup plan would be complete without an offsite component. I currently use a simple USB dual-drive dock to hold the backup zpool (striped for maximum space) made up of two 2 TB drives. The filesystems on it receive an incremental update on a daily basis, but only the most recent snapshot is retained.

These disks are swapped out on a weekly or bi-weekly basis. With the contents of these two disks I can reconstruct my entire environment using any PC on which I can install Solaris.
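
The offsite pool itself is nothing fancy; roughly the following, with device, dataset and snapshot names as placeholders:

# Striped pool (maximum space, no redundancy) across the two 2 TB disks in the dock
zpool create offsite c8t0d0 c8t1d0

# Daily: incremental update, then drop the older snapshot so only the latest is retained
zfs send -i tank/media@2012-05-08 tank/media@2012-05-09 | zfs receive -F offsite/media
zfs destroy offsite/media@2012-05-08

# Export cleanly before pulling the disks; import when the set comes back on site
zpool export offsite
zpool import offsite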

The best part of this backup structure is that it requires practically no intervention on my part at all. I receive email notifications of the replication transactions so if anything goes wrong I’ll spot it in the logs. The only real work on my part is swapping out the offsite disks on a regular basis, but even there, the process is forgiving and I can swap the disks at any time as there is no hard schedule that has to be followed.
