Hi Dan,

Thanks for sharing!

On 9 September 2014 20:12, Dan Van Der Ster <daniel.vanderster at cern.ch> wrote:

> We do this for some small scale NAS use-cases, with ZFS running in a VM with rbd volumes. The performance is not great (especially since we throttle the IOPS of our RBD). We also tried a few kRBD / ZFS servers with an SSD ZIL; the SSD solves any performance problem we ever had with ZFS on RBD.

That's good to hear. My limited experience doing this on a smaller Ceph cluster (and without any SSD journals or cache devices for the ZFS head) points to write latency being the immediate issue; decent PCIe SLC SSD devices should pretty much sort that out, given the cluster itself has plenty of write throughput available. Then there are further MLC devices for L2ARC - not sure yet, but I'm guessing metadata-heavy datasets might require primarycache=metadata and rely on L2ARC for the data cache (rough sketch appended below). And all this should get better in the medium term with performance improvements and RDMA capability (we're building this with that option in reserve).

> I would say though that this setup is rather adventurous. ZoL is not rock solid - we've had a few lockups in testing, all of which have been fixed in the latest ZFS code in git (my colleague in CC could elaborate if you're interested).

Hmm okay, that's not great. The only problem I've experienced thus far is when the ZoL repos stopped providing DKMS and borked an upgrade for me, until I figured out what had happened and cleaned up the old .ko files. So yes, interested to hear the elaboration on that.

> One thing I'm not comfortable with is the idea of ZFS checking the data in addition to Ceph. Sure, ZFS will tell us if there is a checksum error, but without any redundancy at the ZFS layer there will be no way to correct that error. Of course, the hope is that RADOS will ensure 100% data consistency, but what happens if not?...

The ZFS checksumming would tell us if there has been any corruption, which as you've pointed out shouldn't happen anyway on top of Ceph. But if we did hit some awful disaster scenario where it did happen, then we'd be restoring from tape, and it'd sure be good to know which files actually needed restoring (see the scrub example appended below). I.e., if we lost a single PG at the Ceph level, we don't want to have to blindly restore the whole zpool or dataset.

> Personally, I think you're very brave to consider running 2PB of ZoL on RBD. If I were you I would seriously evaluate the CephFS option. It used to be on the roadmap for ICE 2.0 coming out this fall, though I noticed it's not there anymore (??!!!).

Yeah, it's very disappointing that this was silently removed. And it's particularly concerning that this happened after the RedHat acquisition. I'm an ICE customer and sure would have liked some input there, for exactly the reason we're discussing.

> Anyway I would say that ZoL on kRBD is not necessarily a more stable solution than CephFS. Even Gluster striped on top of RBD would probably be more stable than ZoL on RBD.

If we really have to, we'll just run Gluster natively instead (or perhaps XFS on RBD as the option before that) - the hardware needn't change for that except to configure RAIDs rather than JBODs on the servers.

--
Cheers,
~Blairo
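
P.S. For concreteness, here's a very rough sketch of the NAS-head layout I have in mind; the RBD image name, device paths, pool and dataset names are all placeholders rather than anything we've actually built:

    # map the RBD image with kRBD (placeholder pool/image names)
    rbd map nas/zfs-backing

    # build the pool on the mapped device, with a PCIe SLC SSD partition
    # as the SLOG and an MLC SSD as L2ARC (placeholder device paths)
    zpool create tank /dev/rbd0 \
        log /dev/disk/by-id/slc-ssd-part1 \
        cache /dev/disk/by-id/mlc-ssd

    # for metadata-heavy datasets: keep only metadata in ARC and see
    # whether L2ARC can still carry the data cache - something we'd
    # need to test before relying on it
    zfs create tank/metadata-heavy
    zfs set primarycache=metadata tank/metadata-heavy
    zfs set secondarycache=all tank/metadata-heavy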
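
P.P.S. On the checksumming point, the workflow I'm imagining for the disaster case is just the standard scrub-and-report one, so a tape restore can be targeted at individual files rather than the whole zpool (pool name again a placeholder):

    # walk every block in the pool and verify its checksum
    zpool scrub tank

    # once the scrub finishes, -v lists any files with permanent
    # (uncorrectable) errors by path
    zpool status -v tank

    # restore just those files from tape, then reset the error counters
    zpool clear tank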