NAS on RBD

Hi Blair!

On 9 September 2014 08:47, Blair Bethwaite <blair.bethwaite at gmail.com> wrote:
> Hi Dan,
>
> Thanks for sharing!
>
> On 9 September 2014 20:12, Dan Van Der Ster <daniel.vanderster at cern.ch> wrote:
>> We do this for some small scale NAS use-cases, with ZFS running in a VM with rbd volumes. The performance is not great (especially since we throttle the IOPS of our RBD). We also tried a few kRBD / ZFS servers with an SSD ZIL - the SSD solves any performance problem we ever had with ZFS on RBD.
>
> That's good to hear. My limited experience doing this on a smaller Ceph cluster (and without any SSD journals or cache devices for ZFS
> head) points to write latency being an immediate issue; decent PCIe SLC SSD devices should pretty much sort that out given the cluster itself has plenty of write throughput available. Then there are further MLC devices for L2ARC - not sure yet, but guessing metadata-heavy datasets might require primarycache=metadata and rely on L2ARC for data cache. And all this should get better in the medium term with performance improvements and RDMA capability (we're building this with that option in the hole).
>

I'd love to go back and forth with you privately or on one of the ZFS mailing lists if you want to discuss ZFS tuning in depth, but I just want to mention that setting primarycache=metadata will also cause the L2ARC to ONLY store and accelerate metadata (regardless of what secondarycache is set to). I believe the ZFS developers are looking to improve this eventually, but that's how it currently works: the L2ARC only contains what was pushed out of the main in-memory ARC.
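
As a rough sketch of what that tuning looks like in practice (the pool, dataset and device names below are all made up, adjust for your own hardware):

    # add a mirrored SLOG (ZIL) on SSDs and an L2ARC cache device
    zpool add tank log mirror /dev/sdx /dev/sdy
    zpool add tank cache /dev/nvme0n1

    # keep only metadata in the in-memory ARC; note the L2ARC will then
    # effectively hold only metadata too, since it is fed from ARC evictions
    zfs set primarycache=metadata tank/dataset
    zfs set secondarycache=all tank/dataset    # little practical effect here

That last line is exactly the limitation above: secondarycache=all doesn't get you a data cache in L2ARC once primarycache is restricted to metadata.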

>> I would say though that this setup is rather adventurous. ZoL is not rock solid - we've had a few lockups in testing, all of which have been fixed in the latest ZFS code in git (my colleague in CC could elaborate if you're interested).
>
> Hmm okay, that's not great. The only problem I've experienced thus far is when the ZoL repos stopped providing DKMS and borked an upgrade for me until I figured out what had happened and cleaned up the old .ko files. So yes, interested to hear elaboration on that.
>

You mentioned in one of your other emails that if you deployed this idea of a ZFS NFS server, you'd do it inside a KVM VM and make use of librbd rather than krbd. If you're worried about ZoL stability and feel comfortable going outside Linux, you could always go with a *BSD or Illumos distro where ZFS support is much more stable/solid. 
In any case, I haven't had any major show-stopping issues with ZoL myself, and I use it heavily. Still, unless you're really comfortable with ZoL or *BSD/Illumos (as I am), I'd recommend looking into other solutions.
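
For what it's worth, a minimal sketch of the librbd-in-KVM approach (the pool/image names, size and options are all hypothetical):

    # create the backing image for the ZFS/NFS head VM
    rbd create nfs-pool/zfs-vol0 --size 102400    # 100 GB, hypothetical names

    # hand it to the guest through librbd (qemu's rbd driver) instead of
    # mapping krbd on the hypervisor; libvirt's <disk type='network'
    # protocol='rbd'> stanza does the same thing if you manage guests that way
    qemu-system-x86_64 -enable-kvm -m 8192 \
        -drive file=rbd:nfs-pool/zfs-vol0,format=raw,cache=writeback,if=virtio
    # ...plus the usual network/console options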

>> One thing I'm not comfortable with is the idea of ZFS checking the data in addition to Ceph. Sure, ZFS will tell us if there is a checksum error, but without any redundancy at the ZFS layer there will be no way to correct that error. Of course, the hope is that RADOS will ensure 100% data consistency, but what happens if not?...
> 
> The ZFS checksumming would tell us if there has been any corruption, which as you've pointed out shouldn't happen anyway on top of Ceph.

Just want to quickly address this - someone correct me if I'm wrong, but IIRC even with a replica count of 3 or more, Ceph does not (currently) have any intelligence when it detects a corrupted/"incorrect" PG: repair always overwrites the replicas with whatever data is on the primary, meaning that if the primary copy is the one that's corrupted/bit-rotted/"incorrect", it will replace the good replicas with the bad.
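
For reference, that repair path looks roughly like this from the admin side (the PG id below is made up):

    # deep scrub is what detects the inconsistency in the first place
    ceph pg deep-scrub 2.1f

    # the PG will then show up as inconsistent in the health output
    ceph health detail

    # repair re-pushes the primary's copy to the replicas, so if the primary
    # holds the bad object, the bad data can win
    ceph pg repair 2.1f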

> But if we did have some awful disaster scenario where that happened then we'd be restoring from tape, and it'd sure be good to know which files actually needed restoring. I.e., if we lost a single PG at the Ceph level then we don't want to have to blindly restore the whole zpool or dataset.
>
>> Personally, I think you're very brave to consider running 2PB of ZoL on RBD. If I were you I would seriously evaluate the CephFS option. It used to be on the roadmap for ICE 2.0 coming out this fall, though I noticed it's not there anymore (??!!!).
>
> Yeah, it's very disappointing that this was silently removed. And it's particularly concerning that this happened post RedHat acquisition.
> I'm an ICE customer and sure would have liked some input there for exactly the reason we're discussing.
>

I'm looking forward to CephFS as well, and I agree, it's somewhat concerning that this happened post the Red Hat acquisition. I'm hoping Red Hat pours more resources into Inktank and Ceph rather than leeching resources away from them.

>> Anyway I would say that ZoL on kRBD is not necessarily a more stable solution than CephFS. Even Gluster striped on top of RBD would probably be more stable than ZoL on RBD.
>
> If we really have to we'll just run Gluster natively instead (or perhaps XFS on RBD as the option before that) - the hardware needn't change for that except to configure RAIDs rather than JBODs on the servers.

Really, I would look into RBD-backed HA NFS solutions like the ones Christian Balzer brought up in one of the previous emails. I'm sure setting up a couple of librbd-backed KVM VMs in an Active/Passive or Active+Passive/Passive+Active NFS configuration wouldn't be too hard, and it would likely be the more stable solution.
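
Something along these lines with Pacemaker, for example - the resource names, VIP and export path are placeholders, and you'd still need the RBD/filesystem resources, constraints and fencing on top of it:

    # floating service IP that moves with the active node
    pcs resource create nfs-vip ocf:heartbeat:IPaddr2 ip=192.168.0.100 cidr_netmask=24

    # NFS daemon plus the export of the RBD-backed filesystem
    pcs resource create nfs-daemon ocf:heartbeat:nfsserver
    pcs resource create nfs-export ocf:heartbeat:exportfs \
        clientspec=192.168.0.0/24 directory=/srv/export fsid=1 options=rw

    # keep them grouped on whichever node is currently active
    pcs resource group add nfs-group nfs-daemon nfs-export nfs-vip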

>
> --
> Cheers,
> ~Blairo

Cheers
_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

