Hi Dan,

Thanks for sharing!

On 9 September 2014 20:12, Dan Van Der Ster <daniel.vanderster at cern.ch> wrote:

> We do this for some small scale NAS use-cases, with ZFS running in a VM with rbd volumes. The performance is not great (especially since we throttle the IOPS of our RBD). We also tried a few kRBD / ZFS servers with an SSD ZIL; the SSD solves any performance problem we ever had with ZFS on RBD.

That's good to hear. My limited experience doing this on a smaller Ceph cluster (and without any SSD journals or cache devices for the ZFS head) points to write latency being the immediate issue; decent PCIe SLC SSD devices should pretty much sort that out, given the cluster itself has plenty of write throughput available. Then there are further MLC devices for L2ARC - not sure yet, but I'm guessing metadata-heavy datasets might require primarycache=metadata and rely on L2ARC for the data cache (rough sketch appended below). And all this should get better in the medium term with performance improvements and RDMA capability (we're building this with that option in reserve).

> I would say though that this setup is rather adventurous. ZoL is not rock solid - we've had a few lockups in testing, all of which have been fixed in the latest ZFS code in git (my colleague in CC could elaborate if you're interested).

Hmm okay, that's not great. The only problem I've experienced thus far is when the ZoL repos stopped providing DKMS and borked an upgrade for me, until I figured out what had happened and cleaned up the old .ko files. So yes, interested to hear the elaboration on that.

> One thing I'm not comfortable with is the idea of ZFS checking the data in addition to Ceph. Sure, ZFS will tell us if there is a checksum error, but without any redundancy at the ZFS layer there will be no way to correct that error. Of course, the hope is that RADOS will ensure 100% data consistency, but what happens if not?...

The ZFS checksumming would tell us if there has been any corruption, which as you've pointed out shouldn't happen anyway on top of Ceph. But if we did hit some awful disaster scenario where it did happen, then we'd be restoring from tape, and it'd sure be good to know which files actually needed restoring (see the scrub example appended below). I.e., if we lost a single PG at the Ceph level, we don't want to have to blindly restore the whole zpool or dataset.

> Personally, I think you're very brave to consider running 2PB of ZoL on RBD. If I were you I would seriously evaluate the CephFS option. It used to be on the roadmap for ICE 2.0 coming out this fall, though I noticed it's not there anymore (??!!!).

Yeah, it's very disappointing that this was silently removed. And it's particularly concerning that this happened after the RedHat acquisition. I'm an ICE customer and sure would have liked some input there, for exactly the reason we're discussing.

> Anyway I would say that ZoL on kRBD is not necessarily a more stable solution than CephFS. Even Gluster striped on top of RBD would probably be more stable than ZoL on RBD.

If we really have to, we'll just run Gluster natively instead (or perhaps XFS on RBD as the option before that) - the hardware needn't change for that except to configure RAIDs rather than JBODs on the servers.

--
Cheers,
~Blairo
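
P.S. For concreteness, here's a very rough sketch of the NAS-head layout I have in mind; the RBD image name, device paths, pool and dataset names are all placeholders rather than anything we've actually built:

    # map the RBD image with kRBD (placeholder pool/image names)
    rbd map nas/zfs-backing

    # build the pool on the mapped device, with a PCIe SLC SSD partition
    # as the SLOG and an MLC SSD as L2ARC (placeholder device paths)
    zpool create tank /dev/rbd0 \
        log /dev/disk/by-id/slc-ssd-part1 \
        cache /dev/disk/by-id/mlc-ssd

    # for metadata-heavy datasets: keep only metadata in ARC and see
    # whether L2ARC can still carry the data cache - something we'd
    # need to test before relying on it
    zfs create tank/metadata-heavy
    zfs set primarycache=metadata tank/metadata-heavy
    zfs set secondarycache=all tank/metadata-heavy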
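
P.P.S. On the checksumming point, the workflow I'm imagining for the disaster case is just the standard scrub-and-report one, so a tape restore can be targeted at individual files rather than the whole zpool (pool name again a placeholder):

    # walk every block in the pool and verify its checksum
    zpool scrub tank

    # once the scrub finishes, -v lists any files with permanent
    # (uncorrectable) errors by path
    zpool status -v tank

    # restore just those files from tape, then reset the error counters
    zpool clear tank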