Darren Soothill (darren.soothill) writes:

> Hi Fabien,
>
> ZFS on top of RBD really makes me shudder. ZFS expects to have
> individual disk devices that it can manage. It thinks it has them with
> this config, but Ceph is masking the real data behind it.
>
> As has been said before, why not just use Samba directly from CephFS
> and remove that layer of complexity in the middle?

As a user of ZFS on Ceph, I can explain some of our motivation.

As was pointed out earlier in this thread, CephFS will give you snapshots but not diffs between them. I don't know what the intent was with using diffs, but in ZFS' case snapshots provide a basis for checkpointing/recovery and instant dataset cloning, but also for replication/offsite mirroring (although not synchronous) - so you could easily back up/replicate the ZFS datasets to another location that doesn't necessarily have a Ceph installation (say, a big, cheap JBOD box with SMR drives running native ZFS). And you can diff between snapshots to see instantly which files were modified (rough command examples at the end of this post). That's in addition to the other benefits of running ZFS, such as lz4 compression (per dataset), deduplication, etc.

While it's true that ZFS on top of RBD is not optimal, it's not particularly dangerous or unreliable. You provide it with multiple RBDs and create a pool out of those (a ZFS pool, not a Ceph pool :). It sees each RBD as an individual disk and can issue I/O to them independently. If anything, you lose some of the benefits of ZFS - namely automatic error correction - but everything is still checksummed, so you still detect corruption. I already run ZFS within a VM (all our customers are hosted like this, using LXD or FreeBSD jails); whether the backing store is NFS, local disk or RBD doesn't really matter.

So why NOT run ZFS on top of RBD? Complexity mostly, and some measure of lost performance... But CephFS isn't exactly simple stuff to run in a reliable manner as of yet either (MDS performance and possible deadlocks are an issue).

If you're planning on serving files, you're still going to need an NFS or SMB layer. If you're on CephFS, you can serve via Ganesha or Samba without the extra ZFS layer (which would add latency); but either way you're still going to drag the data out of CephFS to the client mounting the FS and export it via Samba/NFS. If instead you attach, say, 10 x 1 TB RBD images to a host, assemble those into a ZFS pool, and run NFS or Samba on top of that, you'll have more or less the same data path, except that you'll additionally be going through ZFS, which introduces latency.

Now, if you're daring, you create a Ceph pool with size=1, min_size=1 (will Ceph let you do that? :), you map RBDs out of that, hand them over to ZFS in a striped-mirror config (or raidz2) - and let ZFS deal with failing vdevs by giving it new RBDs to replace them. Sounds crazy? Well, you lose the benefit of Ceph's self-healing, but you still get a super-scalable ZFS running on a near-limitless supply of JBOD :) And you can quickly set up different (ZFS) pools with different levels of redundancy, quotas, compression, metadata options, etc.

Who says you can't do both anyway (CephFS and ZFS)? Ceph is flexible enough...
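
For illustration, here is roughly what that snapshot/diff/replication workflow looks like. The dataset and host names (tank/data, backuphost, backup/data) are just placeholders, not anything from our setup:

  # take periodic snapshots of a dataset
  zfs snapshot tank/data@monday
  zfs snapshot tank/data@tuesday

  # instantly list which files changed between the two snapshots
  zfs diff tank/data@monday tank/data@tuesday

  # incremental replication to a box that knows nothing about Ceph
  zfs send -i tank/data@monday tank/data@tuesday | \
      ssh backuphost zfs receive backup/data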
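
And the "ZFS pool out of RBDs" part, sketched with made-up pool/image names (on older Ceph releases --size wants a plain number in MB rather than a suffix like 1T):

  # create and map a few RBD images; each shows up as /dev/rbdN
  rbd create rbdpool/zdisk0 --size 1T
  rbd map rbdpool/zdisk0
  # repeat for zdisk1..zdisk3

  # ZFS then treats each mapped RBD as an individual disk
  zpool create tank /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3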
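
The "daring" variant would go roughly like this. Pool and device names are placeholders, and the guard rails vary by release - newer Ceph versions make you confirm size=1 explicitly (and may also want mon_allow_pool_size_one=true):

  # a replicated pool with a single copy - ZFS supplies the redundancy
  ceph osd pool create zfsback 128 128 replicated
  ceph osd pool set zfsback size 1 --yes-i-really-mean-it
  ceph osd pool set zfsback min_size 1
  rbd pool init zfsback

  # map six RBDs out of it and let raidz2 handle failures
  zpool create tank raidz2 /dev/rbd0 /dev/rbd1 /dev/rbd2 \
      /dev/rbd3 /dev/rbd4 /dev/rbd5

  # when a vdev fails, just hand ZFS a freshly mapped RBD
  zpool replace tank /dev/rbd3 /dev/rbd6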
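
And the per-dataset knobs are one-liners (again, dataset names are just examples):

  zfs set compression=lz4 tank/shares
  zfs set dedup=on tank/shares            # mind the RAM cost
  zfs set quota=500G tank/shares/team-a
  zfs set copies=2 tank/important         # extra redundancy for one dataset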