On Fri, 31 Oct 2014 16:32:49 +0000 Michal Kozanecki wrote:

> I'll test this by manually inducing corrupted data to the ZFS filesystem
> and report back how ZFS+ceph interact during a detected file
> failure/corruption, how it recovers and any manual steps required, and
> report back with the results.
>
Looking forward to that.

> As for compression, using lz4 the CPU impact is around 5-20% depending
> on load, type of I/O and I/O size, with little-to-no I/O performance
> impact, and in fact in some cases the I/O performance actually
> increases. I'm currently looking at a compression ratio on the ZFS
> datasets of around 30-35% for a data consisting of rbd backed OpenStack
> KVM VMs.
>
I'm looking at a similar deployment (VM images) and over 30% compression
would at least offset ZFS's need to keep at least 20% free space (or
suffer massive performance degradation otherwise).

CPU usage looks acceptable, however in combination with SSD-backed OSDs
it is another thing to consider. As in, is it worth spending X amount of
money on faster CPUs and 10-20% space savings, or will another SSD be
cheaper?

I'm trying to position Ceph against SolidFire, who are claiming 4-10
times data reduction from a combination of compression, deduplication and
thin provisioning. Without, of course, quantifying things like which step
gives which reduction based on what sample data.

> I have not tried any sort of dedupe as it is memory intensive
> and I only had 24GB of ram on each node. I'll grab some FIO benchmarks
> and report back.
>
I foresee a massive failure here, despite the huge potential of one use
case here where all VMs are basically identical (KSM is very effective
with those, too).

Why the predicted failure? Several reasons:

1. Deduping is only local, per OSD. That will make a big dent, but with
many nearly identical VM images there should still be quite a bit of
identical data per OSD. However...

2. Data alignment. The default RADOS objects making up RBD images are
4MB, which, given my limited knowledge of ZFS, I presume get mapped to
128KB ZFS blocks that are then subject to the deduping process. However,
even if one were to install the same OS on identically sized RBD images,
I predict subtle differences in alignment within those objects and thus
within the ZFS blocks (see the toy sketch below). That becomes a near
certainty once those images (OS installs) are customized, files added or
deleted, etc.

3. ZFS block size and VM FS metadata. Even if all the data were
perfectly, identically aligned within the 4MB RADOS objects, the
resulting 128KB ZFS blocks are likely to contain metadata such as inodes
(creation times), making them subtly different and thus not eligible for
deduping.

OTOH SolidFire claims to be doing global deduplication; how they do that
efficiently is a bit beyond me, especially given the memory sizes of
their appliances. My guess is that they keep a map on disk (all SSDs) on
each node instead of keeping it in RAM. I suppose the updates (writes to
the SSDs) of this map are still substantially smaller than the data that
would otherwise be written without deduping.

Thus I think Ceph will need a similar approach for any deduping to work,
in combination with a much finer-grained "block size". The latter, I
believe, is already being discussed in the context of cache tier pools;
having to promote/demote a 4MB blob for a single hot 4KB of data is
hardly efficient.
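To make point 2 a bit more concrete, here is a quick toy sketch (Python,
nothing ZFS-specific; the 128KB block size and the 37-byte shift are just
assumptions for illustration) of how a tiny offset between otherwise
identical data defeats fixed-block deduplication:

    # Toy model of fixed-block dedup: hash 128KB blocks of two nearly
    # identical byte streams and count how many blocks they share.
    import hashlib
    import os

    BLOCK = 128 * 1024  # assumed dedup unit (ZFS recordsize)

    def block_hashes(data):
        """SHA-256 digests of consecutive fixed-size blocks."""
        return {hashlib.sha256(data[i:i + BLOCK]).digest()
                for i in range(0, len(data), BLOCK)}

    base = os.urandom(16 * 1024 * 1024)   # stand-in for 16MB of VM image data
    aligned = base                        # identical copy, same alignment
    shifted = b"\x00" * 37 + base[:-37]   # same data, shifted by 37 bytes

    total = len(block_hashes(base))
    print("aligned copy:", len(block_hashes(base) & block_hashes(aligned)),
          "of", total, "blocks dedupable")
    print("shifted copy:", len(block_hashes(base) & block_hashes(shifted)),
          "of", total, "blocks dedupable")

The aligned copy dedupes completely, the shifted one essentially not at
all, which is roughly what I expect once the same OS install ends up at
slightly different offsets inside the 4MB objects.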
Regards,

Christian

> Cheers,
>
>
> -----Original Message-----
> From: Christian Balzer [mailto:chibi@xxxxxxx]
> Sent: October-30-14 4:12 AM
> To: ceph-users
> Cc: Michal Kozanecki
> Subject: Re: use ZFS for OSDs
>
> On Wed, 29 Oct 2014 15:32:57 +0000 Michal Kozanecki wrote:
>
> [snip]
>
> > With Ceph handling the
> > redundancy at the OSD level I saw no need for using ZFS mirroring or
> > zraid, instead if ZFS detects corruption instead of self-healing it
> > sends a read failure of the pg file to ceph, and then ceph's scrub
> > mechanisms should then repair/replace the pg file using a good replica
> > elsewhere on the cluster. ZFS + ceph are a beautiful bitrot fighting
> > match!
> >
> Could you elaborate on that?
> AFAIK Ceph currently has no way to determine which of the replicas is
> "good"; one such failed PG object will require you to do a manual repair
> after the scrub and hope that the two surviving replicas (assuming a
> size of 3) are identical. If not, start tossing a coin. Ideally Ceph
> would have a way to know what happened (as in, it's a checksum mismatch
> and not a real I/O error) and do a rebuild of that object itself.
>
> On another note, have you done any tests using the ZFS compression?
> I'm wondering what the performance impact and efficiency are.
>
> Christian

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com