On Fri, 31 Oct 2014 16:32:49 +0000 Michal Kozanecki wrote:

> I'll test this by manually inducing corrupted data to the ZFS filesystem
> and report back how ZFS+ceph interact during a detected file
> failure/corruption, how it recovers and any manual steps required, and
> report back with the results.
>
Looking forward to that.

> As for compression, using lz4 the CPU impact is around 5-20% depending
> on load, type of I/O and I/O size, with little-to-no I/O performance
> impact, and in fact in some cases the I/O performance actually
> increases. I'm currently looking at a compression ratio on the ZFS
> datasets of around 30-35% for a data consisting of rbd backed OpenStack
> KVM VMs.
>
I'm looking at a similar deployment (VM images) and over 30% compression
would at least offset ZFS's need to keep at least 20% free space (or
suffer massive performance degradation otherwise).

CPU usage looks acceptable, however in combination with SSD-backed OSDs
it is another thing to consider. As in, is it worth spending X amount of
money on faster CPUs and 10-20% space savings, or will another SSD be
cheaper?

I'm trying to position Ceph against SolidFire, who are claiming 4-10
times data reduction from a combination of compression, deduplication and
thin provisioning. Without, of course, quantifying things like which step
gives which reduction based on what sample data.

> I have not tried any sort of dedupe as it is memory intensive
> and I only had 24GB of ram on each node. I'll grab some FIO benchmarks
> and report back.
>
I foresee a massive failure here, despite the huge potential of one use
case here where all VMs are basically identical (KSM is very effective
with those, too).

Why the predicted failure? Several reasons:

1. Deduping is only local, per OSD. That will make a big dent, but with
many nearly identical VM images there should still be quite a bit of
identical data per OSD. However...

2. Data alignment. The default RADOS objects making up RBD images are
4MB, which, given my limited knowledge of ZFS, I presume get mapped to
128KB ZFS blocks that are then subject to the deduping process. However,
even if one were to install the same OS on identically sized RBD images,
I predict subtle differences in alignment within those objects and thus
within the ZFS blocks (see the toy sketch below). That becomes a near
certainty once those images (OS installs) are customized, files added or
deleted, etc.

3. ZFS block size and VM FS metadata. Even if all the data were
perfectly, identically aligned within the 4MB RADOS objects, the
resulting 128KB ZFS blocks are likely to contain metadata such as inodes
(creation times), making them subtly different and thus not eligible for
deduping.

OTOH SolidFire claims to be doing global deduplication; how they do that
efficiently is a bit beyond me, especially given the memory sizes of
their appliances. My guess is that they keep a map on disk (all SSDs) on
each node instead of keeping it in RAM. I suppose the updates (writes to
the SSDs) of this map are still substantially smaller than the data that
would otherwise be written without deduping.

Thus I think Ceph will need a similar approach for any deduping to work,
in combination with a much finer-grained "block size". The latter, I
believe, is already being discussed in the context of cache tier pools;
having to promote/demote a 4MB blob for a single hot 4KB of data is
hardly efficient.
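To make point 2 a bit more concrete, here is a quick toy sketch (Python,
nothing ZFS-specific; the 128KB block size and the 37-byte shift are just
assumptions for illustration) of how a tiny offset between otherwise
identical data defeats fixed-block deduplication:

    # Toy model of fixed-block dedup: hash 128KB blocks of two nearly
    # identical byte streams and count how many blocks they share.
    import hashlib
    import os

    BLOCK = 128 * 1024  # assumed dedup unit (ZFS recordsize)

    def block_hashes(data):
        """SHA-256 digests of consecutive fixed-size blocks."""
        return {hashlib.sha256(data[i:i + BLOCK]).digest()
                for i in range(0, len(data), BLOCK)}

    base = os.urandom(16 * 1024 * 1024)   # stand-in for 16MB of VM image data
    aligned = base                        # identical copy, same alignment
    shifted = b"\x00" * 37 + base[:-37]   # same data, shifted by 37 bytes

    total = len(block_hashes(base))
    print("aligned copy:", len(block_hashes(base) & block_hashes(aligned)),
          "of", total, "blocks dedupable")
    print("shifted copy:", len(block_hashes(base) & block_hashes(shifted)),
          "of", total, "blocks dedupable")

The aligned copy dedupes completely, the shifted one essentially not at
all, which is roughly what I expect once the same OS install ends up at
slightly different offsets inside the 4MB objects.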
Regards,

Christian

> Cheers,
>
>
> -----Original Message-----
> From: Christian Balzer [mailto:chibi@xxxxxxx]
> Sent: October-30-14 4:12 AM
> To: ceph-users
> Cc: Michal Kozanecki
> Subject: Re: use ZFS for OSDs
>
> On Wed, 29 Oct 2014 15:32:57 +0000 Michal Kozanecki wrote:
>
> [snip]
>
> > With Ceph handling the
> > redundancy at the OSD level I saw no need for using ZFS mirroring or
> > zraid, instead if ZFS detects corruption instead of self-healing it
> > sends a read failure of the pg file to ceph, and then ceph's scrub
> > mechanisms should then repair/replace the pg file using a good replica
> > elsewhere on the cluster. ZFS + ceph are a beautiful bitrot fighting
> > match!
> >
> Could you elaborate on that?
> AFAIK Ceph currently has no way to determine which of the replicas is
> "good"; one such failed PG object will require you to do a manual repair
> after the scrub and hope that the two surviving replicas (assuming a
> size of 3) are identical. If not, start tossing a coin. Ideally Ceph
> would have a way to know what happened (as in, it's a checksum mismatch
> and not a real I/O error) and do a rebuild of that object itself.
>
> On another note, have you done any tests using the ZFS compression?
> I'm wondering what the performance impact and efficiency are.
>
> Christian

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com