Re: Sparse file info in filestore not propagated to other OSDs

Piotr Dałek <piotr.dalek@xxxxxxxxxxxx> · Thu, 6 Apr 2017 16:24:06 +0200

On 04/06/2017 03:55 PM, Sage Weil wrote:
On Thu, 6 Apr 2017, Piotr Dałek wrote:
On 04/06/2017 03:25 PM, Sage Weil wrote:
On Thu, 6 Apr 2017, Piotr Dałek wrote:
Hello,

We recently had an interesting issue with RBD images and filestore on
Jewel 10.2.5:
We have a pool with RBD images, all of them mostly untouched (large
areas of those images unused), and once we added 3 new OSDs to the
cluster, objects representing these images grew substantially on the
new OSDs: objects hosting unused areas of these images on the original
OSDs remained small (~8K of space actually used, 4M allocated), but on
the new OSDs they were large (4M allocated *and* actually used). After
investigation we concluded that Ceph didn't propagate sparse file
information during cluster rebalance, resulting in correct data
contents on all OSDs, but no sparse file data on the new OSDs, hence
the disk space usage increase on those.

[..]

I think the solution here is to use sparse_read during recovery.  The
PushOp data representation already supports it; it's just a matter of
skipping the zeros.  The recovery code could also have an option to check
for fully-zero regions of the data and turn those into holes as well.  For
ReplicatedBackend, see build_push_op().
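
To make the "check for fully-zero regions and turn those into holes" idea concrete, here is a minimal sketch in Python. It is not Ceph code and does not reflect how build_push_op() or PushOp are actually structured; the names data_extents and apply_sparse and the 4096-byte block size are assumptions chosen for illustration. The sender side scans a buffer for non-zero blocks, and the receiver side writes only those extents so the gaps stay unallocated.

```python
import os

BLOCK = 4096  # assumed hole granularity; real filesystems may differ


def data_extents(buf, block=BLOCK):
    """Return [(offset, length), ...] covering every block that is not all zeros."""
    extents = []
    for off in range(0, len(buf), block):
        chunk = buf[off:off + block]
        if chunk == bytes(len(chunk)):
            continue  # fully-zero block: leave it out so it can become a hole
        if extents and extents[-1][0] + extents[-1][1] == off:
            # Adjacent to the previous extent: merge the two.
            extents[-1] = (extents[-1][0], extents[-1][1] + len(chunk))
        else:
            extents.append((off, len(chunk)))
    return extents


def apply_sparse(path, size, extents, buf):
    """Write only the data extents, then set the final size so gaps remain holes."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for off, length in extents:
            view = memoryview(buf)[off:off + length]
            written = 0
            while written < length:  # pwrite may write fewer bytes than asked
                written += os.pwrite(fd, view[written:], off + written)
        os.ftruncate(fd, size)  # extends the file without allocating blocks
    finally:
        os.close(fd)
```

For a 4M object with only its first 8K in use, data_extents() yields a single (0, 8192) extent, so only 8K would need to be pushed and written. Whether the skipped ranges really end up unallocated depends on the target filesystem.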

Can we abuse that to reduce the amount of regular (client/inter-OSD)
network traffic?

Yeah... I wouldn't call it abuse :).  sparse_read() will use
SEEK_HOLE/SEEK_DATA on filestore (if enabled).  On bluestore we have the
metadata on-hand.  It may be a bit slower, though... more complexity
and such.  They recently implemented something like this for the kernel
NFS server and found it was faster for very sparse files but the rest of
the time it was a fair bit slower.
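
As a rough illustration of what a SEEK_HOLE/SEEK_DATA walk looks like, here is a small Python sketch using os.lseek. This is not the filestore implementation; file_data_extents is a hypothetical helper name, and the code assumes a platform where os.SEEK_DATA and os.SEEK_HOLE are available (for example Linux). On a filesystem that does not track holes, the whole file is simply reported as one data extent.

```python
import errno
import os


def file_data_extents(fd):
    """List (offset, length) data extents of an open file via SEEK_DATA/SEEK_HOLE."""
    size = os.fstat(fd).st_size
    extents = []
    pos = 0
    while pos < size:
        try:
            start = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:
                break  # no data at or after pos: the rest of the file is a hole
            raise
        # End of file counts as an implicit hole, so this always terminates.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        extents.append((start, end - start))
        pos = end
    return extents
```

Each iteration costs two extra lseek calls, which gives some intuition for why this approach can lose to a plain read when a file is mostly data.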

I was wondering if we could modify regular reads in a way that makes them
work as they do now, but without transmitting zeroed-out pages/blocks/objects
(in other words, you would still get bufferptrs full of zeroes, but they
wouldn't be transmitted as such over the wire; a specialized case of RLE
compression). That shouldn't be much slower. But I don't really see how
that would work without a protocol change... Well, at least it's possible to
replace some of the calls to read with sparse read, utilizing
filesystem/filestore metadata to do the heavy lifting for us.
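
To show what I mean by a specialized case of RLE, here is a toy sketch in Python. The framing is entirely made up for illustration (a 1-byte tag plus a 4-byte little-endian length per run) and has nothing to do with the actual Ceph wire protocol; encode, decode and the 4096-byte page size are assumptions as well. Zero pages are sent as a length only, and the receiver re-expands them, so the caller still sees a buffer full of zeroes.

```python
import struct

PAGE = 4096                  # assumed page size for zero detection
_HDR = struct.Struct("<BI")  # hypothetical header: tag, run length in bytes
ZERO, LITERAL = 0, 1


def encode(buf, page=PAGE):
    """Collapse runs of all-zero pages into a header with no payload."""
    runs = []  # each entry is [kind, start, end)
    for off in range(0, len(buf), page):
        chunk = buf[off:off + page]
        kind = ZERO if chunk == bytes(len(chunk)) else LITERAL
        if runs and runs[-1][0] == kind:
            runs[-1][2] = off + len(chunk)  # extend the current run
        else:
            runs.append([kind, off, off + len(chunk)])
    out = bytearray()
    for kind, start, end in runs:
        out += _HDR.pack(kind, end - start)
        if kind == LITERAL:
            out += buf[start:end]  # only non-zero runs carry a payload
    return bytes(out)


def decode(wire):
    """Rebuild the full buffer, regenerating the zero runs locally."""
    out = bytearray()
    pos = 0
    while pos < len(wire):
        kind, length = _HDR.unpack_from(wire, pos)
        pos += _HDR.size
        if kind == ZERO:
            out += bytes(length)
        else:
            out += wire[pos:pos + length]
            pos += length
    return bytes(out)
```

With this framing, a 4M buffer with 8K of data at the start encodes to two headers plus the 8K payload. The sender still has to read and scan the whole buffer, which is why using sparse read and letting the store's metadata find the holes looks more attractive.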

--
Piotr Dałek
piotr.dalek@xxxxxxxxxxxx
https://www.ovh.com/us/