On Thu, 6 Apr 2017, Piotr Dałek wrote:
> On 04/06/2017 03:55 PM, Sage Weil wrote:
> > On Thu, 6 Apr 2017, Piotr Dałek wrote:
> > > On 04/06/2017 03:25 PM, Sage Weil wrote:
> > > > On Thu, 6 Apr 2017, Piotr Dałek wrote:
> > > > > Hello,
> > > > >
> > > > > We recently had an interesting issue with RBD images and filestore
> > > > > on Jewel 10.2.5: we have a pool with RBD images, all of them mostly
> > > > > untouched (large areas of those images unused), and once we added 3
> > > > > new OSDs to the cluster, the objects representing these images grew
> > > > > substantially on the new OSDs: objects hosting unused areas of these
> > > > > images on the original OSDs remained small (~8K of space actually
> > > > > used, 4M allocated), but on the new OSDs they were large (4M
> > > > > allocated *and* actually used). After investigation we concluded
> > > > > that Ceph didn't propagate sparse file information during cluster
> > > > > rebalance, resulting in correct data contents on all OSDs, but no
> > > > > sparse file data on the new OSDs, hence the disk space usage
> > > > > increase there.
> > > > >
> > > > > [..]
> > > >
> > > > I think the solution here is to use sparse_read during recovery. The
> > > > PushOp data representation already supports it; it's just a matter of
> > > > skipping the zeros. The recovery code could also have an option to
> > > > check for fully-zero regions of the data and turn those into holes as
> > > > well. For ReplicatedBackend, see build_push_op().
> > >
> > > Can we abuse that to reduce the amount of regular (client/inter-OSD)
> > > network traffic?
> >
> > Yeah... I wouldn't call it abuse :). sparse_read() will use
> > SEEK_HOLE/SEEK_DATA on filestore (if enabled). On bluestore we have the
> > metadata on hand. It may be a bit slower, though... more complexity and
> > such. They recently implemented something like this for the kernel NFS
> > server and found it was faster for very sparse files, but the rest of the
> > time it was a fair bit slower.
>
> I was wondering if we could modify regular reads so that they behave as
> they do now, but don't transmit zeroed-out pages/blocks/objects (in other
> words, you would still get bufferptrs full of zeroes, but they wouldn't be
> transmitted as such over the wire; a specialized case of RLE compression).
> That shouldn't be much slower. But I don't really see how that would work
> without a protocol change... Well, at least it's possible to replace some
> calls to read with sparse read, using filesystem/file store metadata to do
> the heavy lifting for us.

IIRC librbd used to have an option to do sparse-read all the time instead of
read (I think this was in ObjectCacher somewhere?) but I think it got turned
off for some reason? Memory is very fuzzy here. In any case, changing the
client to use sparse-read is the way to do it, I think. I'm a bit skeptical
that this will have much of an impact, though.

sage
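
For reference, here is a minimal standalone sketch (not the actual Ceph
FileStore or ReplicatedBackend::build_push_op() code; all names are
illustrative) of the two mechanisms discussed above: enumerating the data
extents of a sparse file with SEEK_DATA/SEEK_HOLE, which is the filesystem
metadata a sparse read can lean on, and scanning a buffer for fully-zero
regions so a sender could turn them into holes instead of shipping zeros.

// Illustrative sketch only -- not the Ceph code paths discussed above.
// Compile with: g++ -std=c++11 sparse_sketch.cc -o sparse_sketch
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // SEEK_DATA / SEEK_HOLE on Linux
#endif
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <utility>
#include <vector>
#include <fcntl.h>
#include <unistd.h>

// (a) Enumerate the (offset, length) data extents of an open file.  Anything
//     not covered by an extent is a hole and never needs to hit the wire.
std::vector<std::pair<off_t, off_t> > data_extents(int fd)
{
  std::vector<std::pair<off_t, off_t> > extents;
  off_t end = lseek(fd, 0, SEEK_END);
  off_t pos = 0;
  while (pos < end) {
    off_t data = lseek(fd, pos, SEEK_DATA);
    if (data < 0)              // no more data, or SEEK_DATA unsupported
      break;
    off_t hole = lseek(fd, data, SEEK_HOLE);
    if (hole < 0)
      hole = end;              // data runs to end of file
    extents.push_back(std::make_pair(data, hole - data));
    pos = hole;
  }
  return extents;
}

// (b) Scan a buffer in 4K chunks and keep only the non-zero ranges, i.e. what
//     a sender would actually transmit; zero chunks become holes on the peer.
std::vector<std::pair<size_t, size_t> > nonzero_ranges(const char* buf, size_t len)
{
  static const char zeros[4096] = {0};
  std::vector<std::pair<size_t, size_t> > ranges;
  size_t off = 0;
  while (off < len) {
    size_t n = std::min(sizeof(zeros), len - off);
    if (memcmp(buf + off, zeros, n) != 0) {
      if (!ranges.empty() && ranges.back().first + ranges.back().second == off)
        ranges.back().second += n;   // merge with the previous data range
      else
        ranges.push_back(std::make_pair(off, n));
    }
    off += n;
  }
  return ranges;
}

int main(int argc, char** argv)
{
  if (argc < 2) {
    fprintf(stderr, "usage: %s <sparse-file>\n", argv[0]);
    return 1;
  }
  int fd = open(argv[1], O_RDONLY);
  if (fd < 0) {
    perror("open");
    return 1;
  }
  std::vector<std::pair<off_t, off_t> > ext = data_extents(fd);
  for (size_t i = 0; i < ext.size(); ++i)
    printf("data extent: offset=%lld len=%lld\n",
           (long long)ext[i].first, (long long)ext[i].second);
  close(fd);
  return 0;
}

nonzero_ranges() corresponds to the "check for fully-zero regions and turn
those into holes" option mentioned above; in the real recovery path the
extent map would come from the object store (sparse_read / fiemap /
SEEK_DATA) rather than a userspace scan.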