Re: Sparse file info in filestore not propagated to other OSDs

I don't recall a configuration option, but librbd always uses sparse
reads when the cache is disabled and never uses sparse reads for
cache-based reads. I'm pretty sure there wasn't a rationale for the
split -- instead, I've always assumed it was an oversight when that
feature was added years ago.
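
For context, a sparse read from the client side via the librados C++ API looks
roughly like the snippet below: the extent map says which ranges actually hold
data and the bufferlist carries only those bytes, so holes never cross the
wire. This is just an illustrative sketch; the pool and object names are made
up.

#include <rados/librados.hpp>
#include <map>
#include <iostream>

int main() {
  librados::Rados cluster;
  cluster.init("admin");                      // connect as client.admin
  cluster.conf_read_file(nullptr);            // default ceph.conf search path
  if (cluster.connect() < 0) return 1;

  librados::IoCtx ioctx;
  if (cluster.ioctx_create("rbd", ioctx) < 0) return 1;   // example pool name

  std::map<uint64_t, uint64_t> extents;       // offset -> length of data
  librados::bufferlist data;                  // only the non-hole bytes
  int r = ioctx.sparse_read("rbd_data.1234.0000000000000000",  // example object
                            extents, data, 4 << 20, 0);
  if (r < 0) return 1;

  for (auto& [off, len] : extents)
    std::cout << "data extent: off=" << off << " len=" << len << "\n";
  // Anything in the read range not covered by 'extents' is a hole.
  return 0;
}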

On Thu, Apr 6, 2017 at 10:27 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>> On 04/06/2017 03:55 PM, Sage Weil wrote:
>> > On Thu, 6 Apr 2017, Piotr Dałek wrote:
>> > > On 04/06/2017 03:25 PM, Sage Weil wrote:
>> > > > On Thu, 6 Apr 2017, Piotr Dałek wrote:
>> > > > > Hello,
>> > > > >
>> > > > > We recently had an interesting issue with RBD images and
>> > > > > filestore on Jewel 10.2.5: we have a pool with RBD images, all
>> > > > > of them mostly untouched (large areas of those images unused),
>> > > > > and once we added 3 new OSDs to the cluster, the objects
>> > > > > representing these images grew substantially on the new OSDs.
>> > > > > Objects hosting unused areas of these images remained small on
>> > > > > the original OSDs (~8K of space actually used, 4M allocated),
>> > > > > but on the new OSDs they were large (4M allocated *and*
>> > > > > actually used). After investigating, we concluded that Ceph
>> > > > > didn't propagate the sparse file information during cluster
>> > > > > rebalance, resulting in correct data contents on all OSDs but
>> > > > > no sparseness on the new OSDs, hence the increased disk space
>> > > > > usage there.
>> > > > >
>> > > > > [..]
>> > > >
>> > > > I think the solution here is to use sparse_read during recovery.
>> > > > The PushOp data representation already supports it; it's just a
>> > > > matter of skipping the zeros.  The recovery code could also have
>> > > > an option to check for fully-zero regions of the data and turn
>> > > > those into holes as well.  For ReplicatedBackend, see
>> > > > build_push_op().
>> > >
>> > > Can we abuse that to reduce the amount of regular (client/inter-osd)
>> > > network traffic?
>> >
>> > Yeah... I wouldn't call it abuse :).  sparse_read() will use
>> > SEEK_HOLE/SEEK_DATA on filestore (if enabled).  On bluestore we have the
>> > metadata on hand.  It may be a bit slower, though... more complexity
>> > and such.  They recently implemented something like this for the kernel
>> > NFS server and found it was faster for very sparse files but the rest of
>> > the time it was a fair bit slower.
>>
>> I was wondering if we could modify regular reads so that they behave as
>> they do now, but don't transmit zeroed-out pages/blocks/objects (in other
>> words, you would still get bufferptrs full of zeroes, but they wouldn't be
>> transmitted as such over the wire; a specialized case of RLE compression).
>> That shouldn't be much slower. But I don't really see how that would work
>> without a protocol change... Well, at least it's possible to replace some
>> of the calls to read with sparse read, utilizing filesystem/filestore
>> metadata to do the heavy lifting for us.
>
> IIRC librbd used to have an option to do sparse-read all the time instead
> of read (I think this was in ObjectCacher somewhere?) but I think it got
> turned off for some reason?  Memory is very fuzzy here.  In any case,
> changing the client to use sparse-read is the way to do it, I think.
> I'm a bit skeptical that this will have much of an impact, though.
>
> sage
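
On the filestore side, the SEEK_HOLE/SEEK_DATA walk Sage mentions above is
roughly the following (plain Linux lseek(), not Ceph code; it assumes a
filesystem such as XFS, ext4 or btrfs that supports these flags):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1      // for SEEK_DATA / SEEK_HOLE on glibc
#endif
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <cstdio>

int main(int argc, char** argv) {
  if (argc < 2) return 1;
  int fd = open(argv[1], O_RDONLY);
  if (fd < 0) return 1;

  struct stat st;
  if (fstat(fd, &st) < 0) return 1;

  off_t pos = 0;
  while (pos < st.st_size) {
    off_t data = lseek(fd, pos, SEEK_DATA);   // start of next data extent
    if (data < 0) break;                      // no more data (rest is hole)
    off_t hole = lseek(fd, data, SEEK_HOLE);  // end of that data extent
    if (hole < 0) hole = st.st_size;
    printf("data extent: off=%lld len=%lld\n",
           (long long)data, (long long)(hole - data));
    pos = hole;
  }
  close(fd);
  return 0;
}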
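
And the "check for fully-zero regions and turn those into holes" idea for
build_push_op() (and for the wire-level savings Piotr describes) could look
roughly like the sketch below. This is not Ceph code, just an illustration of
scanning a buffer in 4K chunks and keeping only the ranges that actually
contain data; the function name and chunk size are made up.

#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Return offset -> length for each maximal run of chunk-sized regions that
// contain at least one non-zero byte; everything else can stay a hole.
std::map<uint64_t, uint64_t> nonzero_extents(const std::vector<uint8_t>& buf,
                                             uint64_t chunk = 4096) {
  std::map<uint64_t, uint64_t> extents;
  uint64_t run_start = 0, run_len = 0;
  for (uint64_t off = 0; off < buf.size(); off += chunk) {
    uint64_t len = std::min<uint64_t>(chunk, buf.size() - off);
    bool all_zero = true;
    for (uint64_t i = off; i < off + len; ++i) {
      if (buf[i] != 0) { all_zero = false; break; }
    }
    if (!all_zero) {
      if (run_len == 0) run_start = off;   // start a new data extent
      run_len += len;
    } else if (run_len > 0) {
      extents[run_start] = run_len;        // close the current extent
      run_len = 0;
    }
  }
  if (run_len > 0) extents[run_start] = run_len;
  return extents;
}

int main() {
  std::vector<uint8_t> obj(4u << 20, 0);   // a mostly-empty 4M object
  obj[2 * 4096] = 0x42;                    // one dirty byte in the third chunk
  auto extents = nonzero_extents(obj);
  // extents now holds a single entry {8192 -> 4096}: only that chunk would
  // need to be pushed or sent; the rest could stay (or become) holes.
  return 0;
}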



-- 
Jason