Re: Sparse file info in filestore not propagated to other OSDs

Piotr Dałek <piotr.dalek@xxxxxxxxxxxx> · Wed, 21 Jun 2017 15:46:13 +0200

On 17-06-21 03:24 PM, Sage Weil wrote:
On Wed, 21 Jun 2017, Piotr Dałek wrote:
On 17-06-14 03:44 PM, Sage Weil wrote:
On Wed, 14 Jun 2017, Paweł Sadowski wrote:
On 04/13/2017 04:23 PM, Piotr Dałek wrote:
On 04/06/2017 03:25 PM, Sage Weil wrote:
On Thu, 6 Apr 2017, Piotr Dałek wrote:
[snip]

I think the solution here is to use sparse_read during recovery.  The
PushOp data representation already supports it; it's just a matter of
skipping the zeros.  The recovery code could also have an option to
check
for fully-zero regions of the data and turn those into holes as
well.  For
ReplicatedBackend, see build_push_op().

So far it turns out that there's even easier solution, we just enabled
"filestore seek hole" on some test cluster and that seems to fix the
problem for us. We'll see if fiemap works too.

Is it safe to enable "filestore seek hole", are there any tests that
verifies that everything related to RBD works fine with this enabled?
Can we make this enabled by default?

We would need to enable it in the qa environment first.  The risk here is
that users run a broad range of kernels and we are exposing ourselves to
any bugs in any kernel version they may run.  I'd prefer to leave it off
by default.

That's a common regression? If not, we could blacklist particular kernels and
call it a day.
>>
We can enable it in the qa suite, though, which covers
centos7 (latest kernel) and ubuntu xenial and trusty.

+1. Do you need some particular PR for that?

Sure.  How about a patch that adds the config option to several of the
files in qa/suites/rados/thrash/thrashers?

OK.

I tested on few of our production images and it seems that about 30% is
sparse. This will be lost on any cluster wide event (add/remove nodes,
PG grow, recovery).

How this is/will be handled in BlueStore?

BlueStore exposes the same sparseness metadata that enabling the
filestore seek hole or fiemap options does, so it won't be a problem
there.

I think the only thing that we could potentially add is zero detection
on writes (so that explicitly writing zeros consumes no space).  We'd
have to be a bit careful measuring the performance impact of that check on
non-zero writes.

I saw that RBD (librbd) does that - replacing writes with discards when buffer
contains only zeros. Some code that does the same in librados could be added
and it shouldn't impact performance much, current implementation of
mem_is_zero is fast and shouldn't be a big problem.

I'd rather not have librados silently translating requests; I think it
makes more sense to do any zero checking in bluestore.  _do_write_small
and _do_write_big already break writes into (aligned) chunks; that would
be an easy place to add the check.

That leaves out filestore.

And while I get your point, doing it on librados level would reduce network 
usage for zeroed out regions as well, and check could be done just once, not 
replica_size times...

--
Piotr Dałek
piotr.dalek@xxxxxxxxxxxx
https://www.ovh.com/us/
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html