Re: Sparse file info in filestore not propagated to other OSDs

On 04/13/2017 04:23 PM, Piotr Dałek wrote:
> On 04/06/2017 03:25 PM, Sage Weil wrote:
>> On Thu, 6 Apr 2017, Piotr Dałek wrote:
>>> Hello,
>>>
>>> We recently had an interesting issue with RBD images and filestore on
>>> Jewel 10.2.5:
>>> We have a pool with RBD images, all of them mostly untouched (large
>>> areas of those images unused). Once we added 3 new OSDs to the
>>> cluster, the objects representing these images grew substantially on
>>> the new OSDs: objects hosting unused areas of these images remained
>>> small on the original OSDs (~8K of space actually used, 4M
>>> allocated), but on the new OSDs they were large (4M allocated *and*
>>> actually used). After investigation we concluded that Ceph didn't
>>> propagate the sparse file information during cluster rebalance,
>>> resulting in correct data contents on all OSDs, but no sparseness on
>>> the new OSDs, hence the disk space usage increase there.
>>>
>>> An example from a test cluster, before growing it by one OSD:
>>>
>>> ls:
>>>
>>> osd-01-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
>>> /var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
>>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
>>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>>
>>> du:
>>>
>>> osd-01-cluster: 12
>>> /var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-02-cluster: 12
>>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-03-cluster: 12
>>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>>
>>>
>>> mon-01-cluster:~ # rbd diff test
>>> Offset   Length  Type
>>> 8388608  4194304 data
>>> 16777216 4096    data
>>> 33554432 4194304 data
>>> 37748736 2048    data
>>>
>>> And after growing it:
>>>
>>> ls:
>>>
>>> clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f \
>>>     -name '*data*' -exec ls -l {} \+
>>> osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
>>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:18
>>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-04-cluster: -rw-r--r-- 1 root root 4194304 Apr  6 09:25
>>> /var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>>
>>> du:
>>>
>>> clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f \
>>>     -name '*data*' -exec du -k {} \+
>>> osd-02-cluster: 12
>>> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-03-cluster: 12
>>> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>> osd-04-cluster: 4100
>>> /var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>>>
>>>
>>> Note that
>>> "rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0" grew
>>> from 12 KB to 4100 KB when copied from the other OSDs to osd-04.
>>>
>>> Is this expected? Is there any way to make Ceph propagate the sparse
>>> file info? Or should we think about writing a "fallocate -d"-like
>>> patch for writes on filestore?
>>>
>>> (We're using kernel 3.13.0-45-generic, but the issue remains on
>>> 4.4.0-31-generic; our XFS uses a 4K bsize.)
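
For context, a "fallocate -d"-style pass boils down to scanning the file
and punching holes wherever a chunk is already all zeroes. A rough
sketch only, with hypothetical names and an arbitrary chunk size:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <linux/falloc.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  /* Return nonzero if the whole buffer is zero-filled. */
  static int chunk_is_zero(const char *p, size_t n)
  {
      return n == 0 || (p[0] == 0 && !memcmp(p, p + 1, n - 1));
  }

  /* Sketch: deallocate already-zero chunks of an open file while
   * keeping its apparent size; a real patch would respect the
   * filesystem block size (4K on the XFS above). */
  int dig_holes(int fd)
  {
      enum { CHUNK = 65536 };
      char *buf = malloc(CHUNK);
      off_t off = 0;
      ssize_t n;

      if (!buf)
          return -1;
      while ((n = pread(fd, buf, CHUNK, off)) > 0) {
          if (chunk_is_zero(buf, (size_t)n) &&
              /* Free the blocks but keep the file size unchanged. */
              fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                        off, n) < 0) {
              free(buf);
              return -1;
          }
          off += n;
      }
      free(buf);
      return n == 0 ? 0 : -1;
  }

util-linux's fallocate --dig-holes does essentially this; punched ranges
still read back as zeroes but no longer occupy disk space.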
>>
>> I think the solution here is to use sparse_read during recovery.  The
>> PushOp data representation already supports it; it's just a matter of
>> skipping the zeros.  The recovery code could also have an option to
>> check for fully-zero regions of the data and turn those into holes as
>> well.  For ReplicatedBackend, see build_push_op().
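
To illustrate the zero-skipping idea (a sketch, not the actual
build_push_op() code; names here are hypothetical): recovery would ship
only the non-zero extents as (offset, length) intervals and leave the
holes implied on the receiving side.

  #include <stddef.h>
  #include <string.h>

  struct extent { size_t off, len; };

  static int all_zero(const char *p, size_t n)
  {
      return n == 0 || (p[0] == 0 && !memcmp(p, p + 1, n - 1));
  }

  /* Sketch: split buf into block-sized pieces and record only the
   * runs that contain data; zero runs become holes on the target. */
  size_t nonzero_extents(const char *buf, size_t size, size_t block,
                         struct extent *out, size_t max)
  {
      size_t n = 0, start = 0;
      int in_data = 0;

      for (size_t off = 0; off < size; off += block) {
          size_t len = size - off < block ? size - off : block;

          if (!all_zero(buf + off, len)) {
              if (!in_data) {
                  start = off;
                  in_data = 1;
              }
          } else if (in_data) {
              if (n < max)
                  out[n++] = (struct extent){ start, off - start };
              in_data = 0;
          }
      }
      if (in_data && n < max)
          out[n++] = (struct extent){ start, size - start };
      return n;
  }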
>
> So far it turns out that there's an even easier solution: we just
> enabled "filestore seek hole" on a test cluster, and that seems to fix
> the problem for us. We'll see if fiemap works too.
>
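
For anyone unfamiliar with the mechanism behind that option: lseek()
with SEEK_DATA/SEEK_HOLE enumerates just the allocated extents of a
file, so holes never have to be read or shipped. A minimal sketch
(hypothetical helper; assumes a kernel and filesystem with SEEK_HOLE
support, which XFS has on the kernels mentioned above):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Sketch: print the allocated (data) extents of a file the way a
   * SEEK_HOLE-aware read path can, skipping the holes entirely. */
  int list_data_extents(const char *path)
  {
      int fd = open(path, O_RDONLY);
      off_t end, data, hole = 0;

      if (fd < 0)
          return -1;
      end = lseek(fd, 0, SEEK_END);
      /* SEEK_DATA fails with ENXIO once we are inside the trailing
       * hole, which ends the loop. */
      while ((data = lseek(fd, hole, SEEK_DATA)) >= 0 && data < end) {
          hole = lseek(fd, data, SEEK_HOLE);
          printf("data extent: offset=%lld length=%lld\n",
                 (long long)data, (long long)(hole - data));
      }
      close(fd);
      return 0;
  }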

Is it safe to enable "filestore seek hole"? Are there any tests that
verify that everything related to RBD works fine with it enabled? Can we
make it enabled by default?

I tested a few of our production images and it seems that about 30% of
their space is sparse. That sparseness will be lost on any cluster-wide
data movement (adding/removing nodes, PG count growth, recovery).
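
(One way to estimate that per object file, equivalent to the ls-vs-du
comparison above; hypothetical helper:)

  #include <sys/stat.h>

  /* Sketch: fraction of a file's apparent size that is not actually
   * allocated on disk; st_blocks is in 512-byte units. */
  double sparse_fraction(const char *path)
  {
      struct stat st;

      if (stat(path, &st) != 0 || st.st_size == 0)
          return 0.0;
      return 1.0 - (double)st.st_blocks * 512.0 / (double)st.st_size;
  }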

How is this handled, or going to be handled, in BlueStore?


(Added ceph-users, as this might interest others as well.)

-- 
PS