On Thu, 6 Apr 2017, Piotr Dałek wrote:
> Hello,
>
> We recently had an interesting issue with RBD images and filestore on Jewel
> 10.2.5:
> We have a pool with RBD images, all of them mostly untouched (large areas of
> those images unused), and once we added 3 new OSDs to the cluster, the
> objects representing these images grew substantially on the new OSDs:
> objects hosting unused areas of these images remained small on the original
> OSDs (~8K of space actually used, 4M allocated), but on the new OSDs they
> were large (4M allocated *and* actually used). After investigation we
> concluded that Ceph didn't propagate sparse-file information during the
> cluster rebalance, resulting in correct data contents on all OSDs, but no
> sparse-file data on the new OSDs, hence the increase in disk space usage on
> those.
>
> Example on a test cluster, before growing it by one OSD:
>
> ls:
>
> osd-01-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
> /var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>
> du:
>
> osd-01-cluster: 12
> /var/lib/ceph/osd-01-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> osd-02-cluster: 12
> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> osd-03-cluster: 12
> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>
> mon-01-cluster:~ # rbd diff test
> Offset    Length   Type
> 8388608   4194304  data
> 16777216  4096     data
> 33554432  4194304  data
> 37748736  2048     data
>
> And after growing it:
>
> ls:
>
> clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f -name '*data*' -exec
> ls -l {} \+
> osd-02-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> osd-03-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:18
> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> osd-04-cluster: -rw-r--r-- 1 root root 4194304 Apr 6 09:25
> /var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>
> du:
>
> clush> find /var/lib/ceph/osd-*/current/0.*head/ -type f -name '*data*' -exec
> du -k {} \+
> osd-02-cluster: 12
> /var/lib/ceph/osd-02-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> osd-03-cluster: 12
> /var/lib/ceph/osd-03-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
> osd-04-cluster: 4100
> /var/lib/ceph/osd-04-cluster/current/0.27_head/rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0
>
> Note that "rbd\udata.12a474b0dc51.0000000000000008__head_2DD64767__0" grew
> from 12 KB to 4100 KB when copied from the other OSDs to osd-04.
>
> Is this something to be expected? Is there any way to make it propagate the
> sparse-file info? Or should we think about issuing a "fallocate -d"-like
> patch for writes on filestore?
>
> (We're using kernel 3.13.0-45-generic, but on 4.4.0-31-generic the issue
> remains; our XFS uses a 4K bsize.)

I think the solution here is to use sparse_read during recovery.
The PushOp data representation already supports it; it's just a matter of
skipping the zeros. The recovery code could also have an option to check for
fully-zero regions of the data and turn those into holes as well. For
ReplicatedBackend, see build_push_op().

sage
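To make the sparse_read suggestion concrete: on filestore the primary can ask
the filesystem where the holes are instead of reading 4M of mostly zeros.
Below is a minimal standalone sketch (an illustration only, not Ceph's
sparse_read or build_push_op(); the data_extents() helper is a hypothetical
name) that enumerates a file's data extents with Linux lseek(2)
SEEK_DATA/SEEK_HOLE. A push op built from these (offset, length) intervals
would carry no zeros, and writing only those intervals on the destination OSD
recreates the holes.

// sparse_extents.cc -- illustration only, not Ceph code.
// Enumerate the allocated extents of an object file via SEEK_DATA/SEEK_HOLE.
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

static std::vector<std::pair<off_t, off_t>> data_extents(int fd) {
  std::vector<std::pair<off_t, off_t>> out;
  off_t end = lseek(fd, 0, SEEK_END);
  off_t pos = 0;
  while (pos < end) {
    off_t data = lseek(fd, pos, SEEK_DATA);   // next byte that is not a hole
    if (data < 0)
      break;                                  // ENXIO: only holes remain
    off_t hole = lseek(fd, data, SEEK_HOLE);  // end of this data run
    out.emplace_back(data, hole - data);      // (offset, length) interval
    pos = hole;
  }
  return out;
}

int main(int argc, char **argv) {
  if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
  int fd = open(argv[1], O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }
  for (const auto &[off, len] : data_extents(fd))
    printf("data: offset=%jd length=%jd\n", (intmax_t)off, (intmax_t)len);
  close(fd);
  return 0;
}

Filesystems without hole tracking simply report the whole file as a single
data extent, which degenerates to today's (non-sparse) recovery behaviour.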
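And the second option, the "fallocate -d"-like pass: after a pushed object is
written (or, as Piotr suggests, on filestore writes generally), scan it for
fully-zero blocks and punch them back out. Again a sketch under assumptions:
the 4K block size matches the XFS bsize mentioned above, and
punch_zero_blocks() is a made-up name, not a filestore API.

// punch_holes.cc -- illustration only, not a filestore patch.
// Find runs of all-zero blocks in a file and deallocate them, restoring
// sparseness the way "fallocate -d" does.
#include <fcntl.h>      // fallocate(2), FALLOC_FL_* (Linux-specific)
#include <unistd.h>
#include <algorithm>
#include <vector>

// Returns the number of bytes handed back to the filesystem.
static off_t punch_zero_blocks(int fd, off_t file_size, size_t block = 4096) {
  std::vector<char> buf(block);
  off_t punched = 0;
  for (off_t off = 0; off < file_size; off += block) {
    ssize_t n = pread(fd, buf.data(), block, off);
    if (n <= 0)
      break;
    if (std::all_of(buf.begin(), buf.begin() + n,
                    [](char c) { return c == 0; })) {
      // KEEP_SIZE is mandatory with PUNCH_HOLE: the file length stays the
      // same; only the backing blocks are released.
      if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                    off, n) == 0)
        punched += n;
    }
  }
  return punched;
}

Unlike the sparse-read approach, this also reclaims regions that were
explicitly written as zeros, at the cost of reading the data back once.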