On Thu, 6 Apr 2017, Piotr Dałek wrote:
> On 04/06/2017 03:25 PM, Sage Weil wrote:
> > On Thu, 6 Apr 2017, Piotr Dałek wrote:
> > > Hello,
> > >
> > > We recently had an interesting issue with RBD images and filestore on
> > > Jewel 10.2.5:
> > > We have a pool with RBD images, all of them mostly untouched (large
> > > areas of those images unused), and once we added 3 new OSDs to cluster,
> > > objects representing these images grew substantially on new OSDs:
> > > objects hosting unused areas of these images on original OSDs remained
> > > small (~8K of space actually used, 4M allocated), but on new OSDs were
> > > large (4M allocated *and* actually used). After investigation we
> > > concluded that Ceph didn't propagate sparse file information during
> > > cluster rebalance, resulting in correct data contents on all OSDs, but
> > > no sparse file data on new OSDs, hence disk space usage increase on
> > > those.
> > >
> > > [..]
> >
> > I think the solution here is to use sparse_read during recovery. The
> > PushOp data representation already supports it; it's just a matter of
> > skipping the zeros. The recovery code could also have an option to check
> > for fully-zero regions of the data and turn those into holes as well. For
> > ReplicatedBackend, see build_push_op().
>
> Can we abuse that to reduce amount of regular (client/inter-osd) network
> traffic?

Yeah... I wouldn't call it abuse :). sparse_read() will use
SEEK_HOLE/SEEK_DATA on filestore (if enabled). On bluestore we have the
metadata on-hand. It may be a bit slower, though... more complexity and
such. They recently implemented something like this for the kernel NFS
server and found it was faster for very sparse files but the rest of the
time it was a fair bit slower.

sage
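
To illustrate the SEEK_HOLE/SEEK_DATA idea mentioned above, here is a rough Python sketch of reading only the data extents of a sparse file and recreating it elsewhere with the holes intact. This is not Ceph code: data_extents(), sparse_read() and sparse_write() are hypothetical helper names made up for this example. It assumes a Linux-like platform where os.SEEK_DATA and os.SEEK_HOLE are available. On a filesystem that does not track holes, the kernel may simply report the whole file as a single data extent, so the copy is still correct but no longer sparse.

```python
import errno
import os


def data_extents(fd, size):
    """Return [(offset, length), ...] for the regions of fd that hold data."""
    extents = []
    pos = 0
    while pos < size:
        try:
            # Jump to the next byte at or after pos that is backed by data.
            start = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:
                break  # nothing but a trailing hole after pos
            raise
        if start >= size:
            break
        # Find where this data run ends (EOF counts as an implicit hole).
        end = min(os.lseek(fd, start, os.SEEK_HOLE), size)
        extents.append((start, end - start))
        pos = end
    return extents


def sparse_read(path):
    """Read only the data extents; return (size, [(offset, bytes), ...])."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        chunks = [(off, os.pread(fd, length, off))
                  for off, length in data_extents(fd, size)]
        return size, chunks
    finally:
        os.close(fd)


def sparse_write(path, size, chunks):
    """Recreate a file from extents, leaving the gaps between them as holes."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.ftruncate(fd, size)  # sets the length without allocating blocks
        for off, data in chunks:
            os.pwrite(fd, data, off)
    finally:
        os.close(fd)
```

The (offset, bytes) list plays roughly the role of the extent map plus payload that a sparse-aware push would carry: only the data runs travel, and the receiver truncates to the full size and writes each run at its offset. Walking the extents costs extra lseek calls per run, which is in line with the remark above that this can be slower for files that are not very sparse.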