On Thu, 6 Apr 2017, Piotr Dałek wrote:
> On 04/06/2017 03:55 PM, Sage Weil wrote:
> > On Thu, 6 Apr 2017, Piotr Dałek wrote:
> > > On 04/06/2017 03:25 PM, Sage Weil wrote:
> > > > On Thu, 6 Apr 2017, Piotr Dałek wrote:
> > > > > Hello,
> > > > >
> > > > > We recently had an interesting issue with RBD images and filestore
> > > > > on Jewel 10.2.5: we have a pool with RBD images, all of them mostly
> > > > > untouched (large areas of those images unused), and once we added 3
> > > > > new OSDs to the cluster, the objects representing these images grew
> > > > > substantially on the new OSDs: objects hosting unused areas of these
> > > > > images on the original OSDs remained small (~8K of space actually
> > > > > used, 4M allocated), but on the new OSDs they were large (4M
> > > > > allocated *and* actually used). After investigation we concluded
> > > > > that Ceph didn't propagate sparse file information during cluster
> > > > > rebalance, resulting in correct data contents on all OSDs, but no
> > > > > sparse file data on the new OSDs, hence the disk space usage
> > > > > increase there.
> > > > >
> > > > > [..]
> > > >
> > > > I think the solution here is to use sparse_read during recovery. The
> > > > PushOp data representation already supports it; it's just a matter of
> > > > skipping the zeros. The recovery code could also have an option to
> > > > check for fully-zero regions of the data and turn those into holes as
> > > > well. For ReplicatedBackend, see build_push_op().
> > >
> > > Can we abuse that to reduce the amount of regular (client/inter-OSD)
> > > network traffic?
> >
> > Yeah... I wouldn't call it abuse :). sparse_read() will use
> > SEEK_HOLE/SEEK_DATA on filestore (if enabled). On bluestore we have the
> > metadata on hand. It may be a bit slower, though... more complexity and
> > such. They recently implemented something like this for the kernel NFS
> > server and found it was faster for very sparse files, but the rest of the
> > time it was a fair bit slower.
>
> I was wondering if we could modify regular reads so that they behave as
> they do now, but don't transmit zeroed-out pages/blocks/objects (in other
> words, you would still get bufferptrs full of zeroes, but they wouldn't be
> transmitted as such over the wire; a specialized case of RLE compression).
> That shouldn't be much slower. But I don't really see how that would work
> without a protocol change... Well, at least it's possible to replace some
> calls to read with sparse read, using filesystem/file store metadata to do
> the heavy lifting for us.

IIRC librbd used to have an option to do sparse-read all the time instead of
read (I think this was in ObjectCacher somewhere?) but I think it got turned
off for some reason? Memory is very fuzzy here. In any case, changing the
client to use sparse-read is the way to do it, I think. I'm a bit skeptical
that this will have much of an impact, though.

sage
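
For reference, here is a minimal standalone sketch (not the actual Ceph
FileStore or ReplicatedBackend::build_push_op() code; all names are
illustrative) of the two mechanisms discussed above: enumerating the data
extents of a sparse file with SEEK_DATA/SEEK_HOLE, which is the filesystem
metadata a sparse read can lean on, and scanning a buffer for fully-zero
regions so a sender could turn them into holes instead of shipping zeros.

// Illustrative sketch only -- not the Ceph code paths discussed above.
// Compile with: g++ -std=c++11 sparse_sketch.cc -o sparse_sketch
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // SEEK_DATA / SEEK_HOLE on Linux
#endif
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <utility>
#include <vector>
#include <fcntl.h>
#include <unistd.h>

// (a) Enumerate the (offset, length) data extents of an open file.  Anything
//     not covered by an extent is a hole and never needs to hit the wire.
std::vector<std::pair<off_t, off_t> > data_extents(int fd)
{
  std::vector<std::pair<off_t, off_t> > extents;
  off_t end = lseek(fd, 0, SEEK_END);
  off_t pos = 0;
  while (pos < end) {
    off_t data = lseek(fd, pos, SEEK_DATA);
    if (data < 0)              // no more data, or SEEK_DATA unsupported
      break;
    off_t hole = lseek(fd, data, SEEK_HOLE);
    if (hole < 0)
      hole = end;              // data runs to end of file
    extents.push_back(std::make_pair(data, hole - data));
    pos = hole;
  }
  return extents;
}

// (b) Scan a buffer in 4K chunks and keep only the non-zero ranges, i.e. what
//     a sender would actually transmit; zero chunks become holes on the peer.
std::vector<std::pair<size_t, size_t> > nonzero_ranges(const char* buf, size_t len)
{
  static const char zeros[4096] = {0};
  std::vector<std::pair<size_t, size_t> > ranges;
  size_t off = 0;
  while (off < len) {
    size_t n = std::min(sizeof(zeros), len - off);
    if (memcmp(buf + off, zeros, n) != 0) {
      if (!ranges.empty() && ranges.back().first + ranges.back().second == off)
        ranges.back().second += n;   // merge with the previous data range
      else
        ranges.push_back(std::make_pair(off, n));
    }
    off += n;
  }
  return ranges;
}

int main(int argc, char** argv)
{
  if (argc < 2) {
    fprintf(stderr, "usage: %s <sparse-file>\n", argv[0]);
    return 1;
  }
  int fd = open(argv[1], O_RDONLY);
  if (fd < 0) {
    perror("open");
    return 1;
  }
  std::vector<std::pair<off_t, off_t> > ext = data_extents(fd);
  for (size_t i = 0; i < ext.size(); ++i)
    printf("data extent: offset=%lld len=%lld\n",
           (long long)ext[i].first, (long long)ext[i].second);
  close(fd);
  return 0;
}

nonzero_ranges() corresponds to the "check for fully-zero regions and turn
those into holes" option mentioned above; in the real recovery path the
extent map would come from the object store (sparse_read / fiemap /
SEEK_DATA) rather than a userspace scan.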