Re: [RFC 0/2] rbd: respect REQ_NOUNMAP by setting new nounmap flag for ZERO op

Roman Penyaev <rpenyaev@xxxxxxx> · Mon, 21 Jan 2019 15:36:51 +0100

On 2019-01-21 14:58, Jason Dillaman wrote:
On Mon, Jan 21, 2019 at 5:23 AM Roman Penyaev <rpenyaev@xxxxxxx> wrote:

Hi Ilya,

On 2019-01-18 17:29, Ilya Dryomov wrote:
> On Fri, Jan 18, 2019 at 3:56 PM Roman Penyaev <rpenyaev@xxxxxxx> wrote:
>>
>> Hi all,
>>
>> This is an attempt to split DISCARD and WRITE_ZEROES paths on krbd side
>> when REQ_NOUNMAP flag is set for a block layer request.
>
> Hi Roman,
>
> I'm working on splitting DISCARD and WRITE_ZEROES handling right now.
> The idea is to punt on small and/or unaligned discard requests which
> don't actually free up any space but translate into a RADOS zero op.
> I'm not changing how WRITE_ZEROES is implemented though, so this is
> orthogonal to your work -- just wanted to give a heads up.

Good to know, thanks for telling me.

>> Currently both REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES block layer requests
>> fall down to CEPH_OSD_OP_ZERO request, which punches holes on osd side.
>>
>> With a new CEPH_OSD_OP_FLAG_ZERO_NOUNMAP flag for CEPH_OSD_OP_ZERO request
>> osd can zero out blocks, instead of punching holes.
>
> REQ_NOUNMAP is just a hint, the block device is free to ignore it.
> IIRC the only way to control it from userspace is through fallocate(2):
> FALLOC_FL_PUNCH_HOLE can unmap, while FALLOC_FL_ZERO_RANGE is supposed
> to not unmap.  Given that fallocate(2) on block devices is fairly new,
> I'm curious if you have an application that actually cares in mind?

No, no.  This is an attempt to follow block layer semantics, nothing more.
Indeed, the users of REQ_NOUNMAP are ioctl() and fallocate(), so the only
practical value that comes to mind is performance (preallocate zeroed
blocks and format any fs, etc.) and possibly secure erase.  After some
internal discussions about the performance of writing zeroes (instead of
a true DISCARD), this does not seem to bring any value, at least on
bluestore, but secure wipe can make sense (for example using
blkdiscard --zeroout).
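
For reference, the userspace side of the distinction discussed above could
be sketched roughly as follows.  This is only an illustration, not part of
the patches: the helper names zero_range_mode() and zero_range() are made
up here, and whether a given block device honours the "do not unmap" hint
is up to the driver.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical helper: pick the fallocate(2) mode for zeroing a range.
 *   may_unmap == true  -> punch a hole, the device may deallocate blocks
 *   may_unmap == false -> zero the range, asking to keep it allocated
 */
int zero_range_mode(bool may_unmap)
{
    if (may_unmap)
        /* FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE. */
        return FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
    return FALLOC_FL_ZERO_RANGE;
}

/*
 * Hypothetical helper: zero [offset, offset + len) on an open descriptor.
 * Returns 0 on success, -1 with errno set on failure (for example
 * EOPNOTSUPP when the underlying file system or device lacks support).
 */
int zero_range(int fd, off_t offset, off_t len, bool may_unmap)
{
    assert(offset >= 0 && len > 0);
    return fallocate(fd, zero_range_mode(may_unmap), offset, len);
}
```

Calling zero_range(fd, 0, len, false) on an open block device is then the
case where the request is expected to arrive with REQ_NOUNMAP set.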

The zeroed writes would need to be smaller than the bluestore min
alloc size for that to work. Otherwise, bluestore will just allocate a
new blob extent, write zeroes to it, and pivot the object metadata to
point to the new allocation.

Exactly, that's what I've heard.  Thanks for clarifying.
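
To spell out the condition above, here is a deliberately simplified model.
It is not bluestore code: the function name is invented for illustration,
and it only encodes the rule as stated in this thread (a zero write
overwrites existing blocks only when it is smaller than the min alloc
size), ignoring alignment and everything else the real write path does.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified model of the behaviour described in this thread, not
 * bluestore code.  A zero write shorter than the minimum allocation
 * size is assumed to overwrite the existing blocks; anything else is
 * assumed to go to a newly allocated extent, with the object metadata
 * then pointing at the new allocation.
 */
bool zero_write_overwrites_in_place(uint64_t len, uint64_t min_alloc_size)
{
    return len > 0 && len < min_alloc_size;
}
```

With an example min alloc size of 64 KiB (the value is just an assumption
for the illustration), a 4 KiB zero write would overwrite in place under
this model, while a 64 KiB one would not.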

--
Roman