Re: [RFC 0/2] rbd: respect REQ_NOUNMAP by setting new nounmap flag for ZERO op

Jason Dillaman <jdillama@xxxxxxxxxx> · Mon, 21 Jan 2019 08:58:22 -0500

On Mon, Jan 21, 2019 at 5:23 AM Roman Penyaev <rpenyaev@xxxxxxx> wrote:
>
> Hi Ilya,
>
> On 2019-01-18 17:29, Ilya Dryomov wrote:
> > On Fri, Jan 18, 2019 at 3:56 PM Roman Penyaev <rpenyaev@xxxxxxx> wrote:
> >>
> >> Hi all,
> >>
> >> This is an attempt to split DISCARD and WRITE_ZEROES paths on krbd side
> >> when REQ_NOUNMAP flag is set for a block layer request.
> >
> > Hi Roman,
> >
> > I'm working on splitting DISCARD and WRITE_ZEROES handling right now.
> > The idea is to punt on small and/or unaligned discard requests which
> > don't actually free up any space but translate into a RADOS zero op.
> > I'm not changing how WRITE_ZEROES is implemented though, so this is
> > orthogonal to your work -- just wanted to give a heads up.
>
> Good to know, thanks for telling me.
>
> >> Currently both REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES block layer requests
> >> fall down to CEPH_OSD_OP_ZERO request, which punches holes on osd side.
> >>
> >> With a new CEPH_OSD_OP_FLAG_ZERO_NOUNMAP flag for CEPH_OSD_OP_ZERO request
> >> osd can zero out blocks, instead of punching holes.
> >
> > REQ_NOUNMAP is just a hint, the block device is free to ignore it.
> > IIRC the only way to control it from userspace is through fallocate(2):
> > FALLOC_FL_PUNCH_HOLE can unmap, while FALLOC_FL_ZERO_RANGE is supposed
> > to not unmap.  Given that fallocate(2) on block devices is fairly new,
> > do you have in mind an application that actually cares?
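
For reference, the two fallocate(2) modes mentioned above can be exercised
from userspace roughly as follows. This is only a minimal sketch:
zero_mode() and zero_range() are hypothetical helper names, and whether the
device actually unmaps or writes zeroes is still up to the driver, since
REQ_NOUNMAP is a hint.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical helper: pick the fallocate(2) mode for zeroing a range.
 * may_unmap == true  -> the range may be deallocated (hole punch)
 * may_unmap == false -> the range should stay allocated and read as zeroes
 */
int zero_mode(bool may_unmap)
{
    /* FALLOC_FL_PUNCH_HOLE is only valid together with FALLOC_FL_KEEP_SIZE. */
    if (may_unmap)
        return FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
    return FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE;
}

/*
 * Hypothetical helper: zero [off, off + len) on an already opened fd.
 * Returns 0 on success, -1 with errno set on failure (for example
 * EOPNOTSUPP if the file system or device does not support the mode).
 */
int zero_range(int fd, off_t off, off_t len, bool may_unmap)
{
    assert(len > 0);
    return fallocate(fd, zero_mode(may_unmap), off, len);
}
```

On a block device, off and len would also need to be aligned to the logical
block size.
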
>
> No, no.  This is an attempt to follow block layer semantics, nothing more.
> Indeed, the users of REQ_NOUNMAP are ioctl() and fallocate(), so the only
> practical value which comes to mind is performance (preallocate zeroed
> blocks and format any fs, etc) and possible secure-erase.  After some
> internal discussions about performance of writing zeroes (instead of
> true DISCARD) this does not seem to bring any value, at least on bluestore,
> but secure wipe can make sense (for example using blkdiscard --zeroout).

The zeroed writes would need to be smaller than the bluestore min
alloc size for that to work. Otherwise, bluestore will just allocate a
new blob extent, write zeroes to it, and pivot the object metadata to
point to the new allocation.
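
To make the condition concrete, here is a toy model of the size check
described above. It is not BlueStore code: zero_write_in_place() is a
hypothetical name, and the only rule it encodes is the one stated in this
mail (a zero write strictly smaller than min alloc size may land on the
existing allocation; anything else goes to a freshly allocated extent).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model, not BlueStore code.
 * Returns true if a zero write of `len` bytes is smaller than the
 * allocation unit and so could overwrite the existing blocks; returns
 * false if it would instead be redirected to a new allocation, leaving
 * the old data untouched on disk.
 */
bool zero_write_in_place(uint64_t len, uint64_t min_alloc_size)
{
    assert(min_alloc_size > 0);
    return len < min_alloc_size;
}
```

With an illustrative min alloc size of 64 KiB (an assumed value for the
example), a 4 KiB zero write passes the check, while a 64 KiB or 1 MiB one
does not, so only the former would actually wipe the old blocks.
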

> --
> Roman
>

-- 
Jason