Re: [PATCH 0/4] block: Add support for REQ_OP_ASSIGN_RANGE

"Martin K. Petersen" <martin.petersen@xxxxxxxxxx> · Thu, 02 Apr 2020 21:34:43 -0400

Hi Dave!

> Ok, so ext4 has a very limited max allocation size for an extent, so
> I expect this won't cause huge latency problems. However, what
> happens when we use XFS, have a 64kB block size, and fallocate() is
> allocating disk space in contiguous 100GB extents and passing those
> down to the block device?

Depends on the device.

> How does this get split by dm devices? Are raid stripes going to dice
> this into separate stripe unit sized bios, so instead of single large
> requests we end up with hundreds or thousands of tiny allocation
> requests being issued?

There is nothing special about this operation. It needs to be handled
the same way as all other splits. I.e. ideally coalesced at the bottom
of the stack so we can issue larger, contiguous commands to the
hardware.
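
To illustrate what "coalesced at the bottom of the stack" means here, a
minimal userspace sketch follows. It is not block layer code: struct
lba_range, coalesce_ranges() and the max_len cap are hypothetical names
made up for this example, and it assumes the split pieces arrive sorted
by start LBA and do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one split piece of a larger request. */
struct lba_range {
	uint64_t start;	/* first LBA of the piece */
	uint64_t len;	/* number of blocks in the piece */
};

/*
 * Merge back-to-back ranges in place and return the new count.
 * Assumes r[] is sorted by start and non-overlapping. max_len caps the
 * size of a merged range (0 means no cap), standing in for whatever
 * per-command limit the hardware has.
 */
static size_t coalesce_ranges(struct lba_range *r, size_t n, uint64_t max_len)
{
	size_t out = 0;

	assert(n == 0 || r != NULL);
	if (n == 0)
		return 0;

	for (size_t i = 1; i < n; i++) {
		int contiguous = r[out].start + r[out].len == r[i].start;
		int fits = max_len == 0 || r[out].len + r[i].len <= max_len;

		if (contiguous && fits) {
			/* Extend the current range instead of issuing a new one. */
			r[out].len += r[i].len;
		} else {
			/* Gap or size limit reached: start a new range. */
			out++;
			r[out] = r[i];
		}
	}
	return out + 1;
}
```

With three stripe-unit sized pieces at LBA 0, 128 and 256 plus one at
LBA 1024, this yields two ranges when uncapped, i.e. two commands
instead of four.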

> How are we expecting hardware to behave here? Is this a queued
> command in the scsi/nvme/sata protocols? Or is this, for the moment,
> just a special snowflake that we can't actually use in production
> because the hardware just can't handle what we throw at it?

For now it's SCSI only, and it is a queued command. It is only found in
high-end thinly provisioned storage arrays, not in your average SSD.

The performance expectation for REQ_OP_ALLOCATE is that it is faster
than a write to the same block range since the device potentially needs
to do less work. I.e. the device simply needs to decrement the free
space and mark the LBAs reserved in a map. It doesn't need to write all
the blocks to zero them. If you want zeroed blocks, use
REQ_OP_WRITE_ZEROES.

> IOWs, what sort of latency issues is this operation going to cause
> on real hardware? Is this going to be like discard? i.e. where we
> end up not using it at all because so few devices actually handle
> the massive stream of operations the filesystem will end up sending
> the device(s) in the course of normal operations?

The intended use case, from a SCSI perspective, is that on a thinly
provisioned device you can use this operation to preallocate blocks so
that future writes to the LBAs in question will not fail due to the
device being out of space. I.e. you would use this to pin down block
ranges where you cannot tolerate write failures. The advantage over
writing the blocks individually is that dedup won't apply and that the
device doesn't actually have to go write all the individual blocks.
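
A toy model of the behaviour described above, in case it helps. This is
a sketch under stated assumptions, not how any particular array works:
struct thin_dev and the thin_*() helpers are hypothetical, the device is
a 64-LBA map backed by a smaller pool, and data transfer is left out
entirely. It only shows the bookkeeping: allocation marks LBAs reserved
and decrements free space, and later writes to reserved LBAs need no
new space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define THIN_LBAS 64u	/* size of the advertised (virtual) LBA space */

/* Hypothetical thinly provisioned device: a map plus a free counter. */
struct thin_dev {
	uint8_t  mapped[THIN_LBAS];	/* 1 = backing space reserved */
	uint32_t free_blocks;		/* physical blocks left in the pool */
};

static void thin_init(struct thin_dev *d, uint32_t physical_blocks)
{
	memset(d->mapped, 0, sizeof(d->mapped));
	d->free_blocks = physical_blocks;
}

/*
 * Allocate: pure bookkeeping, nothing is written. All-or-nothing, so a
 * failure leaves the map untouched.
 */
static bool thin_allocate(struct thin_dev *d, uint32_t lba, uint32_t n)
{
	uint32_t need = 0;

	if (lba > THIN_LBAS || n > THIN_LBAS - lba)
		return false;	/* range outside the device */

	for (uint32_t i = lba; i < lba + n; i++)
		need += !d->mapped[i];
	if (need > d->free_blocks)
		return false;	/* pool exhausted */

	for (uint32_t i = lba; i < lba + n; i++)
		d->mapped[i] = 1;
	d->free_blocks -= need;
	return true;
}

/*
 * Write: needs backing space for every LBA that is not yet reserved.
 * The space accounting is the same as for allocate; a real device would
 * also have to transfer and store the data, which this model omits.
 */
static bool thin_write(struct thin_dev *d, uint32_t lba, uint32_t n)
{
	return thin_allocate(d, lba, n);
}
```

With an 8-block pool, allocating LBAs 0-3 and then filling the rest of
the pool with writes elsewhere leaves new writes failing for lack of
space, while writes to the pinned LBAs 0-3 still succeed.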

-- 
Martin K. Petersen	Oracle Linux Engineering