Re: [PATCHv10 0/9] write hints with nvme fdp, scsi streams

Bart Van Assche <bvanassche@xxxxxxx> · Wed, 27 Nov 2024 10:42:34 -0800

On 11/26/24 6:54 PM, Martin K. Petersen wrote:
> Bart wrote:
>> There are some strong arguments in this thread from May 2024 in favor of
>> representing the entire copy operation as a single REQ_OP_ operation:
>> https://lore.kernel.org/linux-block/20240520102033.9361-1-nj.shetty@xxxxxxxxxxx/

> As has been discussed many times, a copy operation is semantically a
> read operation followed by a write operation. And, based on my
> experience implementing support for both types of copy offload in Linux,
> what made things elegant was treating the operation as a read followed
> by a write throughout the stack. Exactly like the token-based offload
> specification describes.

Submitting a copy operation as two bios or two requests means that there 
is a risk that one of the two operations never reaches the block driver
at the bottom of the storage stack and hence that a deadlock occurs. I
prefer not to introduce any mechanisms that can cause a deadlock.
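
To illustrate the concern, here is a toy model in Python (not kernel
code). It assumes a hypothetical pairing table in which whichever half of
a copy arrives first is parked until its counterpart shows up. The names
PairingTable, submit_half and run are made up for this sketch.

```python
# Toy model, not kernel code: a copy submitted as two halves completes
# only once both halves reach the bottom of the stack.

class PairingTable:
    """Hypothetical table that pairs the read and write halves of a copy."""

    def __init__(self):
        self.parked = {}     # copy_id -> half that arrived first
        self.completed = []  # copy_ids for which both halves arrived

    def submit_half(self, copy_id, half):
        waiting = self.parked.get(copy_id)
        if waiting is not None and waiting != half:
            # The counterpart is already parked: the copy can proceed.
            del self.parked[copy_id]
            self.completed.append(copy_id)
        else:
            # First half to arrive: park it until the other one shows up.
            self.parked[copy_id] = half


def run(arrivals):
    table = PairingTable()
    for copy_id, half in arrivals:
        table.submit_half(copy_id, half)
    return table.completed, sorted(table.parked)


# Both halves arrive: copy 1 completes and nothing stays parked.
both_arrive = run([(1, "read"), (1, "write")])

# The write half of copy 2 never arrives: the read half stays parked forever.
write_lost = run([(2, "read")])
```

With a single operation per copy there is no parked state, so this
failure mode cannot occur in the model.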

As one can see here, Damien Le Moal and Keith Busch both prefer to
submit copy operations as a single operation: Keith Busch, Re: [PATCH
v20 02/12] Add infrastructure for copy offload in block and request
layer, linux-block mailing list, 2024-06-24 
(https://lore.kernel.org/all/Znn6C-C73Tps3WJk@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/).

>> Token-based copy offloading (called ODX by Microsoft) could be
>> implemented by maintaining a state machine in the SCSI sd driver

> I suspect the SCSI maintainer would object strongly to the idea of
> maintaining cross-device copy offload state and associated object
> lifetime issues in the sd driver.

Such information wouldn't have to be maintained inside the sd driver. A
new kernel module could be introduced that tracks the state of copy
operations and that interacts with the sd driver.
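
A rough sketch of the per-copy state such a module might track, written
in Python for brevity rather than as kernel code. The Phase names and the
next_phase/trace helpers are hypothetical. The command sequence (POPULATE
TOKEN, RECEIVE ROD TOKEN INFORMATION, WRITE USING TOKEN) is the
token-based flow; the sketch only models the IMMED bit of WRITE USING
TOKEN and assumes that the first poll already reports completion.

```python
from enum import Enum, auto


class Phase(Enum):
    POPULATE_TOKEN = auto()     # ask the source device to create a token
    RECEIVE_TOKEN = auto()      # RECEIVE ROD TOKEN INFORMATION: fetch the token
    WRITE_USING_TOKEN = auto()  # hand the token to the destination device
    POLL_WRITE = auto()         # only with IMMED=1: poll for completion
    DONE = auto()


def next_phase(phase, immed, write_finished=True):
    """Return the phase that follows `phase` for one copy operation."""
    if phase is Phase.POPULATE_TOKEN:
        return Phase.RECEIVE_TOKEN
    if phase is Phase.RECEIVE_TOKEN:
        return Phase.WRITE_USING_TOKEN
    if phase is Phase.WRITE_USING_TOKEN:
        return Phase.POLL_WRITE if immed else Phase.DONE
    if phase is Phase.POLL_WRITE:
        return Phase.DONE if write_finished else Phase.POLL_WRITE
    raise ValueError("copy already finished")


def trace(immed):
    """List the phases one copy goes through before it is done."""
    phase, seen = Phase.POPULATE_TOKEN, []
    while phase is not Phase.DONE:
        seen.append(phase.name)
        phase = next_phase(phase, immed)
    return seen


phases_immed0 = trace(immed=False)
phases_immed1 = trace(immed=True)
```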

>> I'm assuming that the IMMED bit will be set to zero in the WRITE USING
>> TOKEN command. Otherwise one or more additional RECEIVE ROD TOKEN
>> INFORMATION commands would be required to poll for the WRITE USING TOKEN
>> completion status.

> What would be the benefit of making WRITE USING TOKEN a background
> operation? That seems like a completely unnecessary complication.

If a single copy operation takes significantly more time than the time
required to switch between power states, power can be saved by using
IMMED=1. Mechanisms like run-time power management (RPM) or the UFS host
controller auto-hibernation mechanism can only be activated if no
commands are in progress. With IMMED=0, the link between the host and
the storage device will remain powered as long as the copy operation is
in progress. With IMMED=1, the link between the host and the storage
device can be powered down after the copy operation has been submitted
until the host decides to check whether or not the copy operation has
completed.
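
A back-of-the-envelope model of this trade-off, in Python. All numbers
are invented for illustration and the function name is hypothetical. The
simplifying assumptions: with IMMED=0 the link is powered for the whole
copy; with IMMED=1 it is powered only while the submission and each poll
command are in flight, and every power-down/power-up cycle before a poll
is charged as extra powered time.

```python
def link_powered_ms(copy_ms, immed, cmd_ms, switch_ms, polls=1):
    """Estimate how long the host-device link is powered for one copy.

    copy_ms:   duration of the copy inside the device
    cmd_ms:    time the link is busy per command (submission or poll)
    switch_ms: cost of one power-down/power-up cycle (assumed)
    polls:     number of status polls issued with IMMED=1
    """
    if not immed:
        # A command is outstanding for the entire copy.
        return copy_ms
    # One submission plus `polls` polls, each poll preceded by a power cycle.
    return cmd_ms * (1 + polls) + switch_ms * polls


# Long copy: IMMED=1 keeps the link powered far less.
long_immed0 = link_powered_ms(2000, immed=False, cmd_ms=1, switch_ms=10)
long_immed1 = link_powered_ms(2000, immed=True, cmd_ms=1, switch_ms=10)

# Short copy: the power-state switch costs more than it saves.
short_immed0 = link_powered_ms(5, immed=False, cmd_ms=1, switch_ms=10)
short_immed1 = link_powered_ms(5, immed=True, cmd_ms=1, switch_ms=10)
```

In this model IMMED=1 only pays off when the copy takes much longer than
the power-state switch, which is the condition stated above.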

>> I guess that the block layer maintainer wouldn't be happy if all block
>> drivers would have to deal with three or four phases for copy
>> offloading just because ODX is this complicated.

> Last I looked, EXTENDED COPY consumed something like 70 pages in the
> spec. Token-based copy is trivially simple and elegant by comparison.

I don't know of any storage device vendor who has implemented all
EXTENDED COPY features that have been standardized. Assuming that 50
lines of code fit on a single page, here is an example of an EXTENDED
COPY implementation that can be printed on 21 pages of paper:
$ wc -l drivers/target/target_core_xcopy.c
 1041
$ echo $(((1041 + 49) / 50))
21

The advantages of EXTENDED COPY over ODX are as follows:
- EXTENDED COPY is a single SCSI command and hence better suited for
  devices with a limited queue depth. While the UFS 3.0 standard
  restricts the queue depth to 32, most UFS 4.0 devices support a
  queue depth of 64.
- The latency of setting up a copy command with EXTENDED COPY is
  lower since only a single command has to be sent to the device.
  ODX requires three round-trips to the device (assuming IMMED=0).
- EXTENDED COPY requires less memory in storage devices. Each ODX
  token occupies some memory and the rules around token lifetimes
  are nontrivial.
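
The round-trip count from the second item can be written down as a small
Python sketch. The per-command round-trip time is an invented parameter,
the helper names are hypothetical, and IMMED=0 is assumed so that no
polling commands are needed.

```python
# Commands the host has to send for one copy, per offload method (IMMED=0).
COMMANDS = {
    "EXTENDED COPY": ["EXTENDED COPY"],
    "ODX": [
        "POPULATE TOKEN",
        "RECEIVE ROD TOKEN INFORMATION",
        "WRITE USING TOKEN",
    ],
}


def round_trips(method):
    """Number of host-device round-trips needed for one copy."""
    return len(COMMANDS[method])


def command_latency_us(method, round_trip_us):
    """Latency spent on command round-trips alone, ignoring data movement."""
    return round_trips(method) * round_trip_us


xcopy_us = command_latency_us("EXTENDED COPY", round_trip_us=100)
odx_us = command_latency_us("ODX", round_trip_us=100)
```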

Thanks,

Bart.