On 2021-11-04 2:46 a.m., Chaitanya Kulkarni wrote:
From: Chaitanya Kulkarni <kch@xxxxxxxxxx>
Hi,
One of the responsibilities of the Operating System, along with managing
resources, is to provide a unified interface to the user by creating
hardware abstractions. In the Linux kernel storage stack that
abstraction is created by implementing generic request operations
such as REQ_OP_READ/REQ_OP_WRITE or REQ_OP_DISCARD/REQ_OP_WRITE_ZEROES,
etc., which are mapped to the specific low-level hardware protocol
commands, e.g. SCSI or NVMe.
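As an illustration, the low-level driver's command setup path
dispatches on the generic operation and translates it into a protocol
opcode. The following is a simplified, hypothetical sketch loosely
modeled on the NVMe driver (helper names mirror
drivers/nvme/host/core.c; most details are elided) :-

/*
 * Illustrative only -- not the actual NVMe driver code. The generic
 * block layer operation is translated into a protocol opcode here.
 */
static blk_status_t setup_cmd_sketch(struct nvme_ns *ns,
                                     struct request *req,
                                     struct nvme_command *cmd)
{
        switch (req_op(req)) {
        case REQ_OP_READ:
                return nvme_setup_rw(ns, req, cmd, nvme_cmd_read);
        case REQ_OP_WRITE:
                return nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
        case REQ_OP_DISCARD:
                return nvme_setup_discard(ns, req, cmd);
        case REQ_OP_WRITE_ZEROES:
                return nvme_setup_write_zeroes(ns, req, cmd);
        default:
                return BLK_STS_NOTSUPP;
        }
}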
With that in mind, this RFC patch-series implements a new block layer
operation to offload data verification onto the controller if
supported, or emulate the operation if not. The main advantage is to
free up the CPU and reduce host link traffic: for some devices the
internal bandwidth is higher than the host link bandwidth, so
offloading this operation can improve the performance of proactive
error detection applications such as file system level scrubbing.
* Background *
-----------------------------------------------------------------------
The NVMe specification provides a controller level Verify command [1]
which is similar to the ATA Verify command [2], where the controller
is responsible for data verification without transferring the data to
the host (offloading LBA verification). This is designed to
proactively discover data corruption while the device is free, so that
applications can protect sensitive data and take corrective action
instead of waiting for a failure to occur.
The NVMe Verify command was added to provide low-level media
scrubbing and possibly move data to the right place when it shows
correctable media degradation. It also provides a way to enhance
file-system level scrubbing/checksum verification and optionally
offload this CPU-intensive task to the kernel (when emulated), over
the fabric, and to the controller (when supported).
This is useful when the controller's internal bandwidth is higher than
the host's bandwidth, giving a sharp increase in performance due to
_no host traffic or host CPU involvement_.
* Implementation *
-----------------------------------------------------------------------
Right now there is no generic interface that in-kernel components
such as file systems or userspace applications can use to issue a
verify command through a central block layer API (short of passthru
commands or some combination of write/read/compare). This leads to
each userspace application carrying a protocol-specific IOCTL, which
defeats the purpose of having the OS provide a H/W abstraction; see
the passthru sketch below for what that looks like today.
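To make that concrete, here is a minimal userspace sketch that issues
an NVMe Verify (opcode 0x0c) through the NVMe passthru ioctl; the
device path and LBA range are placeholders, error handling is trimmed,
and CAP_SYS_ADMIN is required :-

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
        struct nvme_passthru_cmd cmd;
        unsigned long long slba = 0;     /* starting LBA to verify */
        unsigned short nlb = 7;          /* NLB is 0-based: 8 blocks */
        int fd, nsid, ret;

        fd = open("/dev/nvme0n1", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        nsid = ioctl(fd, NVME_IOCTL_ID); /* namespace ID of this device */
        if (nsid < 0) { perror("NVME_IOCTL_ID"); return 1; }

        memset(&cmd, 0, sizeof(cmd));
        cmd.opcode = 0x0c;               /* NVMe Verify, payloadless */
        cmd.nsid   = nsid;
        cmd.cdw10  = slba & 0xffffffff;  /* SLBA, lower 32 bits */
        cmd.cdw11  = slba >> 32;         /* SLBA, upper 32 bits */
        cmd.cdw12  = nlb;                /* NLB in bits 15:0 */

        ret = ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
        printf("verify returned %d\n", ret);
        close(fd);
        return ret ? 1 : 0;
}

Doing the same against a SCSI disk requires a completely different
SG_IO/VERIFY code path, which is exactly the duplication a generic
block layer operation avoids.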
This patch series introduces a new payloadless block layer request
operation, REQ_OP_VERIFY, that allows in-kernel components &
userspace applications to verify a range of LBAs by offloading the
checksum scrubbing/verification to the controller. For direct
attached devices this reduces host DMA traffic and CPU usage; for
fabrics attached devices it also reduces network traffic and CPU
usage on both the host & target.
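As a rough sketch of how an in-kernel caller might issue the new
operation (the helper name blkdev_issue_verify() follows the pattern
of blkdev_issue_discard() and is illustrative; the actual interface in
this series may differ) :-

/* Payloadless: no data pages are attached to the bio; the LBA range
 * travels in the iterator and the low-level driver (or the block
 * layer emulation) does the rest.
 */
int blkdev_issue_verify(struct block_device *bdev, sector_t sector,
                        sector_t nr_sects, gfp_t gfp_mask)
{
        struct bio *bio;
        int ret;

        bio = bio_alloc(gfp_mask, 0);
        if (!bio)
                return -ENOMEM;

        bio_set_dev(bio, bdev);
        bio->bi_iter.bi_sector = sector;
        bio->bi_iter.bi_size = nr_sects << SECTOR_SHIFT;
        bio->bi_opf = REQ_OP_VERIFY;

        ret = submit_bio_wait(bio);      /* synchronous for simplicity */
        bio_put(bio);
        return ret;
}

A real helper would presumably also split the range according to
whatever per-queue verify limits the series adds, much as discard is
split against the queue's discard limits.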
* Scope *
-----------------------------------------------------------------------
Please note this series only covers the operating system level
overhead. Analyzing controller Verify command performance for common
protocols (SCSI/NVMe) is out of scope for REQ_OP_VERIFY.
* Micro Benchmarks *
-----------------------------------------------------------------------
When verifying 500GB of data on NVMeOF with nvme-loop and null_blk as
a target backend block device, the results show roughly an 80%
reduction in elapsed time (about a 6x speedup) :-
With Verify resulting in REQ_OP_VERIFY to null_blk :-
real 2m3.773s
user 0m0.000s
sys 0m59.553s
With Emulation resulting in REQ_OP_READ to null_blk :-
real 12m18.964s
user 0m0.002s
sys 1m15.666s
A detailed test log is included at the end of the cover letter.
Each of the following was tested:
1. Direct Attached REQ_OP_VERIFY.
2. Fabrics Attached REQ_OP_VERIFY.
3. Multi-device (md) REQ_OP_VERIFY.
* The complete picture *
-----------------------------------------------------------------------
For completeness the whole kernel stack support is divided into
two phases :-
Phase I :-
Add and stabilize support in the block layer & low-level drivers
such as SCSI, NVMe, MD, and NVMeOF, implement the necessary emulation
in the block layer where needed, and provide block level tools such
as _blkverify_. Also, add appropriate testcases for code coverage.
Phase II :-
Add and stabilize support in upper layer kernel components such
as file-systems and provide userspace tools such as _fsverify_ to
route requests from file systems through the block layer to the
low-level device drivers.
Please note that the blk-lib.c interfaces for the REQ_OP_VERIFY
emulation, which I put together for the scope of this RFC, will
change in the future.
Any comments are welcome.
Hi,
You may also want to consider higher level support for the NVME COMPARE
and SCSI VERIFY(BYTCHK=1) commands. Since PCIe and SAS transports are
full duplex, replacing two READs (plus a memcmp in host memory) with
one READ and one COMPARE may be a win on a bandwidth constrained
system. It is safe to assume that the data-in transfers on a storage
transport exceed (probably by a significant margin) the data-out
transfers. An
offloaded COMPARE switches one of those data-in transfers to a data-out
transfer, so it should improve the bandwidth utilization.
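Continuing the passthru sketch from earlier in this thread (same
headers plus <stdint.h> for uintptr_t), the COMPARE half of such a
READ + COMPARE scheme might look like :-

/* Sketch: the controller compares `nblocks' LBAs starting at `slba'
 * against the host-supplied buffer and fails the command with a
 * miscompare status if the media differs. Note the buffer moves
 * host->controller, i.e. a data-out transfer.
 */
static int nvme_compare_range(int fd, unsigned int nsid, void *buf,
                              unsigned int lba_size,
                              unsigned long long slba,
                              unsigned int nblocks)
{
        struct nvme_passthru_cmd cmd;

        memset(&cmd, 0, sizeof(cmd));
        cmd.opcode   = 0x05;                /* NVMe Compare */
        cmd.nsid     = nsid;
        cmd.addr     = (unsigned long long)(uintptr_t)buf;
        cmd.data_len = nblocks * lba_size;  /* expected data, data-out */
        cmd.cdw10    = slba & 0xffffffff;   /* SLBA, lower 32 bits */
        cmd.cdw11    = slba >> 32;          /* SLBA, upper 32 bits */
        cmd.cdw12    = nblocks - 1;         /* NLB is 0-based */

        return ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
}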
I did some brief benchmarking of an NVMe SSD's COMPARE command (it's
optional) and the results were underwhelming. OTOH using my own dd
variants (which can do compare instead of copy) and a scsi_debug
target (i.e. RAM) I have seen compare rates of > 15 GB/s while a copy
rarely exceeds 9 GB/s.
BTW, the SCSI VERIFY(BYTCHK=3) command compares one block sent from
the host with a sequence of logical blocks on the media. So, for
example, it would be a quick way of checking that a sequence of
blocks contains zeroed data; a sketch follows below.
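For reference, a minimal SG_IO sketch of that zero-check use case
(VERIFY(10), opcode 0x2f, BYTCHK=11b in bits 2:1 of CDB byte 1; a
512-byte logical block is assumed and error handling is trimmed) :-

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(void)
{
        unsigned char cdb[10], sense[32], block[512];
        unsigned int lba = 0, vlen = 128;    /* check 128 blocks */
        struct sg_io_hdr io;
        int fd, ret;

        fd = open("/dev/sg0", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        memset(block, 0, sizeof(block));     /* expect zeroed blocks */
        memset(cdb, 0, sizeof(cdb));
        cdb[0] = 0x2f;                       /* VERIFY(10) */
        cdb[1] = 0x3 << 1;                   /* BYTCHK=11b */
        cdb[2] = lba >> 24; cdb[3] = lba >> 16;  /* LBA, big-endian */
        cdb[4] = lba >> 8;  cdb[5] = lba;
        cdb[7] = vlen >> 8; cdb[8] = vlen;   /* verification length */

        memset(&io, 0, sizeof(io));
        io.interface_id = 'S';
        io.cmd_len = sizeof(cdb);
        io.cmdp = cdb;
        io.dxfer_direction = SG_DXFER_TO_DEV;   /* one block, data-out */
        io.dxfer_len = sizeof(block);
        io.dxferp = block;
        io.mx_sb_len = sizeof(sense);
        io.sbp = sense;
        io.timeout = 60000;                  /* milliseconds */

        ret = ioctl(fd, SG_IO, &io);
        printf("SG_IO ret %d, SCSI status 0x%x\n", ret, io.status);
        close(fd);
        return 0;
}

A mismatch comes back as a CHECK CONDITION with a MISCOMPARE sense key
rather than as an ioctl failure.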
Doug Gilbert