Re: [Lsf-pc] [LSF/MM/BPF ATTEND][LSF/MM/BPF TOPIC] Meta/Integrity/PI improvements

Martin, Kanchan,
> 
> Kanchan,
> 
> > - Generic user interface that user-space can use to exchange meta.
> >   A new io_uring opcode IORING_OP_READ/WRITE_META - seems feasible
> >   for direct IO.
> 
> Yep. I'm interested in this too. Reviving this effort is near the top
> of my todo list so I'm happy to collaborate.
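
Just to make the user-space side concrete, a submission through such an
opcode might look roughly like the sketch below. This is not a settled
ABI: the IORING_OP_READ_META opcode value and the use of sqe->addr3 to
carry the metadata buffer are placeholders/assumptions on my part.

/*
 * Hypothetical sketch only; the opcode number and the metadata field
 * are stand-ins for whatever the proposed interface ends up being.
 */
#include <liburing.h>
#include <stdint.h>

#define IORING_OP_READ_META  0xfe   /* placeholder opcode value */

static int read_with_meta(int fd, void *data, unsigned int data_len,
                          void *meta, uint64_t off)
{
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        int ret;

        ret = io_uring_queue_init(8, &ring, 0);
        if (ret)
                return ret;

        sqe = io_uring_get_sqe(&ring);
        /* The data buffer is described like a plain read... */
        io_uring_prep_rw(IORING_OP_READ_META, sqe, fd, data, data_len, off);
        /* ...and the PI/metadata buffer travels in a spare SQE field. */
        sqe->addr3 = (uint64_t)(uintptr_t)meta;

        io_uring_submit(&ring);
        ret = io_uring_wait_cqe(&ring, &cqe);
        if (!ret) {
                ret = cqe->res;
                io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        return ret;
}
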
If we are going to have an interface to exchange meta/integrity with
user-space, could we also have an in-kernel interface to do the same?

It would be useful for some network filesystem/block device drivers
like nbd/drbd/NVMe-oF to use blk-integrity as a network checksum, and
have the same checksum cover the I/O on the server as well.

The integrity can be generated on the client and sent over the network;
on the server, blk-integrity can simply offload it to storage.
Verification follows the same principle: on the server, blk-integrity
gets the PI from storage using the interface and sends it over the
network, and on the client we can do the usual verify.
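
As a rough illustration of the kind of in-kernel hook I mean, on the
write path the server could attach the PI it received from the client
with the existing bio_integrity helpers. The function below is only a
sketch; the surrounding driver (an nbd/drbd-style server) and the way
the PI buffer arrives are assumptions.

/*
 * Sketch only: a hypothetical network block server attaching PI that
 * arrived from the client to the bio it is about to submit, so the
 * storage underneath simply offloads it.
 */
#include <linux/bio.h>
#include <linux/blk-integrity.h>
#include <linux/err.h>
#include <linux/mm.h>

static int attach_client_pi(struct bio *bio, void *pi_buf,
                            unsigned int pi_len)
{
        struct bio_integrity_payload *bip;

        /* One integrity vec suffices for a physically contiguous buffer. */
        bip = bio_integrity_alloc(bio, GFP_NOIO, 1);
        if (IS_ERR(bip))
                return PTR_ERR(bip);

        /* pi_buf holds the PI tuples exactly as received over the wire. */
        if (bio_integrity_add_page(bio, virt_to_page(pi_buf), pi_len,
                                   offset_in_page(pi_buf)) != pi_len)
                return -ENOMEM;

        return 0;
}

The read/verify direction needs the reverse hook: pulling the PI buffer
off the completed bio on the server and shipping it back to the client,
which is the piece that has no convenient interface today.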

We tried to achieve this in the past: there was a patch adding optional
generate/verify functions that took priority over the ones from the
integrity profile and performed the meta/PI exchange, but it didn't get
traction. It would be much better if we could have a bio interface for
this.

Cheers
Dongyang
> 
> > NVMe SSD can do the offload when the host sends the PRACT bit. But
> > in the driver, this is tied to global integrity disablement using
> > CONFIG_BLK_DEV_INTEGRITY.
> 
> > So, the idea is to introduce a bio flag REQ_INTEGRITY_OFFLOAD
> > that the filesystem can send. The block-integrity and NVMe driver
> > do the rest to make the offload work.
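
(For illustration, the filesystem side of that proposal would
presumably be little more than tagging the bio as below.
REQ_INTEGRITY_OFFLOAD does not exist upstream; the flag and the
placeholder bit value are assumptions about the proposal.)

#include <linux/bio.h>
#include <linux/blk_types.h>

/* Placeholder; the real flag would get a bit in enum req_flag_bits. */
#define REQ_INTEGRITY_OFFLOAD  ((__force blk_opf_t)(1U << 31))

static void fs_submit_offloaded_write(struct bio *bio)
{
        /*
         * No integrity payload is attached; the flag asks blk-integrity
         * and the NVMe driver to set PRACT so the controller generates
         * and strips the PI itself.
         */
        bio->bi_opf |= REQ_INTEGRITY_OFFLOAD;
        submit_bio(bio);
}
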
> 
> Whether to have a block device do this is currently controlled by the
> /sys/block/foo/integrity/{read_verify,write_generate} knobs. At least
> for SCSI, protected transfers are always enabled between HBA and
> target if both support it. If no integrity has been attached to an
> I/O by the application/filesystem, the block layer will do so
> controlled by the sysfs knobs above. IOW, if the hardware is capable,
> protected transfers should always be enabled, at least from the block
> layer down.
> 
> It's possible that things don't work quite that way with NVMe since,
> at least for PCIe, the drive is both initiator and target. And NVMe
> also missed quite a few DIX details in its PI implementation. It's
> been a while since I messed with PI on NVMe, I'll have a look.
> 
> But in any case the intent for the Linux code was for protected
> transfers to be enabled automatically when possible. If the block
> layer protection is explicitly disabled, a filesystem can still
> trigger protected transfers via the bip flags. So that capability
> should definitely be exposed via io_uring.
> 
> > "Work is in progress to implement support for the data integrity
> > extensions in btrfs, enabling the filesystem to use the application
> > tag."
> 
> This didn't go anywhere for a couple of reasons:
> 
>  - Individual disk drives supported ATO but every storage array we
>    worked with used the app tag space internally. And thus there were
>    very few real-life situations where it would be possible to store
>    additional information in each block.
> 
>    Back in the mid-2000s, putting enterprise data on individual disk
>    drives was not considered acceptable. So implementing filesystem
>    support that would only be usable on individual disk drives didn't
>    seem worth the investment. Especially when the PI-for-ATA efforts
>    were abandoned.
> 
>    Wrt. the app tag ownership situation in SCSI, the storage tag in
>    the NVMe spec is a remedy for this, allowing the application to
>    own part of the extra tag space and the storage device itself
>    another.
> 
>  - Our proposed use case for the app tag was to provide filesystems
>    with back pointers without having to change the on-disk format.
> 
>    The use of 0xFFFF as escape check in PI meant that the caller had
>    to be very careful about what to store in the app tag. Our
>    prototype attached structs of metadata to each filesystem block
>    (8 512-byte sectors * 2 bytes of PI, so 16 bytes of metadata per
>    filesystem block). But none of those 2-byte blobs could contain
>    the value 0xFFFF. Wasn't really a great interface for filesystems
>    that wanted to be able to attach whatever data structure was
>    important to them.
> 
> So between a very limited selection of hardware actually providing
> the app tag space and a clunky interface for filesystems, the app tag
> just never really took off. We ended up modifying it to be an access
> control instead, see the app tag control mode page in SCSI.
> 
> Databases and many filesystems have means to protect blocks or
> extents. And these means are often better at identifying the nature
> of read-time problems than a CRC over each 512-byte LBA would be. So
> what made PI interesting was the ability to catch problems at write
> time in case of a bad partition remap, wrong buffer pointer,
> misordered blocks, etc. Once the data is on media, the drive ECC is
> superior. And again, at read time the database or application is
> often better equipped to identify corruption than PI.
> 
> And consequently our interest focused on treating PI as something
> more akin to a network checksum than a facility to protect data at
> rest on media.
> 




