On 2/23/2024 1:38 AM, Keith Busch wrote:
> On Fri, Feb 23, 2024 at 01:03:01AM +0530, Kanchan Joshi wrote:
>> With respect to the current state of Meta/Block-integrity, there are
>> some missing pieces.
>> I can improve some of it. But not sure if I am up to speed on the
>> history behind the status quo.
>>
>> Hence, this proposal to discuss the pieces.
>>
>> Maybe people would like to discuss other points too, but I have the
>> following:
>>
>> - Generic user interface that user-space can use to exchange meta. A
>> new io_uring opcode IORING_OP_READ/WRITE_META - seems feasible for
>> direct IO. Buffered IO seems non-trivial as a relatively smaller meta
>> needs to be written into/read from the page cache. The related
>> metadata must also be written during the writeback (of data).
>>
>>
>> - Is there interest in filesystem leveraging the integrity capabilities
>> that almost every enterprise SSD has.
>> Filesystems lacking checksumming abilities can still ask the SSD to do
>> it and be more robust.
>> And for BTRFS - there may be value in offloading the checksum to SSD.
>> Either to save the host CPU or to get more usable space (by not
>> writing the checksum tree). The mount option 'nodatasum' can turn off
>> the data checksumming, but more needs to be done to make the offload
>> work.
>
> As I understand it, btrfs's checksums are on a variable extent size, but
> offloading it to the SSD would do it per block, so it's forcing a new
> on-disk format. It would be cool to use it, though: you could atomically
> update data and checksums without stable pages.
>

Yes, the extents are variable, but btrfs computes the checksum per FS
block (4K-64K, practically 4K) within each extent.
An on-disk format change will not be needed, because in this approach
the FS (and block-integrity) does not really deal with checksums; it
only asks the device to compute/verify them.

Am I missing your point?

>> NVMe SSD can do the offload when the host sends the PRACT bit. But in
>> the driver, this is tied to global integrity disablement using
>> CONFIG_BLK_DEV_INTEGRITY.
>> So, the idea is to introduce a bio flag REQ_INTEGRITY_OFFLOAD
>> that the filesystem can send. The block-integrity and NVMe driver do
>> the rest to make the offload work.
>>
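
To make the proposed flow concrete, here is a rough userspace sketch of
how a driver might turn such a bio flag into the PRACT bit. It is only an
illustration under assumptions: REQ_INTEGRITY_OFFLOAD is the flag proposed
above and does not exist today, its bit position here is made up, and
rw_control_for() is a hypothetical helper, not the NVMe driver's code.
Only the PRACT value follows NVME_RW_PRINFO_PRACT from include/linux/nvme.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same value as NVME_RW_PRINFO_PRACT in include/linux/nvme.h:
 * bit 13 of the 16-bit control field of a read/write command. */
#define NVME_RW_PRINFO_PRACT (1u << 13)

/* HYPOTHETICAL: the flag proposed in this thread. It is not an existing
 * kernel flag and the bit position is an arbitrary placeholder. */
#define REQ_INTEGRITY_OFFLOAD (1u << 0)

static_assert(NVME_RW_PRINFO_PRACT <= UINT16_MAX,
              "PRACT must fit in the 16-bit control field");

/* HYPOTHETICAL helper: decide the protection-information bits for one
 * read/write command. PRACT is requested only when the submitter asked
 * for offload and the namespace is formatted with protection information. */
static uint16_t rw_control_for(uint32_t bio_flags, bool ns_has_pi)
{
    uint16_t control = 0;

    if ((bio_flags & REQ_INTEGRITY_OFFLOAD) && ns_has_pi)
        control |= NVME_RW_PRINFO_PRACT; /* device generates/strips PI */

    return control;
}
```

With this shape the decision is per-I/O: a filesystem that sets the flag
gets PRACT on that command, and I/O without the flag is left unchanged,
instead of the behaviour depending on a build-time config option.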