Darrick,

> Could a SCSI device advertise 512b LBAs, 4096b physical blocks,
> a 64k atomic_write_unit_max, and a 1MB maximum transfer length
> (atomic_write_max_bytes)?

Yes.

> And does that mean that application software can send one 64k-aligned
> write and expect it either to be persisted completely or not at all?

Yes.

> And, does that mean that the application can send up to 16 of these
> 64k-aligned blocks as a single 1MB IO and expect that each of those 16
> blocks will either be persisted entirely or not at all?

Yes.

> There doesn't seem to be any means for the device to report /which/ of
> the 16 were persisted, which is disappointing. But maybe the
> application encodes LSNs and can tell after the fact that something
> went wrong, and recover?

Correct. Although we traditionally haven't had too much fun with partial
completion for sequential I/O either.

> If the same device reports a 2048b atomic_write_unit_min, does that mean
> that I can send between 2k and 64k of data as a single atomic write and
> that's ok? I assume that this weird situation (512b LBA, 4k physical,
> 2k atomic unit min) requires some fancy RMW but that the device is
> prepared to cr^Wpersist that correctly?

Yes. It would not make much sense for a device to report a minimum atomic
granularity smaller than the reported physical block size. But in theory
it could.

> What if the device also advertises a 128k atomic_write_boundary?
> That means that a 2k atomic block write will fail if it starts at 127k,
> but if it starts at 126k then that's ok. Right?

Correct.

> As for avoiding splits in the block layer, I guess that also means that
> someone needs to reduce atomic_write_unit_max and atomic_write_boundary
> if (say) some sysadmin decides to create a raid0 of these devices with a
> 32k stripe size?

Correct. Atomic limits will need to be stacked for MD and DM like we do
with the remaining queue limits.
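To make the constraints above concrete, here is a minimal sketch of the check a caller would apply before issuing a single atomic write. The function name and the defaults are illustrative only (using the example device's 2k unit min, 64k unit max, and 128k boundary), not a kernel or device API:

```python
def can_write_atomically(offset, length,
                         unit_min=2048,          # atomic_write_unit_min
                         unit_max=64 * 1024,     # atomic_write_unit_max
                         boundary=128 * 1024):   # atomic_write_boundary
    """Return True if a write of `length` bytes at byte `offset` can be
    issued as one atomic write under the limits discussed above."""
    # Size must fall within the advertised atomic unit min/max.
    if not (unit_min <= length <= unit_max):
        return False
    # The range must not straddle an atomic write boundary: the first
    # and last byte written must land in the same boundary-sized window.
    if boundary and offset // boundary != (offset + length - 1) // boundary:
        return False
    return True

# The boundary example from the thread:
print(can_write_atomically(126 * 1024, 2048))   # starts at 126k: True
print(can_write_atomically(127 * 1024, 2048))   # crosses 128k boundary: False
```

With these limits, the 1MB write discussed earlier would fail the size check and would have to be split into boundary-respecting chunks of at most atomic_write_unit_max bytes each.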
> It sounds like NVMe is simpler in that it would report 64k for both the
> max unit and the max transfer length? And for the 1MB write I mentioned
> above, the application must send 16 individual writes?

Correct.

> With my app developer hat on, the simplest mental model of this is that
> if I want to persist a blob of data that is larger than one device LBA,
> then atomic_write_unit_min <= blob size <= atomic_write_unit_max must be
> true, and the LBA range for the write cannot cross an atomic_write_boundary.
>
> Does that sound right?

Yep.

> Going back to my sample device above, the XFS buffer cache could write
> individual 4k filesystem metadata blocks using REQ_ATOMIC because 4k is
> between the atomic write unit min/max, 4k metadata blocks will never
> cross a 128k boundary, and we'd never have to worry about torn writes
> in metadata ever again?

Correct.

> Furthermore, if I want to persist a bunch of blobs in a contiguous LBA
> range and atomic_write_max_bytes > atomic_write_unit_max, then I can do
> that with a single direct write?

Yes.

> I'm assuming that the blobs in the middle of the range must all be
> exactly atomic_write_unit_max bytes in size?

If you care about each blob being written atomically, yes.

> And I had better be prepared to (I guess) re-read the entire range
> after the system goes down to find out if any of them did or did not
> persist?

If you crash or get an I/O error, then yes. There is no way to inquire
which blobs were written. Just like we don't know which LBAs were written
if the OS crashes in the middle of a regular write operation.

-- 
Martin K. Petersen      Oracle Linux Engineering