On Thu, Mar 02, 2023 at 09:54:59PM -0500, Martin K. Petersen wrote:
>
> Hi Ted!
>
> > With NVMe, it is possible for a storage device to promise this without
> > requiring read-modify-write updates for sub-16k writes. All that is
> > necessary are some changes in the block layer so that the kernel does
> > not inadvertently tear a write request when splitting a bio because it
> > is too large (perhaps because it got merged with some other request,
> > and then it gets split at an inconvenient boundary).
>
> We have been working on support for atomic writes and it is not as
> simple as it sounds. Atomic operations in SCSI and NVMe have semantic
> differences which are challenging to reconcile. On top of that, both
> the SCSI and NVMe specs are buggy in the atomics department. We are
> working to get things fixed in both standards and aim to discuss our
> implementation at LSF/MM.

I'd be very interested to learn more about what you've found. I know
more than one cloud provider is thinking about how to use the NVMe spec
to send information about how their emulated block devices work. This
has come up at our weekly ext4 video conference, and given that I gave
a talk about it in 2018[1], there's quite a lot of similarity in what
folks are thinking about. Basically, MySQL and Postgres use 16k
database pages, and if we can avoid their special doublewrite
techniques for avoiding torn writes, because they can depend on their
Cloud Block Devices Working A Certain Way, it can make for very
noticeable performance improvements.

[1] https://www.youtube.com/watch?v=gIeuiGg-_iw

So while the standards might allow standards-compliant physical devices
to do some really weird sh*t, if all cloud vendors do things the same
way, I could see various cloud workloads starting to depend on
extra-standard behaviour, much like a lot of sysadmins assume that
low-numbered LBAs are on the outer diameter of the HDD and are much
more performant than sectors on the i.d. of the HDD.
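(To make the torn-write condition concrete, here's a rough
back-of-the-envelope sketch --- plain Python, with made-up numbers and
a hypothetical helper, not anything from the block layer --- of the
check that effectively has to hold: a write is only untearable if it
fits within the device's atomic write unit and doesn't straddle an
aligned atomic boundary.)

```python
# Hypothetical illustration of the torn-write condition. Units are
# 512-byte logical blocks; "atomic_unit" loosely mirrors the NVMe
# atomic-write-unit concept, but the numbers and helper name are made
# up for this sketch.

def write_can_tear(lba: int, nblocks: int, atomic_unit: int) -> bool:
    """Return True if a write of nblocks at lba may be torn by a device
    that only guarantees atomicity within aligned atomic_unit regions."""
    if nblocks > atomic_unit:
        return True                      # bigger than the guarantee
    first_region = lba // atomic_unit
    last_region = (lba + nblocks - 1) // atomic_unit
    return first_region != last_region   # straddles an aligned boundary

# A 16k database page is 32 blocks. With a 32-block atomic unit:
print(write_can_tear(0, 32, 32))    # aligned 16k page write: False
print(write_can_tear(8, 32, 32))    # misaligned page write: True
print(write_can_tear(0, 64, 32))    # merged/oversized request: True
```

The last case is exactly the bio-splitting hazard above: merge two safe
writes into one request, split it at the wrong boundary, and a write
that the application believed was atomic can now tear.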
This is completely not guaranteed by the standard specs, but it's
become a de facto standard. That's not a great place to be, and it
would be great if we can find ways that are much more reliable in terms
of querying a standards-compliant storage device and knowing whether we
can depend on a certain behavior --- but I also know how slowly storage
standards bodies move. :-(

> Hinting didn't see widespread adoption because we in Linux, as well as
> the various interested databases, preferred hints to be per-I/O
> properties. Whereas $OTHER_OS insisted that hints should be statically
> assigned to LBA ranges on media. This left vendors having to choose
> between two very different approaches and consequently they chose not
> to support any of them.

I wasn't aware of that history. Thanks for filling that bit in.
Fortunately, in 2023, it appears that for many cloud vendors, the teams
involved care a lot more about Linux than $OTHER_OS. So hopefully we'll
have a lot more success in getting write hints generally available to
hyperscale cloud customers.

From an industry-wide perspective, it would be useful if the write
hints used by Hyperscale Cloud Vendor #1 are very similar to the write
hints supported by Hyperscale Cloud Vendor #2. Standards committees
aren't the only way that companies can collaborate in an anti-trust
compliant way. Open source is another way; and especially if we can
show that a set of hints works well for the Linux kernel and Linux
applications --- then what we ship in the Linux kernel can help shape
the set of "write hints" that cloud storage devices will support.

					- Ted

P.S. From an LSF/MM program perspective, I suspect we may want to have
more than one session: one focused on standards and atomic writes, and
another focused on write hints. The first might be mostly block and fs
focused, and the second would probably be of interest to mm folks as
well.