On Tue, Feb 28, 2023 at 10:52:15PM -0500, Theodore Ts'o wrote: > Emulated block devices offered by cloud VM’s can provide functionality > to guest kernels and applications that traditionally have not been > available to users of consumer-grade HDD and SSD’s. For example, > today it’s possible to create a block device in Google’s Persistent > Disk with a 16k physical sector size, which promises that aligned 16k > writes will be atomically. With NVMe, it is possible for a storage > device to promise this without requiring read-modify-write updates for > sub-16k writes. All that is necessary are some changes in the block > layer so that the kernel does not inadvertently tear a write request > when splitting a bio because it is too large (perhaps because it got > merged with some other request, and then it gets split at an > inconvenient boundary). Now that we've flung ourselves into the wild world of Software Defined Secure Storage as a Service*, I was thinking -- T10 PI gives the kernel a means to associate its own checksums (and a goofy u16 tag) with LBAs on disk. There haven't been that many actual SCSI devices that implement it, but I wonder how hard it would be for clod storage backends to export things like that? The storage nodes often have a bit more CPU power, too. Though admittedly the advent of customer-managed FDE in the cloud and might make that less useful? Just my random 2c late at night, --D * SDSSAAS: what you get from banging head on keyboard in frustration > There are also more interesting, advanced optimizations that might be > possible. For example, Jens had observed the passing hints that > journaling writes (either from file systems or databases) could be > potentially useful. Unfortunately most common storage devices have > not supported write hints, and support for write hints were ripped out > last year. That can be easily reversed, but there are some other > interesting related subjects that are very much suited for LSF/MM. > > For example, most cloud storage devices are doing read-ahead to try to > anticipate read requests from the VM. This can interfere with the > read-ahead being done by the guest kernel. So being able to tell > cloud storage device whether a particular read request is stemming > from a read-ahead or not. At the moment, as Matthew Wilcox has > pointed out, we currently use the read-ahead code path for synchronous > buffered reads. So plumbing this information so it can passed through > multiple levels of the mm, fs, and block layers will probably be > needed.