On 2023/3/1 11:52, Theodore Ts'o wrote:
> Emulated block devices offered by cloud VMs can provide functionality to
> guest kernels and applications that has traditionally not been available
> to users of consumer-grade HDDs and SSDs. For example, today it's possible
> to create a block device in Google's Persistent Disk with a 16k physical
> sector size, which promises that aligned 16k writes will be atomic. With
> NVMe, it is possible for a storage device to make this promise without
> requiring read-modify-write updates for sub-16k writes. All that is
> necessary are some changes in the block layer so that the kernel does not
> inadvertently tear a write request when splitting a bio because it is too
> large (perhaps because it got merged with some other request and then gets
> split at an inconvenient boundary).
Yeah, most cloud vendors (including Alibaba Cloud) now use ext4 bigalloc to avoid MySQL's double write buffer. In addition to improving performance, this approach also minimizes unnecessary I/O traffic between compute and storage nodes. I once hacked together a COW-based in-house approach in XFS, using an optimized always_cow mode with some tricks to avoid storage dependency. But nowadays AWS and Google Cloud are all using ext4 bigalloc, so.. ;-)
> There are also more interesting, advanced optimizations that might be
> possible. For example, Jens had observed that passing hints that a write
> is a journaling write (whether from file systems or databases) could
> potentially be useful. Unfortunately, most common storage devices have not
> supported write hints, and support for write hints was ripped out of the
> kernel last year. That can be easily reversed, but there are some other
> interesting related subjects that are very much suited for LSF/MM. For
> example, most cloud storage devices do read-ahead to try to anticipate
> read requests from the VM. This can interfere with the read-ahead being
> done by the guest kernel, so it would be useful to be able to tell the
> cloud storage device whether a particular read request stems from
> read-ahead or not. At the moment, as Matthew Wilcox has pointed out, we
> use the read-ahead code path even for synchronous buffered reads, so
> plumbing this information through multiple levels of the mm, fs, and
> block layers will probably be needed.
That seems useful as well, although if my understanding is correct, it's somewhat unclear to me whether we could do more and find a better form than the current REQ_RAHEAD (at the moment, REQ_RAHEAD's use cases and impact are quite limited.)

Thanks,
Gao Xiang