> On Dec 22, 2023, at 7:06 PM, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Fri, Dec 22, 2023 at 08:10:54AM -0700, Keith Busch wrote:
>> If the host really wants to write in small granularities, then larger
>> block sizes just shifts the write amplification from the device to the
>> host, which seems worse than letting the device deal with it.
>
> Maybe? I'm never sure about that. See, if the drive is actually
> managing the flash in 16kB chunks internally, then the drive has to do a
> RMW which is increased latency over the host just doing a 16kB write,
> which can go straight to flash. Assuming the host has the whole 16kB in
> memory (likely?) Of course, if you're PCIe bandwidth limited, then a
> 4kB write looks more attractive, but generally I think drives tend to
> be IOPS limited not bandwidth limited today?
>

Fundamentally, if a storage device supports a 16K physical sector size, then I am
not sure that we can write with 4K I/O requests. It means that we should read the
16K LBA into the page cache or the application's buffer before any write operation.
So, I see a potential RMW inside the storage device only if the device is capable
of managing 4K I/O requests even when the physical sector is 16K. But is that a
real-life use case?

I am not sure about the attractiveness of 4K write operations. Usually, the file
system provides a way to configure its internal logical block size and metadata
granularities. So it is possible to align the internal metadata and user data
granularities to 16K, for example. And if we are talking about metadata structures
(for example, the inode table, block mapping, etc.), then this is frequently
updated data. So, a 16K block will most probably contain several updated 4K
pieces, and, as a result, we have to flush all of this updated metadata anyway,
despite any PCIe bandwidth limitation (even if we have one). Also, I assume that
sending one 16K I/O request could be more beneficial than several 4K I/O requests.
Of course, real life is more complicated.

Thanks,
Slava.
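
P.S. Just to illustrate the last point, here is a minimal user-space sketch (not
from this thread): it reads back a block device's logical and physical sector
sizes via the BLKSSZGET/BLKPBSZGET ioctls and then submits a single aligned 16K
O_DIRECT write instead of several 4K writes. The device path and the 16K size are
assumptions for illustration only, and the write would overwrite the first 16K of
the device.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>

    int main(void)
    {
    	const char *dev = "/dev/nvme0n1";	/* hypothetical device, destructive! */
    	unsigned int pbs = 0;
    	int lbs = 0, fd, ret = 1;
    	void *buf = NULL;

    	fd = open(dev, O_WRONLY | O_DIRECT);
    	if (fd < 0) {
    		perror("open");
    		return 1;
    	}

    	/* Logical sector size: the smallest unit the device accepts. */
    	ioctl(fd, BLKSSZGET, &lbs);
    	/* Physical sector size: the unit that avoids device-side RMW. */
    	ioctl(fd, BLKPBSZGET, &pbs);
    	printf("logical %d bytes, physical %u bytes\n", lbs, pbs);

    	/* One 16K write, aligned to the (assumed) 16K physical sector. */
    	if (posix_memalign(&buf, 16384, 16384))
    		goto out;
    	memset(buf, 0xab, 16384);

    	if (pwrite(fd, buf, 16384, 0) == 16384)
    		ret = 0;
    	else
    		perror("pwrite");

    	free(buf);
    out:
    	close(fd);
    	return ret;
    }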