On Sat, 06 Jul 2024, Christoph Hellwig wrote:
> Btw, one issue with using direct I/O is that need to synchronize with
> page cache access from the server itself.  For pNFS we can do that as
> we track outstanding layouts.  Without layouts it will be more work
> as we'll need a different data structure tracking grant for bypassing
> the server.  Or just piggy back on layouts anyway as that's what they
> are doing.

I'm missing something here.

Certainly if localio or nfsd were to choose to use direct I/O we would
need to ensure proper synchronisation with page cache access.

Does VFS/MM already provide enough synchronisation?  A quick look at the
code suggests:
 - before an O_DIRECT read, any dirty pages that overlap are flushed to
   the device.
 - after an O_DIRECT write, any pages that overlap are invalidated.

So as long as I/O requests don't overlap we should have adequate
synchronisation.

If they do overlap we should expect inconsistent results.  Maybe we
would expect reads to only "tear" on a page boundary, and writes to only
interleave in whole pages, and probably using O_DIRECT would not give
any whole-page guarantees.  So maybe that is a problem.

If it is a problem, I think it can only be fixed by keeping track of
which pages are under direct I/O, and preventing access to the page-cache
for those regions.  This could be done in the page-cache itself, or in a
separate extent-tree.  I don't think the VFS/MM supports this - does any
filesystem?
(Or we could prevent adding any new pages to the page-cache for an inode
with i_dio_count > 0 - but that would likely hurt performance.)

I can see that pNFS extents could encode the information to enforce
this, but I don't see how that is mapped to filesystems in Linux at
present.

What am I missing?

Thanks,
NeilBrown