On Sun, 25 Feb 2024 at 18:49, Al Viro <viro@xxxxxxxxxx> wrote:
>
> Uh? __generic_file_write_iter() doesn't take inode lock, but
> generic_file_write_iter() does.

Yes, as I replied to Kent later, I misremembered the details
(together with a too-quick grep). It's the reading side that just
doesn't care, and which causes us to not actually give the POSIX
guarantees anyway.

> O_APPEND handling aside, there's
> also file_remove_privs() in there and it does need that lock.

Fair enough, but that's such a rare special case that it really
doesn't merit taking the lock on every write.

What at least the ext4 DIO code does is something like "look at the
write position and the inode size, and decide optimistically whether
to take the lock shared or not". Then it re-checks after potentially
taking it shared (in case the inode size has changed), and might
decide to go for the write lock after all.

And I think we could fairly trivially do something similar on the
regular write side. Yes, it would need to check the SUID/SGID bits in
addition to the inode size change, I guess (I don't think the ext4
DIO code does, but my quick grepping might once again be incomplete).

Anyway, DaveC is obviously also right that for the "actually need to
do writeback" case, our writeback is currently intentionally
throttled, which is why the benchmark by Luis shows that "almost two
orders of magnitude" slowdown with buffered writeback. That's
probably mainly an effect of having a backing store with no delays,
but still..

However, the reason I dislike our write-side locking is that it
serializes even the entirely cached case. Now, writing concurrently
to the same inode is obviously strange, but it's a common thing for
databases. And while all the *serious* ones use DIO, I think the
buffered case really should do better.

Willy - tangential side note: I looked closer at the issue that you
reported (indirectly) with the small reads during heavy write
activity.
Our _reading_ side is very optimized and has none of the write-side
oddities that I can see: we just have

    filemap_read ->
      filemap_get_pages ->
        filemap_get_read_batch ->
          folio_try_get_rcu()

and there is no page locking or other locking involved (assuming the
page is cached and marked uptodate etc, of course).

So afaik, it really is just that *one* atomic access (and the
matching page ref decrement afterwards).

We could easily do all of this without getting any ref to the page at
all if we did the page cache release with RCU (and the user copy with
"copy_to_user_atomic()").

Honestly, anything else looks like a complete disaster. For tiny
reads, a temporary buffer sounds ok, but really *only* for tiny reads
where we could have that buffer on the stack.

Are tiny reads (handwaving: 100 bytes or less) really worth
optimizing for to that degree?

In contrast, the RCU-delaying of the page cache might be a good idea
in general. We've had other situations where that would have been
nice. The main worry would be low-memory situations, I suspect.

The "tiny read" optimization smells like a benchmark thing to me.
Even with the cacheline possibly bouncing, the system call overhead
for tiny reads (particularly with all the mitigations) should be
orders of magnitude higher than two atomic accesses.

           Linus