On Sun, 25 Feb 2024 at 17:58, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
>
> According to my reading just now, ext4 and btrfs (as well as bcachefs)
> also don't take the inode lock in the read path - xfs is the only one
> that does.

Yeah, I should have remembered that detail - DaveC has pointed it out
at some point how other filesystems don't actually honor the whole
"all or nothing visible to read".

And I was actually wrong about the common cases like ext2 - they use
generic_file_write_iter(), which does take that inode lock, and I was
confused with generic_perform_write() (which does not).

It was always the read side that didn't care, as you point out. It's
been some time since I looked at that.

But as mentioned, nobody has actually ever shown any real interest in
caring about the lack of POSIX technicality.

> I think write vs. write consistency is the more interesting case; the
> question there is does falling back to the inode lock when we can't lock
> all the folios simultaneously work.

I really don't think the write-write consistency is all that
interesting either, and it really does hurt.

If you're some toy database that would love to use buffered writes on
just a DB file, that "no concurrent writes" can hurt a lot. So then
people say "use DIO", but that has its own issues...

There is one obvious special case, and I think it's the primary one
why we end up having that inode_lock: O_APPEND or any other write
extending the size of the file.

THAT one obviously has to work right, and that's the case when
multiple writers actually do want to get write-write consistency, and
where it makes total sense to serialize them all.

That's the one case that even DIO cares about.

In the other cases, it's hard to argue that "one or the other wins the
whole range" is seriously hugely better than "one or the other wins at
some granularity". What the hell are you doing overlapping write
ranges for if you have a "one or the other" mentality?

Of course, maybe in practice it would be fine to do the "lock all the
folios, with the fallback being the inode lock" - and we could even
start with "all" being a pretty small number (perhaps starting with
"one" ;^).

                 Linus
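
For illustration only, here is a minimal userspace sketch of the O_APPEND
case mentioned above, where several writers extend the same file and
serializing them is what you actually want. It is not kernel code and not
taken from this thread: the function names (append_records,
run_append_demo, count_torn_records) and the record sizes are hypothetical.
It assumes a Linux local filesystem whose buffered write path serializes
writers, as generic_file_write_iter() does with the inode lock.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sizes, chosen only to keep the demo small. */
enum { RECORD_LEN = 64, RECORDS_PER_WRITER = 500, WRITERS = 4 };

/* One writer: open with O_APPEND and append fixed-size records of 'tag'. */
static int append_records(const char *path, char tag)
{
	char rec[RECORD_LEN];
	int fd = open(path, O_WRONLY | O_APPEND);

	if (fd < 0)
		return -1;
	memset(rec, tag, sizeof(rec));
	for (int i = 0; i < RECORDS_PER_WRITER; i++) {
		/* With O_APPEND the offset is moved to EOF as part of the write. */
		if (write(fd, rec, sizeof(rec)) != (ssize_t)sizeof(rec)) {
			close(fd);
			return -1;
		}
	}
	return close(fd);
}

/* Fork WRITERS concurrent appenders; return the final file size, or -1. */
long run_append_demo(const char *path)
{
	pid_t pids[WRITERS];
	struct stat st;
	int forked = 0, failed = 0;
	int fd;

	assert(path != NULL);
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	close(fd);

	for (; forked < WRITERS; forked++) {
		pids[forked] = fork();
		if (pids[forked] < 0) {
			failed = 1;
			break;
		}
		if (pids[forked] == 0)
			_exit(append_records(path, (char)('A' + forked)) ? 1 : 0);
	}

	/* Reap every child we managed to start, even if a fork failed. */
	for (int i = 0; i < forked; i++) {
		int status;

		if (waitpid(pids[i], &status, 0) < 0 ||
		    !WIFEXITED(status) || WEXITSTATUS(status) != 0)
			failed = 1;
	}

	if (failed || stat(path, &st) < 0)
		return -1;
	return (long)st.st_size;
}

/* Count records that mix bytes from different writers; -1 on error. */
long count_torn_records(const char *path)
{
	char rec[RECORD_LEN];
	long torn = 0;
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	while ((n = read(fd, rec, sizeof(rec))) > 0) {
		if (n != (ssize_t)sizeof(rec)) {
			torn++;		/* trailing partial record */
			break;
		}
		for (size_t i = 1; i < sizeof(rec); i++) {
			if (rec[i] != rec[0]) {
				torn++;
				break;
			}
		}
	}
	close(fd);
	return n < 0 ? -1 : torn;
}
```

Calling run_append_demo() on a scratch file and then count_torn_records()
on the same path should show that no append was lost or overwritten and
that every record belongs to exactly one writer. Overlapping non-extending
writes are the case where "one or the other wins at some granularity"
would be the weaker behavior under discussion; this sketch does not
exercise that.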