[ Adding fsdevel, because this is not limited to gfs2 ] On Mon, May 31, 2021 at 12:56 AM Andreas Gruenbacher <agruenba@xxxxxxxxxx> wrote: > > Andreas Gruenbacher (2): > gfs2: Fix mmap locking for write faults This is bogus. I've pulled it, but this is just wrong. A write fault on a mmap IS NOT A WRITE to the filesystem. It's a read. Yes, it will later then allow writes to the page, but that's entirely immaterial - because the write is going to happen not at fault time, but long *after* the fault, and long *after* the filesystem has installed the page. The actual write will happen when the kernel returns from the user space. And no, the explanation in that commit makes no sense either: "When a write fault occurs, we need to take the inode glock of the underlying inode in exclusive mode. Otherwise, there's no guarantee that the dirty page will be written back to disk" the thing is, FAULT_FLAG_WRITE only says that the *currently* pending access that triggered the fault was a write. That's entirely irrelevant to a filesystem, because (a) it might be a private mapping, and a write to a page will cause a COW by the VM layer, and it's not actually a filesystem write at all AND (b) if it's a shared mapping, the first access that paged things in is likely a READ, and the page will be marked writable (because it's a shared mapping!) and subsequent writes will not cause any faults at all. In other words, a filesystem that checks for FAULT_FLAG_WRITE is _doubly_ wrong. It's absolutely never the right thing to do. It *cannot* be the right thing to do. And yes, some other filesystems do this crazy thing too. If your friends jumped off a bridge, would you jump too? The xfs and ext3/ext4 cases are wrong too - but at least they spent five seconds (but no more) thinking about it, and they added the check for VM_SHARED. So they are only wrong for reason (b) But wrong is wrong. The new code is not right in gfs2, and the old code in xfs/ext4 is not right either. Yeah, yeah, you can - and people do - do things like "always mark the page readable on initial page fault, use mkwrite to catch when it becomes writable, and do timestamps carefully, at at least have full knowledge of "something might become dirty" But even then it is ENTIRELY BOGUS to do things like write locking. Really. Because the actual write HASN'T HAPPENED YET, AND YOU WILL RELEASE THE LOCK BEFORE IT EVER DOES! So the lock? It does nothing. If you think it protects anything at all, you're wrong. So don't do write locking. At an absolute most, you can do things like - update file times (and honestly, that's quite questionable - because again - THE WRITE HASN'T HAPPENED YET - so any tests that depend on exact file access times to figure out when the last write was done is not being helped by your extra code, because you're setting the WRONG time. - set some "this inode will have dirty pages" flag just for some possible future use. But honestly, if this is about consistency etc, you need to do it not for a fault, but across the whole mmap/munmap. So some things may be less bogus - but still very very questionable. But locking? Bogus. Reads and writes aren't really any different from a timestamp standpoint (if you think you need to mtime for write accesses, you should do atime for reads, so from a filesystem timestamp standpoint read and write faults are exactly the same - and both are bogus, because by definition for a mmap both the reads and the writes can then happen long long long afterwards, and repeatedly). And if that "set some flag" thing needs a write lock, but a read doesn't, you're doing something wrong and odd. Look at the VM code. The VM code does this right. The mmap_sem is taken for writing for mmap and for munmap. But a page fault is always a read lock, even if the access that caused the page fault is a write. The actual real honest-to-goodness *write* happens much later, and the only time the filesystem really knows when it is done is at writeback time. Not at fault time. So if you take locks, logically you should take them when the fault happens, and release them when the writeback is done. Are you doing that? No. So don't do the write lock over the read portion of the page fault. It is not a sensible operation. Linus