On Mon, May 31, 2021 at 6:30 PM Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > [ Adding fsdevel, because this is not limited to gfs2 ] > > On Mon, May 31, 2021 at 12:56 AM Andreas Gruenbacher > <agruenba@xxxxxxxxxx> wrote: > > > > Andreas Gruenbacher (2): > > gfs2: Fix mmap locking for write faults > > This is bogus. > > I've pulled it, but this is just wrong. > > A write fault on a mmap IS NOT A WRITE to the filesystem. > > It's a read. > > Yes, it will later then allow writes to the page, but that's entirely > immaterial - because the write is going to happen not at fault time, > but long *after* the fault, and long *after* the filesystem has > installed the page. > > The actual write will happen when the kernel returns from the user space. > > And no, the explanation in that commit makes no sense either: > > "When a write fault occurs, we need to take the inode glock of the underlying > inode in exclusive mode. Otherwise, there's no guarantee that the > dirty page > will be written back to disk" > > the thing is, FAULT_FLAG_WRITE only says that the *currently* pending > access that triggered the fault was a write. That's entirely > irrelevant to a filesystem, because > > (a) it might be a private mapping, and a write to a page will cause a > COW by the VM layer, and it's not actually a filesystem write at all > > AND > > (b) if it's a shared mapping, the first access that paged things in > is likely a READ, and the page will be marked writable (because it's a > shared mapping!) and subsequent writes will not cause any faults at > all. > > In other words, a filesystem that checks for FAULT_FLAG_WRITE is > _doubly_ wrong. It's absolutely never the right thing to do. It > *cannot* be the right thing to do. > > And yes, some other filesystems do this crazy thing too. If your > friends jumped off a bridge, would you jump too? > > The xfs and ext3/ext4 cases are wrong too - but at least they spent > five seconds (but no more) thinking about it, and they added the check > for VM_SHARED. So they are only wrong for reason (b) > > But wrong is wrong. The new code is not right in gfs2, and the old > code in xfs/ext4 is not right either. > > Yeah, yeah, you can - and people do - do things like "always mark the > page readable on initial page fault, use mkwrite to catch when it > becomes writable, and do timestamps carefully, at at least have full > knowledge of "something might become dirty" > > But even then it is ENTIRELY BOGUS to do things like write locking. > > Really. > > Because the actual write HASN'T HAPPENED YET, AND YOU WILL RELEASE THE > LOCK BEFORE IT EVER DOES! So the lock? It does nothing. If you think > it protects anything at all, you're wrong. > > So don't do write locking. At an absolute most, you can do things like > > - update file times (and honestly, that's quite questionable - > because again - THE WRITE HASN'T HAPPENED YET - so any tests that > depend on exact file access times to figure out when the last write > was done is not being helped by your extra code, because you're > setting the WRONG time. > > - set some "this inode will have dirty pages" flag just for some > possible future use. But honestly, if this is about consistency etc, > you need to do it not for a fault, but across the whole mmap/munmap. > > So some things may be less bogus - but still very very questionable. > > But locking? Bogus. Reads and writes aren't really any different from > a timestamp standpoint (if you think you need to mtime for write > accesses, you should do atime for reads, so from a filesystem > timestamp standpoint read and write faults are exactly the same - and > both are bogus, because by definition for a mmap both the reads and > the writes can then happen long long long afterwards, and repeatedly). > > And if that "set some flag" thing needs a write lock, but a read > doesn't, you're doing something wrong and odd. > > Look at the VM code. The VM code does this right. The mmap_sem is > taken for writing for mmap and for munmap. But a page fault is always > a read lock, even if the access that caused the page fault is a write. > > The actual real honest-to-goodness *write* happens much later, and the > only time the filesystem really knows when it is done is at writeback > time. Not at fault time. So if you take locks, logically you should > take them when the fault happens, and release them when the writeback > is done. > > Are you doing that? No. So don't do the write lock over the read > portion of the page fault. It is not a sensible operation. Thanks for the detailed explanation. I'll work on a fix. Andreas