On Thu, Aug 15, 2013 at 2:37 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
>> I didn't think of that at all.
>>
>> If userspace does:
>>
>> ptr = mmap(...);
>> ptr[0] = 1;
>> sleep(1);
>> ptr[0] = 2;
>> sleep(1);
>> munmap();
>>
>> Then current kernels will mark the inode changed on (only) the ptr[0]
>> = 1 line. My patches will instead mark the inode changed when munmap
>> is called (or after ptr[0] = 2 if writepages gets called for any
>> reason).
>>
>> I'm not sure which is better. POSIX actually requires my behavior
>> (which is most irrelevant).
>
> Not by my reading of it. Posix states that c/mtime needs to be
> updated between the first access and the next msync() call. We
> update mtime on the first access, and so therefore we conform to the
> posix requirement....

It says "between a write reference to the mapped region and the next
call to msync()." Most write references don't cause page faults.

>
>> My behavior also means that, if an NFS
>> client reads and caches the file between the two writes, then it will
>> eventually find out that the data is stale.
>
> "eventually" is very different behaviour to the current behaviour.
>
> My understanding is that NFS v4 delegations require the underlying
> filesystem to bump the version count on *any* modification made to
> the file so that delegations can be recalled appropriately. So not
> informing the filesystem that the file data has been changed is
> going to cause problems.

We don't do that right now (and we can't without utterly destroying
performance) because we don't trap on every modification. See
below...

>
>> The current behavior, on
>> the other hand, means that a single pass of mmapped writes through the
>> file will update the times much faster.
>>
>> I could arrange for the first page fault to *also* update times when
>> the FS is exported or if a particular mount option is set. (The ext4
>> change to request the new behavior is all of four lines, and it's easy
>> to adjust.)
>
> What does "first page fault" mean?

The first write to the page triggers a page fault and marks the page
writable. The second write to the page (assuming no writeback happens
in the mean time) does not trigger a page fault or notify the kernel
in any way.

In current kernels, this chain of events won't work:

 - Server goes down
 - Server comes up
 - Userspace on server calls mmap and writes something
 - Client reconnects and invalidates its cache
 - Userspace on server writes something else *to the same page*

The client will never notice the second write, because it won't update
any inode state. With my patches, the client will as soon as the
server starts writeback.

So I think that there are cases where my changes make things better
and cases where they make things worse.

--Andy
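
For anyone who wants to watch the timestamp behavior discussed above on
their own kernel, here is a minimal sketch of the mmap / write / write /
munmap sequence that samples the file's mtime after each step. The names
trace_mmap_mtime, mtime_ns and struct mtime_trace are hypothetical helpers
made up for this illustration, not anything from the patches, and the delay
is passed in milliseconds rather than hard-coded as sleep(1). What the
samples show depends on the kernel and filesystem in use, so no expected
output is given.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical record of the mtime (ns since the epoch) seen at each step. */
struct mtime_trace {
    long long initial;       /* right after mmap(), before any store */
    long long after_first;   /* after ptr[0] = 1 (first store, page fault) */
    long long after_second;  /* after ptr[0] = 2 (same page, no fault expected) */
    long long after_munmap;  /* after munmap() */
};

/* Hypothetical helper: current mtime of fd in nanoseconds, or -1 on error. */
long long mtime_ns(int fd)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    return (long long)st.st_mtim.tv_sec * 1000000000LL + st.st_mtim.tv_nsec;
}

/* Sleep for delay_ms milliseconds so successive timestamps can differ. */
static void pause_ms(unsigned delay_ms)
{
    struct timespec ts;
    ts.tv_sec = delay_ms / 1000;
    ts.tv_nsec = (long)(delay_ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/*
 * Run the sequence from the thread against a scratch file at `path`
 * and record the mtime after each step. Returns 0 on success, -1 on error.
 */
int trace_mmap_mtime(const char *path, unsigned delay_ms, struct mtime_trace *t)
{
    const size_t len = 4096;   /* assumption: one small mapping is enough */
    volatile unsigned char *ptr;
    void *map;
    int fd;

    assert(t != NULL);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    ptr = map;

    t->initial = mtime_ns(fd);
    pause_ms(delay_ms);

    ptr[0] = 1;                /* first store to the page */
    t->after_first = mtime_ns(fd);
    pause_ms(delay_ms);

    ptr[0] = 2;                /* second store to the same page */
    t->after_second = mtime_ns(fd);
    pause_ms(delay_ms);

    munmap(map, len);
    t->after_munmap = mtime_ns(fd);

    close(fd);
    return 0;
}
```

Calling trace_mmap_mtime("/tmp/scratch.bin", 1000, &t) and printing the four
fields shows which step, if any, moved the mtime. On a kernel that behaves
as described for current kernels above, I would expect only the first store
to change it; the exact result may also depend on timestamp granularity and
on whether writeback happens to run during the pauses.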