On Tue, 04 Oct 2022, Amir Goldstein wrote: > On Mon, Oct 3, 2022 at 4:01 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote: > > > > On Sun, 2022-10-02 at 10:08 +0300, Amir Goldstein wrote: > > > On Fri, Sep 30, 2022 at 2:30 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote: > > > > > > > > The c/mtime and i_version currently get updated before the data is > > > > copied (or a DIO write is issued), which is problematic for NFS. > > > > > > > > READ+GETATTR can race with a write (even a local one) in such a way as > > > > to make the client associate the state of the file with the wrong change > > > > attribute. That association can persist indefinitely if the file sees no > > > > further changes. > > > > > > > > Move the setting of times to the bottom of the function in > > > > __generic_file_write_iter and only update it if something was > > > > successfully written. > > > > > > > > > > This solution is wrong for several reasons: > > > > > > 1. There is still file_update_time() in ->page_mkwrite() so you haven't > > > solved the problem completely > > > > Right. I don't think there is a way to solve the problem vs. mmap. > > Userland can write to a writeable mmap'ed page at any time and we'd > > never know. We have to specifically carve out mmap as an exception here. > > I'll plan to add something to the manpage patch for this. > > > > > 2. The other side of the coin is that post crash state is more likely to end > > > up data changes without mtime/ctime change > > > > > > > Is this really something filesystems rely on? I suppose the danger is > > that some cached data gets written to disk before the write returns and > > the inode on disk never gets updated. > > > > But...isn't that a danger now? Some of the cached data could get written > > out and the updated inode just never makes it to disk before a crash > > (AFAIU). I'm not sure that this increases our exposure to that problem. > > > > > > You are correct that that danger exists, but it only exists for overwriting > to allocated blocks. > > For writing to new blocks, mtime change is recorded in transaction > before the block mapping is recorded in transaction so there is no > danger in this case (before your patch). > > Also, observing size change without observing mtime change > after crash seems like a very bad outcome that may be possible > after your change. > > These are just a few cases that I could think of, they may be filesystem > dependent, but my gut feeling is that if you remove the time update before > the operation, that has been like that forever, a lot of s#!t is going to float > for various filesystems and applications. > > And it is not one of those things that are discovered during rc or even > stable kernel testing - they are discovered much later when users start to > realize their applications got bogged up after crash, so it feels like to me > like playing with fire. > > > > If I read the problem description correctly, then a solution that invalidates > > > the NFS cache before AND after the write would be acceptable. Right? > > > Would an extra i_version bump after the write solve the race? > > > > > > > I based this patch on Neil's assertion that updating the time before an > > operation was pointless if we were going to do it afterward. The NFS > > client only really cares about seeing it change after a write. > > > > Pointless to NFS client maybe. > Whether or not this is not changing user behavior for other applications > is up to you to prove and I doubt that you can prove it because I doubt > that it is true. > > > Doing both would be fine from a correctness standpoint, and in most > > cases, the second would be a no-op anyway since a query would have to > > race in between the two for that to happen. > > > > FWIW, I think we should update the m/ctime and version at the same time. > > If the version changes, then there is always the potential that a timer > > tick has occurred. So, that would translate to a second call to > > file_update_time in here. > > > > The downside of bumping the times/version both before and after is that > > these are hot codepaths, and we'd be adding extra operations there. Even > > in the case where nothing has changed, we'd have to call > > inode_needs_update_time a second time for every write. Is that worth the > > cost? > > Is there a practical cost for iversion bump AFTER write as I suggested? > If you NEED m/ctime update AFTER write and iversion update is not enough > then I did not understand from your commit message why that is. > > Thanks, > Amir. > Maybe we should split i_version updates from ctime updates. While it isn't true that ctime updates have happened before the write "forever" it has been true since 2.3.43[1] which is close to forever. For ctime there doesn't appear to be a strong specification of when the change happens, so history provides a good case for leaving it before. For i_version we want to provide clear and unambiguous semantics. Performing 2 updates makes the specification muddy. So I would prefer a single update for i_version, performed after the change becomes visible. If that means it has to be separate from ctime, then so be it. NeilBrown [1]: https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=636b38438001a00b25f23e38747a91cb8428af29