Re: replacement i_version counter for xfs

On Tue, Jan 31, 2023 at 07:02:56AM -0500, Jeff Layton wrote:
> On Mon, 2023-01-30 at 13:05 +1100, Dave Chinner wrote:
> > On Wed, Jan 25, 2023 at 12:58:08PM -0500, Jeff Layton wrote:
> > > On Wed, 2023-01-25 at 08:32 -0800, Darrick J. Wong wrote:
> > > > On Wed, Jan 25, 2023 at 06:47:12AM -0500, Jeff Layton wrote:
> > > > > Note that there are two other lingering issues with i_version. Neither
> > > > > of these are xfs-specific, but they may inform the changes you want to
> > > > > make there:
> > > > > 
> > > > > 1/ the ctime and i_version can roll backward on a crash.
> > > > > 
> > > > > 2/ the ctime and i_version are both currently updated before write data
> > > > > is copied to the pagecache. It would be ideal if that were done
> > > > > afterward instead. (FWIW, I have some draft patches for btrfs and ext4
> > > > > for this, but they need a lot more testing.)
> > > > 
> > > > You might also want some means for xfs to tell the vfs that it already
> > > > did the timestamp update (because, say, we had to allocate blocks).
> > > > I wonder what people will say when we have to run a transaction before
> > > > the write to peel off suid bits and another one after to update ctime.
> > > > 
> > > 
> > > That's a great question! A related one came up once I started
> > > looking at this in more detail:
> > > 
> > > Most filesystems end up updating the timestamp via the call to
> > > file_update_time in __generic_file_write_iter. Today, that's called very
> > > early in the function, and if it fails, the write fails without changing
> > > anything.
> > > 
> > > What do we do now if the write succeeds, but update_time fails? We don't
> > 
> > On XFS, the timestamp update will either succeed or cause the
> > filesystem to shut down, as a failure with a dirty transaction is a
> > fatal, unrecoverable error.
> > 
> 
> Ok. So for xfs, we could move all of this to be afterward. Clearing
> setuid bits is quite rare, so that would only rarely require a
> transaction (in principle).

See my response in the other email about XFS and atomic buffered
write IO. We don't need to do an update after the write because
reads cannot race between the data copy and the ctime/i_version
update. Hence we only need one update, and it doesn't matter if it
is before or after the data copy into the page cache.
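
To make the ordering concrete, the XFS buffered write path is roughly
this (heavily simplified sketch; exact helper names vary by kernel
version):

    xfs_ilock(ip, XFS_IOLOCK_EXCL);   /* buffered reads take this shared */

    /* strip suid bits and bump ctime/i_version under the lock */
    file_modified(file);

    /* copy the data into the page cache */
    iomap_file_buffered_write(iocb, from, &xfs_buffered_write_iomap_ops);

    xfs_iunlock(ip, XFS_IOLOCK_EXCL);

Because the read side takes the same lock shared, a reader can never
see the new data paired with the old ctime/i_version, regardless of
which side of the copy the update lands on.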

> > > want to return an error on the write() since the data did get copied in.
> > > Ignoring it seems wrong too though. There could even be some way to
> > > exploit that by changing the contents while holding the timestamp and
> > > version constant.
> > 
> > If the filesystem has shut down, it doesn't matter that the data got
> > copied into the kernel - it's never going to make it to disk and
> > attempts to read it back will also fail. There's nothing that can be
> > exploited by such a failure on XFS - it's game over for everyone
> > once the fs has shut down....
> > 
> > > At this point I'm leaning toward leaving the ctime and i_version
> > > updates before the write, and just bumping the i_version a second time
> > > afterward. In most cases the second bump will end up being a no-op, unless
> > > an i_version query races in between.
> > 
> > Why not also bump ctime at write completion if a query races with
> > the write()? Wouldn't that put ns-granularity ctime based change
> > detection on a par with i_version?
> > 
> > Userspace isn't going to notice the difference - the ctime they
> > observe indicates that it was changed during the syscall. So
> > who/what is going to care if we bump ctime twice in the syscall
> > instead of just once in this rare corner case?
> > 
> 
> We could bump the ctime too in this situation, but it would be more
> costly. In most cases the i_version bump will be a no-op. The only
> exception would be when a query of i_version races in between the two
> bumps. That wouldn't be the case with the ctime, which would almost
> always require a second transaction.

You've missed the part where I suggested lifting the "nfsd sampled
i_version" state into an inode state flag rather than hiding it in
the i_version field. At that point, we could optimise away the
secondary ctime updates just like you are proposing we do with the
i_version updates.  Further, we could also use that state to
decide whether we need to use high resolution timestamps when
recording ctime updates - if the nfsd has not sampled the
ctime/i_version, we don't need high res timestamps to be recorded
for ctime....
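
As a rough sketch of that idea (I_CHANGE_SAMPLED is a made-up flag
name, not an existing kernel flag): nfsd sets the flag when it samples
the change attribute, and the write path only pays for a second bump
and a high-resolution timestamp when the flag is set:

    /* hypothetical i_state flag, set by nfsd when it samples ctime/i_version */
    #define I_CHANGE_SAMPLED	(1 << 30)

    /* at write completion (i_lock handling elided) */
    if (inode->i_state & I_CHANGE_SAMPLED) {
            /* a sample raced with this write - record another change */
            inode_inc_iversion(inode);
            inode->i_ctime = current_time(inode);   /* high-res stamp */
            inode->i_state &= ~I_CHANGE_SAMPLED;
    }
    /* otherwise the pre-write update already covers this write */

The current generic code keeps the equivalent "queried" state in the
low bit of the i_version counter itself (include/linux/iversion.h);
the point is to lift it out to where the timestamp code can see it as
well.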

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


