Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

"J. Bruce Fields" <bfields@xxxxxxxxxxxx> · Tue, 21 Mar 2017 15:13:30 -0400

On Tue, Mar 21, 2017 at 02:46:53PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > - It's durable; the above comparison still works if there were reboots
> > > >   between the two i_version checks.
> > > > 	- I don't know how realistic this is--we may need to figure out
> > > > 	  if there's a weaker guarantee that's still useful.  Do
> > > > 	  filesystems actually make ctime/mtime/i_version changes
> > > > 	  atomically with the changes that caused them?  What if a
> > > > 	  change attribute is exposed to an NFS client but doesn't make
> > > > 	  it to disk, and then that value is reused after reboot?
> > > > 
> > > 
> > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > the inode dirty and I think that will end up with the new i_version at
> > > least being journalled before __mark_inode_dirty returns.
> > 
> > So you think the filesystem can provide the atomicity?  In more detail:
> > 
> 
> Sorry, I hit send too quickly. That should have read:
> 
> "Yeah, there could be atomicity issues there."
> 
> I think providing that level of atomicity may be difficult, though
> maybe there's some way to make the querying of i_version block until
> the inode update has been journalled?

No idea.  Anyway, I'd like to figure out some reasonable requirement
that we can document.

> 
> > 	- if I write to a file, a simultaneous reader should see either
> > 	  (old data, old i_version) or (new data, new i_version), not a
> > 	  combination of the two.
> > 	- ditto for metadata modifications.
> > 	- the same should be true if there's a crash.
> > 
> > (If that's not possible, then I think we could live with a brief window
> > of (new data, old i_version) as long as it doesn't persist beyond sync?)
> > 
> > > That said, I suppose it is possible for us to bump the counter, hand
> > > that new counter value out to a NFS client and then the box crashes
> > > before it makes it to the journal.
> > > 
> > > Not sure how big a problem that really is.
> > 
> > The other case I was wondering about may have been unclear.  Represent
> > the state of a file by a (data, i_version) pair.  Say:
> > 
> > 	- file is modified from (F, V) to (F', V+1).
> > 	- client reads and caches (F', V+1).
> > 	- server crashes before writeback, so disk still has (F, V).
> > 	- after restart, someone else modifies file to (F'', V+1).
> > 	- original client revalidates its cache, sees V+1, concludes
> > 	  file data is still F'.
> > 
> > This may not cause a real problem for clients depending only on
> > traditional NFS close-to-open semantics.
> > 
> > 
> 
> No, I think that is a legitimate problem.
> 
> That said, after F'', the mtime would almost certainly be different
> from the time after F', and that would likely be enough to prevent
> confusion in NFS clients.

Oh, good point.  So, may be worth saying that anyone wanting to make
sense of these across reboot should compare times as well (maybe that
should be in nfs rfc's too).  I think that should be ctime not mtime,
though?

--b.