Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

Jeff Layton <jlayton@xxxxxxxxxx> · Tue, 21 Mar 2017 17:54:00 -0400

On Tue, 2017-03-21 at 15:13 -0400, J. Bruce Fields wrote:
> On Tue, Mar 21, 2017 at 02:46:53PM -0400, Jeff Layton wrote:
> > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > > - It's durable; the above comparison still works if there were reboots
> > > > >   between the two i_version checks.
> > > > > 	- I don't know how realistic this is--we may need to figure out
> > > > > 	  if there's a weaker guarantee that's still useful.  Do
> > > > > 	  filesystems actually make ctime/mtime/i_version changes
> > > > > 	  atomically with the changes that caused them?  What if a
> > > > > 	  change attribute is exposed to an NFS client but doesn't make
> > > > > 	  it to disk, and then that value is reused after reboot?
> > > > > 
> > > > 
> > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > > the inode dirty and I think that will end up with the new i_version at
> > > > least being journalled before __mark_inode_dirty returns.
> > > 
> > > So you think the filesystem can provide the atomicity?  In more detail:
> > > 
> > 
> > Sorry, I hit send too quickly. That should have read:
> > 
> > "Yeah, there could be atomicity issues there."
> > 
> > I think providing that level of atomicity may be difficult, though
> > maybe there's some way to make the querying of i_version block until
> > the inode update has been journalled?
> 
> No idea.  Anyway, I'd like to figure out some reasonable requirement
> that we can document.
> 
> > 
> > > 	- if I write to a file, a simultaneous reader should see either
> > > 	  (old data, old i_version) or (new data, new i_version), not a
> > > 	  combination of the two.
> > > 	- ditto for metadata modifications.
> > > 	- the same should be true if there's a crash.
> > > 
> > > (If that's not possible, then I think we could live with a brief window
> > > of (new data, old i_version) as long as it doesn't persist beyond sync?)
> > > 
> > > > That said, I suppose it is possible for us to bump the counter, hand
> > > > that new counter value out to a NFS client and then the box crashes
> > > > before it makes it to the journal.
> > > > 
> > > > Not sure how big a problem that really is.
> > > 
> > > The other case I was wondering about may have been unclear.  Represent
> > > the state of a file by a (data, i_version) pair.  Say:
> > > 
> > > 	- file is modified from (F, V) to (F', V+1).
> > > 	- client reads and caches (F', V+1).
> > > 	- server crashes before writeback, so disk still has (F, V).
> > > 	- after restart, someone else modifies file to (F'', V+1).
> > > 	- original client revalidates its cache, sees V+1, concludes
> > > 	  file data is still F'.
> > > 
> > > This may not cause a real problem for clients depending only on
> > > traditional NFS close-to-open semantics.
> > > 
> > > 
> > 
> > No, I think that is a legitimate problem.
> > 
> > That said, after F'', the mtime would almost certainly be different
> > from the time after F', and that would likely be enough to prevent
> > confusion in NFS clients.
> 
> Oh, good point.  So, may be worth saying that anyone wanting to make
> sense of these across reboot should compare times as well (maybe that
> should be in nfs rfc's too).  I think that should be ctime not mtime,
> though?
> 

Yes, it might be worth a mention there. IIRC, it does mention that you
shouldn't just look at a single attribute for cache validation
purposes, but the wording is a bit vague. I can't find the section at
the moment though.

The more I think about it though, simply ensuring that we don't publish
 a new change attr until the inode update has hit the journal may be
the best we can do. I'd have to think about how to implement that
though.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>