Re: Thoughts about cache consistency and directories in particular.

Neil Brown <neilb@xxxxxxx> · Sat, 21 Feb 2009 08:04:38 +1100

On Friday February 20, bfields@xxxxxxxxxxxx wrote:
> On Sat, Feb 21, 2009 at 06:47:58AM +1100, Neil Brown wrote:
> > 
> > I cannot see how there is a regression here.  Subsequent getattrs will
> > show all modifications (if you wait at least one second).
> > The first gettattr returns '-0.000000001', which is different from any
> > previously returned mtime.
> > Any subsequent getattr will return 0.999999999 or 1, depending on when
> > it arrives.
> 
> Sorry, I guess I misread "smallest difference that can be reported by
> the protocol" as "smallest difference supported by the filesystem"!
> 
> The former is currently *always* smaller than the latter, so you're
> reporting an mtime that will never arise in any other way.  So you're
> right, this results in strictly more cache revalidations in every case.
> 
> It may turn out that this mtime-1 case ends up being the typical case,
> since a single logical file modification may appear as multiple writes
> on the server, and those are likely to come in rapid succession.

When you look at a file that is changing, yes.  When you look at a
file that hasn't changed for a while, you just get normal mtime.

> 
> A "make" that takes less than one second, on an ext3 export, may result
> in targets with earlier mtimes than sources.

No.  If they were both changed in the last second, they will both have
1 nanosecond subtracted from the mtime, so they will still look like
they have the same mtime.

> 
> (Why not mtime+1?  And why not ctime?)

Not "mtime + 1" because mtime should normally be monotonically
increasing.
If you touch a file into the future, then when that moment comes, its
mtime will jump backwards 1 nanosecond, then jump forwards again.  But
I don't think that is a problem.

Why not what ctime?
I agree with Trond that we need to apply this adjustment to ctime too.
But we cannot set mtime == ctime.  That would just be wrong.

> 
> > The only possible regression is that sometimes we will flush the cache
> > when previously we didn't.  In each case where that changes, the
> > client can not possible know whether it needs to or not, so flushing
> > rather than not flushing is the safest option.
> > 
> > > 
> > > By the way, I have one sadly neglected todo here: ext4 has a real nfsv4
> > > changeattribute, which needs to be hooked up to the nfsd code.
> > 
> > Does it?
> > I just had a quick look, found that it stores a 64 bit number on disk
> > which is stored in inode->i_version.
> > And this is incremented for directory operations.  But it doesn't seem
> > to be changed for file operations.
> > 
> > But maybe I missed something.
> 
> After some mucking around with git and git grep... looks like the
> inode_inc_iversion() calls do the job.  Note there's an i_version mount
> option that's required.

Ahh.... Now if only they had called that function
"inode_inc_i_version", then my grep would have found it...

So in nfsd we can simply do:

 if (IS_I_VERSION(fhp->fh_dentry->d_inode)) {
	change_attribute = fhp->fh_dentry->d_inode->i_version;
 } else {
        change_attribute = kstat.ctime;
 }

??

NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html