Re: Fixing NFS

Sage Weil <sage@xxxxxxxxxxxx> · Mon, 7 Feb 2011 19:33:45 -0800 (PST)

On Mon, 7 Feb 2011, Brian Chrisman wrote:
> On Thu, Feb 3, 2011 at 9:52 AM, Tommi Virtanen
> <tommi.virtanen@xxxxxxxxxxxxx> wrote:
> > On Thu, Feb 03, 2011 at 09:29:32AM -0800, Sage Weil wrote:
> >> Which leaves us with a final problem: what if the fh is generated for
> >> /foo/bar, but bar is renamed to /baz/bar, bar drops out of all caches, and
> >> the client tries to use the fh.  We're still stuck with ESTALE in that
> >> case.  The only real solution there is to include a backpointer on the
> >> file's data object.  This is doable, but comes at a cost.  We could make
> >> it optional, and/or mitigate it somewhat (backpointer is only created once
> >> a file is renamed, or something like that).
> >>
> >> I'm not really sure to what lengths a server is supposed to go to avoid
> >> ESTALE.  I seem to remember that NFSv4 has a different class of fh's that
> >> are allowed to expire.  I'm not sure how that helps, though; it seems
> >> like if a client has a file open that is renamed by another node and then
> >> idle for long enough and then tries to read it'll still be screwed,
> >> regardless of what the server does/does not promise the client.
> >
> > NFSv4 volatile filehandles move away from the whole "stale"
> > terminology into "expiring" filehandles, which a client SHOULD recover
> > from, and that's said with fairly strong language in RFC3530. The
> > volatile filehandles may go away at any moment (for FH4_VOLATILE_ANY).
> >
> > The RFC suggests clients remember the full path of every volatile
> > filehandle, and points out that doesn't let you recover if someone
> > else renamed the file, which means your "final problem" above is
> > still a problem, and smells unavoidable. But at least shifting
> > responsibility for remembering the path to the client makes recovery
> > easy in the typical case.
> >
> > If the real-world support is there, I'd say NFSv4 is the way to go,
> > for future Ceph re-exporting.
> 
> 
> I was playing around with implementing this.  I was trying to get the
> ceph client's export functions to return NFS4ERR_FHEXPIRED instead of
> ESTALE (hoping that my nfs4 clients would then attempt the full lookup
> again).  I also noticed that the MDS itself can return an ESTALE
> to the ceph kernel client, which seems to be getting propagated back
> to the NFS client.  I'm wondering where I could intercept that and
> send back an expiry notice?

I believe the only place an actual MDS call is exposed to an NFS export is 
in export.c's __cfh_to_dentry().  This is where the ino search is going to 
need to get more sophisticated (at least on the client side).

An ESTALE from the MDS generally means the starting ino in the request 
isn't in the cache.  You can try all MDSs for one that has it.  Beyond 
that, we'll need to implement more smarts on the server side!
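
Roughly, the "try all MDSs" fallback could look like the sketch below.  This 
is a userspace illustration only, not the kernel client's code: struct 
mock_mds, mock_mds_lookup_ino() and lookup_ino_any_mds() are hypothetical 
names made up for the example, and each MDS is modeled as nothing more than 
the list of inos it currently has in cache.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one MDS: just the inos it has in cache. */
struct mock_mds {
    const uint64_t *cached;
    size_t n_cached;
};

/* Hypothetical lookup-by-ino on a single MDS.
 * Returns 0 if the ino is cached there, -ESTALE if it is not. */
int mock_mds_lookup_ino(const struct mock_mds *mds, uint64_t ino)
{
    for (size_t i = 0; i < mds->n_cached; i++)
        if (mds->cached[i] == ino)
            return 0;
    return -ESTALE;
}

/* Ask each MDS in turn.  Returns the rank of the first MDS that has the
 * ino, or -ESTALE once every MDS has been tried without success.  Any
 * error other than -ESTALE is passed straight back to the caller. */
int lookup_ino_any_mds(const struct mock_mds *mds, int n_mds, uint64_t ino)
{
    for (int rank = 0; rank < n_mds; rank++) {
        int err = mock_mds_lookup_ino(&mds[rank], ino);
        if (err == 0)
            return rank;
        if (err != -ESTALE)
            return err;
    }
    return -ESTALE;
}
```

Only the final -ESTALE, after every MDS has been asked, would be the case 
that needs the extra server-side work.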

sage