Re: Fixing NFS

Sage Weil <sage@xxxxxxxxxxxx> · Thu, 10 Feb 2011 11:03:15 -0800 (PST)

On Thu, 10 Feb 2011, Brian Chrisman wrote:
> On Mon, Feb 7, 2011 at 7:33 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> ...
> >
> > I believe the only place an actual MDS call is exposed to an NFS export is
> > in export.c's __cfh_to_dentry().  This is where the ino search is going to
> > need to get more sophisticated (at least on the client side).
> >
> > An ESTALE from the MDS generally means the starting ino in the request
> > isn't in the cache.  You can try all MDSs for one that has it.  Beyond
> > that, we'll need to implement more smarts on the server side!
> >
> > sage
> >
> 
> With further testing, I tracked this down to ESTALEs indeed being
> returned from __cfh_to_dentry().
> I'm guessing this is because it has been flushed from the MDS cache,
> as my max mds is 1 and it hasn't failed/migrated.
> 
> It looks like CEPH_MDS_OP_LOOKUPHASH is failing to find the dentry...
> I was hoping to see how the rest of the kernel client implements
> lookup when LOOKUPHASH fails, but it looks like only export.c is using
> that operation.  Is it possible to perform a full lookup (past the
> cache) of a file from a cfh?  Would appreciate pointers on
> implementation.

The idea with LOOKUPHASH is to take a dir ino, dentry hash, and ino, and 
try to locate the inode on the MDS.  The MDS will (currently) start with 
the dir (if it has it in cache; otherwise it returns ESTALE, which is what 
you're seeing), find the right directory fragment based on the dentry 
hash, and then look for the given ino in that dir frag.
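
To make those steps concrete, here is a toy Python model of the lookup.  It 
is only a sketch, not the actual MDS code: the ToyMDS class, its dict-based 
cache, and picking the fragment from the low bits of the hash are all made 
up for illustration, and returning ESTALE when the ino is missing from the 
frag is an assumption.

```python
import errno


class ToyMDS:
    """Toy model of the LOOKUPHASH steps; hypothetical, not real MDS code."""

    def __init__(self, frag_bits=1):
        self.frag_bits = frag_bits
        # Cached directories: dir ino -> {frag id -> set of child inos}
        self.cache = {}

    def frag_of(self, dentry_hash):
        # Assumption: select the fragment from the low bits of the hash.
        return dentry_hash & ((1 << self.frag_bits) - 1)

    def add(self, dir_ino, dentry_hash, ino):
        # Populate the toy cache with one dentry.
        frags = self.cache.setdefault(dir_ino, {})
        frags.setdefault(self.frag_of(dentry_hash), set()).add(ino)

    def lookuphash(self, dir_ino, dentry_hash, ino):
        # Step 1: start with the dir; if it is not cached, give up.
        frags = self.cache.get(dir_ino)
        if frags is None:
            return -errno.ESTALE
        # Step 2: find the directory fragment from the dentry hash.
        frag = frags.get(self.frag_of(dentry_hash), set())
        # Step 3: look for the given ino in that fragment.
        if ino not in frag:
            return -errno.ESTALE  # assumption: same error as a missing dir
        return ino


mds = ToyMDS()
mds.add(dir_ino=0x100, dentry_hash=0xABCD, ino=0x1001)
found = mds.lookuphash(0x100, 0xABCD, 0x1001)    # dir cached, ino present
missing = mds.lookuphash(0x200, 0xABCD, 0x1001)  # dir never cached
```

The second call models the failure in this thread: the starting dir is not 
in the cache, so the lookup stops at step 1.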

We can improve LOOKUPHASH to leverage the directory object backpointers on 
the MDS to make the dir location reliable.  That should eliminate ESTALE 
for everything except the case where the file was renamed to a new 
directory and then dropped out of caches.  Good enough, I hope?

> I also noticed that NFS4ERR_FHEXPIRED is not referenced anywhere in
> the kernel (particularly nfs client), so I'm guessing support for
> filehandle expiry is quite a ways off.
> 
> Another question: I'd like to reproduce this error more quickly by
> reducing the mds cache size.  I wanted to confirm 'mds_cache_size' is
> what i'm looking for... and that I'd set it in the mds stanza of the
> config with 'mds cache size = ####'?

Right.  You'll also want to reduce the size of the journal so that the 
dirty inodes are flushed to the dir objects more quickly (so they can be 
expired).  'mds log max segments = 2' should be okay.  You'll need to 
scribble some other metadata to fill up the journal and make the item you 
care about get flushed/trimmed.
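
Put together, the mds stanza might look something like the fragment below.  
The cache size of 1000 is an arbitrary small value picked for illustration, 
not a recommendation; use whatever makes the cache overflow quickly in your 
test.

```ini
[mds]
    ; arbitrary small value for testing (assumption, not a tuned setting)
    mds cache size = 1000
    ; keep the journal short so dirty inodes are flushed and can be expired
    mds log max segments = 2
```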

sage