On Thu, 10 Feb 2011, Brian Chrisman wrote:
> On Mon, Feb 7, 2011 at 7:33 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> ...
> >
> > I believe the only place an actual MDS call is exposed to an NFS export is
> > in export.c's __cfh_to_dentry().  This is where the ino search is going to
> > need to get more sophisticated (at least on the client side).
> >
> > An ESTALE from the MDS generally means the starting ino in the request
> > isn't in the cache.  You can try all MDSs for one that has it.  Beyond
> > that, we'll need to implement more smarts on the server side!
> >
> > sage
> >
>
> With further testing, I tracked this down to ESTALEs indeed being
> returned from __cfh_to_dentry().
> I'm guessing this is because it has been flushed from the MDS cache,
> as my max mds is 1 and it hasn't failed/migrated.
>
> It looks like CEPH_MDS_OP_LOOKUPHASH is failing to find the dentry...
> I was hoping to see how the rest of the kernel client implements
> lookup when LOOKUPHASH fails, but it looks like only export.c is using
> that operation.  Is it possible to perform a full lookup (past the
> cache) of a file from a cfh?  Would appreciate pointers on
> implementation.

The idea with LOOKUPHASH is to take a dir ino, dentry hash, and ino, and
try to locate it on the MDS.  The MDS will (currently) start with the dir
(if it has it; otherwise it returns ESTALE, which is what you're seeing),
find the right directory fragment based on the dentry hash, and then look
for the given ino in that dir frag.

We can improve LOOKUPHASH to leverage the directory object backpointers on
the MDS to make the dir location reliable.  That should eliminate ESTALE
for everything except the case where the file was renamed to a new
directory and then dropped out of caches.  Good enough, I hope?

> I also noticed that NFS4ERR_FHEXPIRED is not referenced anywhere in
> the kernel (particularly nfs client), so I'm guessing support for
> filehandle expiry is quite a ways off.
>
> Another question: I'd like to reproduce this error more quickly by
> reducing the mds cache size.  I wanted to confirm 'mds_cache_size' is
> what I'm looking for... and that I'd set it in the mds stanza of the
> config with 'mds cache size = ####'?

Right.  You'll also want to reduce the size of the journal so that the
dirty inodes are flushed to the dir objects more quickly (so they can be
expired).  'mds log max segments = 2' should be okay.  You'll need to
scribble some other metadata to fill up the journal and make the item you
care about get flushed/trimmed.

sage
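
For reference, the two settings discussed above might be combined in the mds stanza of ceph.conf roughly as follows.  This is only a sketch: the cache size shown is an arbitrary small placeholder chosen for illustration (the thread does not name a value), and the section layout assumes a conventional ceph.conf.

```
[mds]
	; arbitrary small value for illustration only, not a recommendation
	mds cache size = 1000
	; keep the journal short so dirty inodes are flushed to the dir
	; objects sooner and can then be expired from the cache
	mds log max segments = 2
```

With something like this in place, generating unrelated metadata activity should fill the journal and push the inode of interest out of the MDS cache sooner.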