Re: [RFC] lustre treatment of dentry->d_name

Hello!

On Oct 21, 2014, at 10:50 PM, Al Viro wrote:

>> Ah! I see now.
>> We do the "pick alias from a list" for everything, not just for the directories.
>> So that difference is still missing and as such there is a fear if we
>> convert to d_splice_alias in its current form, dcache would explode
>> from all the gazillions of invalid and unhashed dentries we produce during
>> operations.
> Interesting...  Where do those gazillions come from?  AFAICS, your
> ->d_revalidate() does that only when you get LOOKUP_OPEN | LOOKUP_CREATE
> in flags, i.e. on the final component of pathname at open() with O_CREAT.
> Hmm...  So basically you are trying to force them into ->atomic_open()
> codepath?  Fine, but... why not simply have ->open() pick what hadn't
> gone through ->atomic_open()?

It's not the d_revalidate open path that's producing them.
It's ll_md_blocking_ast()->ll_invalidate_aliases()
(and then the subsequent checks for d_lustre_invalid()).

The idea is like this:
When we get a valid dentry from the server, it comes with a lock too.
As long as the lock is valid, the dentry remains valid indefinitely.
Now, once some operation on the server invalidates that state (and
revokes the lock), the client gets the cancellation and declares the
dentry invalid.
In the past Lustre locking was really coarse, since a lock covered an
inode and everything (metadata) attached to it, including all of its
aliases.
It's better now, but a sequence of, say, "stat()-setattr()" or similar
activity can still cause a lot of re-instantiations of dentries for the
same file and their subsequent rapid invalidation
(and parts of this could of course happen on different nodes).
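
To make the pattern concrete, here is a minimal sketch with made-up
names (the real code is ll_md_blocking_ast()/ll_invalidate_aliases()
plus the d_lustre_invalid() checks, and does quite a bit more): the
lock cancellation callback flips a flag in the per-dentry data, and
->d_revalidate() then refuses such dentries so they get looked up (and
locked) again.

#include <linux/dcache.h>
#include <linux/spinlock.h>

/* hypothetical per-dentry state hanging off dentry->d_fsdata */
struct sketch_dentry_data {
        int sdd_invalid;        /* set once the covering lock is cancelled */
};

/* called from the lock cancellation (blocking AST) path */
static void sketch_invalidate_dentry(struct dentry *dentry)
{
        struct sketch_dentry_data *sdd = dentry->d_fsdata;

        spin_lock(&dentry->d_lock);
        sdd->sdd_invalid = 1;   /* lock is gone, state may be stale */
        spin_unlock(&dentry->d_lock);
}

/* ->d_revalidate(): an invalidated dentry forces a fresh lookup */
static int sketch_d_revalidate(struct dentry *dentry, unsigned int flags)
{
        struct sketch_dentry_data *sdd = dentry->d_fsdata;

        return sdd->sdd_invalid ? 0 : 1;
}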

As for the actual open path, there's another complication in there.
Lustre has a so-called "open cache", basically a cache of open file
handles, since opens are really expensive (one RPC round trip to the
metadata server, and then another for the close), and when NFS does an
open/close for every write/read that really drags the whole Lustre
reexport performance down (that's the biggest issue, but there are
other use cases too, like stupid apps that do a lot of opens/closes on
the same file).
We also don't want to do this for every file open, because sometimes
caching the handles that way is too expensive and might need extra
lock bouncing (say, an open-for-exec vs. open-for-write situation, or
"create a bunch of files and then delete them right away" benchmarks).
So the logic is: if we come into the ->open() handler directly, with no
trace of an open handle obtained by a previous lookup, then either we
were called directly without any lookup (e.g. from NFS), or the file is
experiencing a particularly high rate of opens/closes and it's worth
caching the open handle; in that case we set a special flag and the
server gives us the open handle, plus the lock that allows us to cache
that handle.
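
Very roughly, and with purely illustrative names (this is not the
actual Lustre code), the decision in ->open() looks like this:

#include <linux/fs.h>

/* hypothetical state left behind by an intent lookup */
struct sketch_open_ctx {
        bool soc_has_open_handle;  /* lookup already opened it on the MDS */
};

/* stand-in for the actual open RPC to the metadata server */
static int sketch_open_rpc(struct inode *inode, struct file *file,
                           bool want_open_lock)
{
        /*
         * want_open_lock asks the server for an open handle plus the
         * lock that lets the client keep (cache) that handle.
         */
        return 0;
}

static int sketch_file_open(struct inode *inode, struct file *file)
{
        struct sketch_open_ctx *ctx = file->private_data;

        if (ctx && ctx->soc_has_open_handle)
                return 0;       /* the preceding lookup already opened it */

        /*
         * No trace of a lookup-obtained handle: either a direct
         * ->open() from a kernel user like NFS, or a file with a very
         * high open/close rate - ask for a cacheable open handle.
         */
        return sketch_open_rpc(inode, file, true);
}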
Now, if dentry revalidation matched a dentry, that would result in a
straight call into ->open() without a preceding lookup (naturally),
but then we cannot distinguish whether this is a real direct call to
->open() from a kernel user like NFS, or not.
In the past we just replicated part of the lookup code in ->revalidate,
so it sent lookup RPCs around and then stored the returned state in the
dentry to be reused by ->open() later, but that was not really pretty.
The other way around seems to be to just never report dentry cache hits
for any opens (this part is not submitted upstream yet, but it's in the
queue). It will look like this: http://review.whamcloud.com/11062
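
As a rough illustration of that idea only (the actual change is the
patch at the URL above and may well differ in detail), the revalidate
side would do something along these lines:

#include <linux/dcache.h>
#include <linux/namei.h>

static int sketch_revalidate(struct dentry *dentry, unsigned int flags)
{
        /*
         * Never report a cache hit for an open: force the VFS to do a
         * real lookup so ->atomic_open() (or lookup + ->open()) runs
         * and we can tell whether the open handle should be cached.
         */
        if (flags & LOOKUP_OPEN)
                return 0;

        /* ... the usual validity checks would go here ... */
        return 1;
}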

> Check what nfs4_file_open() is doing; if you find out that the damn thing
> *was* stale (i.e. that you really need a different inode, etc.), it's not
> a problem - d_drop() and return -EOPENSTALE; VFS will repeat lookups.
> Note that do_dentry_open() doesn't strip O_CREAT from file->f_flags until
> after return from ->open().  So you can see O_CREAT in ->open() just
> fine - this case is easy to distinguish there.

So in our invalid case the thing is not stale: we still need to repeat
the lookups and all that, but in most cases the underlying inode did
not change, so it makes no sense to have another dentry created for us
and we reuse one of the stale ones we found.

> Incidentally, why the hell do you have separate ll_revalidate_nd() and
> ll_revalidate_dentry()?  I realize that it'll be inlined by compiler (the
> only call of the latter is tail-call with identical arguments from the
> former), but…

It's a compatibility artefact from older kernels that did not have
->atomic_open() and where the revalidate took two arguments: a dentry
and a nameidata (hence the _nd in the name).
So we have two versions of ll_revalidate_nd, but only one in the
upstream kernel.
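
I.e. the out-of-tree compat code has to handle roughly this sort of
thing (OLD_KERNEL is just a stand-in for the real compatibility check,
and the old-kernel branch is my illustration, not the exact code):

#include <linux/dcache.h>
#include <linux/namei.h>

/* defined elsewhere in llite */
static int ll_revalidate_dentry(struct dentry *dentry,
                                unsigned int lookup_flags);

#ifdef OLD_KERNEL       /* older kernels: ->d_revalidate() took a nameidata */
static int ll_revalidate_nd(struct dentry *dentry, struct nameidata *nd)
{
        return ll_revalidate_dentry(dentry, nd ? nd->flags : 0);
}
#else                   /* current kernels */
static int ll_revalidate_nd(struct dentry *dentry, unsigned int flags)
{
        return ll_revalidate_dentry(dentry, flags);
}
#endif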

> Another nasty question: is d_need_statahead() safe in RCU pathwalk mode?
> When are ll_dentry_data and ll_inode_info freed?  Ditto for ->lli_sai.
> Sure, actual freeing of struct inode and struct dentry is RCU-delayed,
> but from the quick glance it seems that freeing those guys isn't...

Hm…
This code has been plagued by races since its inception in its previous
form, and through the later rewrite that's in the upstream kernel.
So I won't be totally surprised if there's more of that left.
We had some recent fixes in the lli_sai department for various
use-vs-cleanup races that I was hitting regularly in my special
race-promoting burn-in setup. Once I get back home early next week, I
should be able to send them in and also see exactly what they were.
ll_dentry_data is freed via ll_release()'s call to
call_rcu(&lld->lld_rcu_head, free_dentry_data);

and ll_inode_info is freed via ll_destroy_inode()'s call to
call_rcu(&inode->i_rcu, ll_inode_destroy_callback);

so I think we should be fine on this front?
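
(For reference, the pattern behind those two call_rcu() calls is
roughly the following; this is a simplified sketch of the dentry side,
trimmed to the lld_rcu_head member mentioned above.)

#include <linux/dcache.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct ll_dentry_data {
        struct rcu_head lld_rcu_head;   /* other members trimmed */
};

static void free_dentry_data(struct rcu_head *head)
{
        struct ll_dentry_data *lld;

        lld = container_of(head, struct ll_dentry_data, lld_rcu_head);
        kfree(lld);
}

static void ll_release(struct dentry *de)
{
        struct ll_dentry_data *lld = de->d_fsdata;

        if (!lld)
                return;

        de->d_fsdata = NULL;
        /*
         * Defer the actual free past an RCU grace period so that
         * rcu-walk lookups can still dereference d_fsdata safely.
         */
        call_rcu(&lld->lld_rcu_head, free_dentry_data);
}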

Bye,
    Oleg



