On Mon, May 26, 2014 at 11:26 AM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> On Mon, May 26, 2014 at 11:17:42AM -0700, Linus Torvalds wrote:
>> On Mon, May 26, 2014 at 8:27 AM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>> >
>> > That's the livelock. OK.
>>
>> Hmm. Is there any reason we don't have some exclusion around
>> "check_submounts_and_drop()"?
>>
>> That would seem to be the simplest way to avoid any livelock: just
>> don't allow concurrent calls (we could make the lock per-filesystem or
>> whatever). This whole case should all be for just exceptional cases
>> anyway.
>>
>> We already sleep in that thing (well, "cond_resched()"), so taking a
>> mutex should be fine.
>
> What makes you think that it's another check_submounts_and_drop()?

Two things.

(1) The facts.

Just check the callchains on every single CPU in Mika's original
email. It *is* another check_submounts_and_drop() (in Mika's case,
always through kernfs_dop_revalidate()).

It's not me "thinking" so. You can see it for yourself.

This seems to be triggered by systemd-udevd doing readlink() in every
single thread:

  SyS_readlink ->
    user_path_at_empty ->
      filename_lookup ->
        path_lookupat ->
          link_path_walk ->
            lookup_fast ->
              kernfs_dop_revalidate ->
                check_submounts_and_drop

and I suspect it's the "kernfs_active()" check that basically causes
that storm of revalidate failures that leads to lots of
check_submounts_and_drop() cases.

(2) The code.

Yes, the whole looping over the dentry tree happens in other places
too, but shrink_dcache_parents() is already called under s_umount, and
the out-of-memory pruning isn't done in a for-loop. So if it's a
livelock, check_submounts_and_drop() really is pretty special.

I agree that it's not necessarily unique from a race standpoint, but
it does seem to be in a class of its own when it comes to being able
to *trigger* any potential livelock. In particular, the fact that
kernfs can generate that sudden storm of check_submounts_and_drop()
calls when something goes away.

> I really, really wonder WTF is causing that - we have spent 20-odd
> seconds spinning while dentries in there were being evicted by
> something. That - on sysfs, where dentry_kill() should be non-blocking
> and very fast. Something very fishy is going on and I'd really like
> to understand the use pattern we are seeing there.

I think it literally is just a livelock.

Just look at the NMI backtraces for each stuck CPU: most of them are
waiting for the dentry lock in d_walk(). They probably all have a few
dentries on their own list. One of the CPUs is actually _in_
shrink_dentry_list().

Now, the way our ticket spinlocks work, they are actually fair: which
means that I can easily imagine us getting into a pattern where, given
the right insane starting conditions, each CPU basically ends up with
its own dentry list.

That said, the only way I can see that nobody ever makes any progress
is if somebody has the inode locked, and then dentry_kill() turns into
a no-op. Otherwise one of those threads should always kill one or more
dentries, afaik.

We do have that "trylock on i_lock, then trylock on parent->d_lock,
and if either of those fails, drop and re-try" loop. I wonder if we
can get into a situation where lots of people hold each other's dentry
locks sufficiently that dentry_kill() just ends up failing and
looping..

Anyway, I'd like Mika to test the stupid "let's serialize the dentry
shrinking in check_submounts_and_drop()" approach to see if his
problem goes away. I agree that it's not the _proper_ fix, but we're
damn late in the rc series..
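Something along these lines is what I mean - a completely untested
sketch, with the existing collect/shrink loop elided, and a single
global mutex only because that's the simplest thing to try (making it
per-filesystem or per-superblock would be fine too):

        static DEFINE_MUTEX(check_submounts_mutex);

        int check_submounts_and_drop(struct dentry *dentry)
        {
                int ret = 0;

                /* Negative dentry: nothing to walk, just unhash it. */
                if (!dentry->d_inode) {
                        d_drop(dentry);
                        return 0;
                }

                /*
                 * Only one task at a time gets to run the collect/shrink
                 * loop.  Everybody else waits here instead of walking the
                 * same subtree and racing to shrink the same dentries.
                 */
                mutex_lock(&check_submounts_mutex);

                /*
                 * ... existing loop stays as-is: d_walk() to collect
                 * unused dentries, shrink_dentry_list() on whatever was
                 * found, cond_resched(), repeat until the walk comes up
                 * empty ...
                 */

                mutex_unlock(&check_submounts_mutex);

                return ret;
        }

Since we already cond_resched() in there, sleeping on a mutex in this
path shouldn't be a problem, and the whole thing is only supposed to
trigger in exceptional situations anyway.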
Linus