On Mon, May 26, 2014 at 11:26 AM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> On Mon, May 26, 2014 at 11:17:42AM -0700, Linus Torvalds wrote:
>> On Mon, May 26, 2014 at 8:27 AM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>> >
>> > That's the livelock. OK.
>>
>> Hmm. Is there any reason we don't have some exclusion around
>> "check_submounts_and_drop()"?
>>
>> That would seem to be the simplest way to avoid any livelock: just
>> don't allow concurrent calls (we could make the lock per-filesystem or
>> whatever). This whole case should all be for just exceptional cases
>> anyway.
>>
>> We already sleep in that thing (well, "cond_resched()"), so taking a
>> mutex should be fine.
>
> What makes you think that it's another check_submounts_and_drop()?

Two things.

(1) The facts.

Just check the callchains on every single CPU in Mika's original
email. It *is* another check_submounts_and_drop() (in Mika's case,
always through kernfs_dop_revalidate()).

It's not me "thinking" so. You can see it for yourself.

This seems to be triggered by systemd-udevd doing readlink() in every
single thread:

  SyS_readlink ->
    user_path_at_empty ->
      filename_lookup ->
        path_lookupat ->
          link_path_walk ->
            lookup_fast ->
              kernfs_dop_revalidate ->
                check_submounts_and_drop

and I suspect it's the "kernfs_active()" check that basically causes
that storm of revalidate failures that leads to lots of
check_submounts_and_drop() cases.

(2) The code.

Yes, the whole looping over the dentry tree happens in other places
too, but shrink_dcache_parents() is already called under s_umount, and
the out-of-memory pruning isn't done in a for-loop. So if it's a
livelock, check_submounts_and_drop() really is pretty special.

I agree that it's not necessarily unique from a race standpoint, but
it does seem to be in a class of its own when it comes to being able
to *trigger* any potential livelock. In particular, the fact that
kernfs can generate that sudden storm of check_submounts_and_drop()
calls when something goes away.

> I really, really wonder WTF is causing that - we have spent 20-odd
> seconds spinning while dentries in there were being evicted by
> something. That - on sysfs, where dentry_kill() should be non-blocking
> and very fast. Something very fishy is going on and I'd really like
> to understand the use pattern we are seeing there.

I think it literally is just a livelock.

Just look at the NMI backtraces for each stuck CPU: most of them are
waiting for the dentry lock in d_walk(). They probably all have a few
dentries on their own list. One of the CPUs is actually _in_
shrink_dentry_list().

Now, the way our ticket spinlocks work, they are actually fair: which
means that I can easily imagine us getting into a pattern where, given
the right insane starting conditions, each CPU basically ends up with
its own dentry list.

That said, the only way I can see that nobody ever makes any progress
is if somebody has the inode locked, and then dentry_kill() turns into
a no-op. Otherwise one of those threads should always kill one or more
dentries, afaik.

We do have that "trylock on i_lock, then trylock on parent->d_lock,
and if either of those fails, drop and re-try" loop. I wonder if we
can get into a situation where lots of people hold each other's dentry
locks sufficiently that dentry_kill() just ends up failing and
looping..

Anyway, I'd like Mika to test the stupid "let's serialize the dentry
shrinking in check_submounts_and_drop()" approach to see if his
problem goes away. I agree that it's not the _proper_ fix, but we're
damn late in the rc series..
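Something along these lines is what I mean - a completely untested
sketch, with the existing collect/shrink loop elided, and a single
global mutex only because that's the simplest thing to try (making it
per-filesystem or per-superblock would be fine too):

        static DEFINE_MUTEX(check_submounts_mutex);

        int check_submounts_and_drop(struct dentry *dentry)
        {
                int ret = 0;

                /* Negative dentry: nothing to walk, just unhash it. */
                if (!dentry->d_inode) {
                        d_drop(dentry);
                        return 0;
                }

                /*
                 * Only one task at a time gets to run the collect/shrink
                 * loop.  Everybody else waits here instead of walking the
                 * same subtree and racing to shrink the same dentries.
                 */
                mutex_lock(&check_submounts_mutex);

                /*
                 * ... existing loop stays as-is: d_walk() to collect
                 * unused dentries, shrink_dentry_list() on whatever was
                 * found, cond_resched(), repeat until the walk comes up
                 * empty ...
                 */

                mutex_unlock(&check_submounts_mutex);

                return ret;
        }

Since we already cond_resched() in there, sleeping on a mutex in this
path shouldn't be a problem, and the whole thing is only supposed to
trigger in exceptional situations anyway.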
Linus