Re: [2.6.38-3.x] [BUG] soft lockup - CPU#X stuck for 23s! (vfs, autofs, vserver)

Herbert Poetzl <herbert@xxxxxxxxxxxx> · Tue, 25 Sep 2012 07:05:59 +0200

On Mon, Sep 24, 2012 at 11:17:42AM -0700, Eric W. Biederman wrote:
> Herbert Poetzl <herbert@xxxxxxxxxxxx> writes:

>> On Mon, Sep 24, 2012 at 07:23:55AM +0200, Paweł Sikora wrote:
>>> On Sunday 23 of September 2012 18:10:30 Linus Torvalds wrote:
>>>> On Sat, Sep 22, 2012 at 11:09 PM, Paweł Sikora <pluto@xxxxxxxxxxxxx> wrote:

>>>>>         br_read_lock(vfsmount_lock);

>>>> The vfsmount_lock is a "local-global" lock, where a read-lock
>>>> is rather cheap and takes just a per-cpu lock, but the
>>>> downside is that a write-lock is *very* expensive, and can
>>>> cause serious trouble.
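
The cost asymmetry described above can be modelled in userspace. This is only a rough sketch, not the kernel's brlock code: `NR_CPUS`, `cpu_lock` and the `lg_*` helpers are hypothetical names, and pthread mutexes stand in for per-cpu spinlocks. Each lock helper returns the number of locks it had to take.

```c
#include <pthread.h>

#define NR_CPUS 4       /* assumed fixed CPU count for this sketch */

/* One lock per CPU; hypothetical stand-in for the per-cpu part of a brlock. */
static pthread_mutex_t cpu_lock[NR_CPUS];

static void lg_init(void)
{
        for (int i = 0; i < NR_CPUS; i++)
                pthread_mutex_init(&cpu_lock[i], NULL);
}

/* Reader: takes only the lock of the CPU it runs on. */
static int lg_read_lock(int cpu)
{
        pthread_mutex_lock(&cpu_lock[cpu]);
        return 1;
}

static void lg_read_unlock(int cpu)
{
        pthread_mutex_unlock(&cpu_lock[cpu]);
}

/* Writer: takes every per-CPU lock, in a fixed order to avoid deadlock. */
static int lg_write_lock(void)
{
        int taken = 0;

        for (int i = 0; i < NR_CPUS; i++) {
                pthread_mutex_lock(&cpu_lock[i]);
                taken++;
        }
        return taken;
}

static void lg_write_unlock(void)
{
        for (int i = NR_CPUS - 1; i >= 0; i--)
                pthread_mutex_unlock(&cpu_lock[i]);
}
```

In this model a reader pays for one lock, while a writer pays for `NR_CPUS` locks and stalls readers on every CPU until it is done, which is why frequent [un]mount activity hurts.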

>>>> And the write lock is taken by the [un]mount() paths. Do *not*
>>>> do crazy things. If you do some insane "unmount and remount
>>>> autofs" on a 1s granularity, you're doing insane things.

>>>> Why do you have that 1s timeout? Insane.

>>> 1s unmount timeout is *only* for fast bug reproduction (in few
>>> seconds after opteron startup) and testing potential patches.
>>> normally with 60s timeout it happens in few minutes..hours
>>> (depends on machine i/o+cpu load) and makes server unusable
>>> (permanent soft-lockup).

>>> can we redesign vserver's mnt_is_reachable() for better locking
>>> to avoid total soft-lockup?

>> currently we do:

>>         br_read_lock(&vfsmount_lock);
>>         root = current->fs->root;
>>         root_mnt = real_mount(root.mnt);
>>         point = root.dentry;

>>         while ((mnt != mnt->mnt_parent) && (mnt != root_mnt)) {
>>                 point = mnt->mnt_mountpoint;
>>                 mnt = mnt->mnt_parent;
>>         }

>>         ret = (mnt == root_mnt) && is_subdir(point, root.dentry);
>>         br_read_unlock(&vfsmount_lock);

>> and we have been considering moving the br_read_unlock()
>> to right before the is_subdir() call

>> if there are any suggestions how to achieve the same
>> with less locking I'm all ears ...

> Herbert, why do you need to filter the mounts that show up in a
> mount namespace at all?

that is actually a really good question!

> I would think a far more performant and simpler solution would
> be to just use mount namespaces without unwanted mounts.

we had this mechanism for many years, long before the
mount namespaces existed, and I vaguely remember that
early versions didn't get the proc entries right either

I took a quick look at the code and I think we can drop
the mnt_is_reachable() check and/or make it conditional
on setups without a mount namespace in place in the near
future (thanks for the input, really appreciated!)

> I'd like to blame this on the silly rcu_barrier in
> deactivate_locked_super that should really be in the module
> remove path, but that happens after we drop the br_write_lock.

> The kernel takes br_read_lock(&vfsmount_lock) during every rcu
> path lookup so mnt_is_reachable isn't particularly crazy just for
> taking the lock.

> I am with Linus on this one. Paweł, even 60s for your mount
> timeout looks too short for your workload. All of the readers
> that take br_read_lock(&vfsmount_lock) seem to be showing up in
> your oops. The only thing that seems to make sense is you have
> a lot of unmount activity running back to back, keeping the
> lock write held.

> The only other possible culprit I can see is that it looks like
> mnt_is_reachable changes reading /proc/mounts to be something
> worse than linear in the number of mounts and reading /proc/mounts
> starts taking the vfsmount_lock.  All minor things but when you
> are pushing things hard they look like things that would add up.

> Eric