On Sat, Aug 2, 2014 at 12:09 AM, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
> Richard Weinberger <richard@xxxxxx> writes:
>
>> Am 01.08.2014 17:44, schrieb Ram Pai:
>>> On Fri, Aug 01, 2014 at 12:17:13AM +0200, Richard Weinberger wrote:
>>>> Am 30.07.2014 22:46, schrieb Richard Weinberger:
>>>>> Am 30.07.2014 15:59, schrieb Richard Weinberger:
>>>>>> If we use the plain list_empty() we might not see the
>>>>>> hlist_del_init_rcu() and therefore miss one member of the
>>>>>> list.
>>>>>>
>>>>>> It fixes the following issue:
>>>>>> $ unshare -m /usr/bin/sleep 10000 &
>>>>>> $ mkdir -p foo/proc
>>>>>> $ mount -t proc none foo/proc
>>>>>> $ mount -t binfmt_misc none foo/proc/sys/fs/binfmt_misc
>>>>>> $ umount -l foo/proc
>>>>>> $ rmdir foo/proc
>>>>>> rmdir: failed to remove ‘foo/proc’: Device or resource busy
>>>>>
>>>>> Although my fix was wrong, the issue is real; it seems to have existed for a very
>>>>> long time. I was just able to reproduce it on 2.6.32.
>>>>> Please note that you need a shared root subtree to trigger the issue,
>>>>> i.e. mount --make-shared /
>>>>> Maybe this is why nobody has noticed it so far, as only systemd distros
>>>>> have the root subtree shared by default.
>>>>>
>>>>> I hit the issue on openSUSE 13.1, where an application creates a chroot environment
>>>>> and then lazily umounts /proc.
>>>>> It happened on very few machines. An analysis showed that only boxes with an OpenVPN tunnel
>>>>> were affected. This did not make any sense until I discovered that the OpenVPN systemd
>>>>> service file sets "PrivateTmp=true". This setting creates
>>>>> a mount namespace for said service...
>>>>>
>>>>> In __propagate_umount() the following piece of code is interesting:
>>>>>
>>>>>         /*
>>>>>          * umount the child only if the child has no
>>>>>          * other children
>>>>>          */
>>>>>         if (child && list_empty(&child->mnt_mounts)) {
>>>>>                 hlist_del_init_rcu(&child->mnt_hash);
>>>>>                 hlist_add_before_rcu(&child->mnt_hash, &mnt->mnt_hash);
>>>>>         }
>>>>>
>>>>> child->mnt_mounts is non-empty for "proc" although the "binfmt_misc"
>>>>> subtree was removed.
>>>>> I'm not sure whether this is only one more symptom or the main culprit.
>>>>
>>>> CC'ing Ram Pai.
>>>>
>>>> Ram, you are the author of said code. Can you please explain why we need that
>>>> list_empty() check?
>>>> To my (limited) understanding of VFS, the following change should be fine to fix the issue:
>>>
>>> We made a rule back then that busy vfsmounts cannot be lazily unmounted
>>> **implicitly**. Propagated unmounts are implicit unmounts, and if such
>>> implicit vfsmounts have child mounts then they are obviously busy, and
>>> hence they cannot be lazily unmounted implicitly.
>>>
>>> The list_empty() checks that there are no child mounts on the vfsmount before
>>> letting it unmount.
>>>
>>> We did not want a bunch of mounts to disappear without the user's knowledge.
>>> Hence we made the above rule.
>>>
>>> Al Viro will have more insights into this.
>>
>> Hmm, with the root subtree shared by default this policy is problematic,
>> as I observe on openSUSE 13.1.
>>
>> Al, what do you think?
>
> I have a pending patchset that makes the rmdir cause all of the
> mounts to go away. It has passed review and has not been merged only
> because of stack overflow concerns (which I have not had time to fully
> address).
>
> Sigh. It badly breaks Unix semantics for rmdir/unlink with no mounts in
> the local namespace to fail, and it introduces a denial of service
> attack from unprivileged users.
Thanks for the pointer! I fear your patch series is not something we can easily feed
into -stable. Is this really the only acceptable solution?

The thing is, users are already getting bitten by this. E.g. run openSUSE's KIWI tool
on a machine where a systemd service with PrivateTmp=yes is installed and you'll end up
with a stale mount point. KIWI creates a chroot, populates /proc, lazily unmounts it
later and then fails to remove the temporary chroot directory because of EBUSY.

--
Thanks,
//richard
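
For reference, a sketch of the sequence that bites KIWI, modeled on the minimal
reproduction quoted above. The path /tmp/buildroot and the binfmt_misc child mount are
illustrative assumptions, not taken from the KIWI code; the precondition is a shared
root subtree plus at least one other mount namespace kept alive, e.g. by a service
started with PrivateTmp=true:

# precondition: root subtree is shared (systemd default) and some service
# with PrivateTmp=true is already running, e.g. OpenVPN on openSUSE 13.1
$ mkdir -p /tmp/buildroot/proc           # temporary chroot directory (path is illustrative)
$ mount -t proc none /tmp/buildroot/proc
$ mount -t binfmt_misc none /tmp/buildroot/proc/sys/fs/binfmt_misc
$ umount -l /tmp/buildroot/proc          # lazy umount, as KIWI does
$ rmdir /tmp/buildroot/proc
rmdir: failed to remove '/tmp/buildroot/proc': Device or resource busy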