[Dammit, I thought I sent this earlier -- it looks like Thunderbird
swallowed my original draft for this email.]
Hi Eric,
This is the bug that I talked to you about at LPC. It is related to
devicemapper: it is not possible to issue DELETE and REMOVE operations
on a devicemapper device that is still mounted in $some_namespace.
[Before we go on, deferred removal and deletion can help here, but the
deferred operation will never kick in until the reference goes away. On
SUSE systems, deferred removal doesn't appear to work at all, but that's
an issue for us to solve.]

> *Scratches my head*
>
> That most definitely is not a leak. It is a creation of a duplicate set
> of mounts.

Sorry for my sloppy wording, "leak" in this context means (for me at
least) that the mount has been duplicated into a mount namespace due to
sloppiness or malice by userspace.

> It is true that unmount in the parent will not remove those duplicate
> mounts in the other mount namespaces. Unlink of the mountpoint will
> ensure that nothing is mounted on a file or directory in any mount
> namespace.
>
> Given that they are duplicates of the mount they will not make any
> umount fail in the original namespace.

The error is a bit confusing. If you read the code, it turns out that
"unmount failed" doesn't actually mean that umount(2) returned EBUSY. It
means that the umount(2) *succeeded* and then the subsequent cleanup by
Docker (where it tries to remove and delete unused devicemapper devices)
failed. This is caused by the mount still existing in a container's
mount namespace, due either to a race condition (that I may have a patch
to fix[1]) or to the container inadvertently bind-mounting the rootfs
mounts from the host (I've seen people do ridiculous things like a
recursive bind-mount of '/' into a container, for example).

> Further, mount namespace reference counts are held open by processes
> directly or by file descriptors or other mount namespaces which are
> held open by processes. So killing the appropriate set of processes
> should make the mount namespace go away.
>
> I have heard tell of cases where mounts make it into mount namespaces
> where no one was expecting them to exist, and no one wanting to kill
> the processes using those mount namespaces. But that is mostly
> sloppiness. That is not a big bad leak that cannot be prevented, it is
> an "oops did I put that there?" kind of situation.
>
> If there is something special about device mapper that needs to be
> taken into account I would love to hear about it.

You're right that most of this issue (in the case of Docker) is just
sloppiness, though the architecture makes it quite hard to resolve this
issue at the moment. But the problem in Docker /can/ be solved at least
partially by just removing the mountpoint. In fact I have a patch to do
that already[2].
However, there is a more fundamental issue here, one which is quite
concerning. It basically boils down to the fact that any unprivileged
user can create a reference to a devicemapper mount on the host in such
a way that the host won't know about it. A toy example is the following:

    % unshare -rm
    # mount --make-rprivate /
    # mount -t tmpfs tmpfs /tmp && mkdir /tmp/saved_mount
    # mount --rbind /some/devicemapper/mount /tmp/saved_mount

At this point, even if the host does an `rm /some/devicemapper/mount`,
the devicemapper reference will stick around in the "container". This
isn't an issue for most cases, but with devicemapper, the host might
want to do more management operations than just
`rm /some/devicemapper/mount`. They probably (like Docker does) want to
remove and/or delete the device. In the above situation, they would be
blocked from doing so. As I mentioned above, deferred deletion and
removal can help here (the operation succeeds on non-SUSE systems) but
there is still an issue that the space will not be reclaimed, because
the deferred operation will never kick off while there is still a
reference lying around.
As you've said, 8ed936b5671b ("vfs: Lazily remove mounts on unlinked
files and directories.") added the ability to reduce possible DoS
attacks against rmdir(2) by forcing unmounts in all mount namespaces
that would block the removal from working. I'm wondering whether there
would be interest in doing something similar for devicemapper's DELETE
and REMOVE operations? From my perspective, it is quite hard for
userspace to be able to resolve this issue. You mentioned that
> killing the appropriate set of processes should
> make the mount namespace go away.
But I'm not sure I understand how userspace can tell what "the
appropriate set of processes" is. Not to mention that the appropriate
set of processes could include an "innocent" process inside a container
that accidentally inherited the mount, while a malicious process has
duplicated the mount further -- making it hard for the host to remove.
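
To make the difficulty concrete, here is a minimal sketch (in Python) of
roughly the best a host tool could do today: walk /proc, read each
process's mountinfo, and group the PIDs by mount namespace wherever the
device's major:minor shows up. The function names and the "253:3" device
number are hypothetical, made up for illustration; nothing here is taken
from Docker or runc.

```python
import glob
import os


def mounts_of_device(mountinfo_text, major_minor):
    """Return the mount points in a mountinfo dump that sit on major_minor."""
    mount_points = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # mountinfo fields: mount ID, parent ID, major:minor, root, mount point, ...
        if len(fields) >= 5 and fields[2] == major_minor:
            mount_points.append(fields[4])
    return mount_points


def holders(major_minor, proc="/proc"):
    """Map mount namespace link -> PIDs and mount points that reference the device.

    Assumption: only namespaces with at least one process visible under
    `proc` can be found this way.
    """
    result = {}
    for path in glob.glob(os.path.join(proc, "[0-9]*")):
        name = os.path.basename(path)
        if not name.isdigit():
            continue
        try:
            ns = os.readlink(os.path.join(path, "ns", "mnt"))
            with open(os.path.join(path, "mountinfo"), errors="replace") as f:
                text = f.read()
        except OSError:
            continue  # process exited in the meantime, or permission denied
        mount_points = mounts_of_device(text, major_minor)
        if mount_points:
            entry = result.setdefault(ns, {"pids": [], "mount_points": mount_points})
            entry["pids"].append(int(name))
    return result


if __name__ == "__main__":
    # "253:3" is a made-up device number for a devicemapper device.
    for ns, info in holders("253:3").items():
        print(ns, info["pids"], info["mount_points"])
```

Even this is only a partial answer. The scan is racy, it needs enough
privilege to read other users' /proc entries, and it cannot see a mount
namespace that is kept alive solely by an open file descriptor or by a
bind-mount of the namespace file. It also says nothing about which of
the listed processes are safe to kill.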
[1]: https://github.com/opencontainers/runc/pull/1500
[2]: https://github.com/docker/docker/pull/34573
--
Aleksa Sarai
Snr. Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/