Re: avoid unmounts in unprivileged containers

Thanks for your detailed explanation, Lennart; it's fully consistent with what I found while browsing the code.

I've been struggling with the problem you alluded to above: identifying "foreign" mountpoints. After banging my head against the wall for a while, I ended up implementing a heuristic based on the major:minor field of the /proc/pid/mountinfo file: if the container mountpoint being considered has a major:minor ID that matches one of the major:minor IDs present in the host mount namespace, it is likely a "foreign" mountpoint and shouldn't be unmounted.

Obviously, this would force you to extend the current systemd mountinfo parser. There is also a caveat: not every filesystem uses a unique ID for each new mountpoint (e.g. the "/dev/null" fs always uses the same major:minor ID across different mount namespaces), so there could be false positives, but that isn't a problem in our case. Here is the specific code if you want to check it out: https://github.com/nestybox/sysbox-fs/blob/master/mount/infoParser.go#L828

Please let me know if you ever find a better approach.

cheers,

/Rodny

On Wed, Feb 24, 2021 at 9:19 AM Lennart Poettering <lennart@xxxxxxxxxxxxxx> wrote:
On Fr, 19.02.21 19:17, Rodny Molina (rodnymolina@xxxxxxxxx) wrote:

> Hi,
>
> As part of a prototype I'm working on to run systemd within an unprivileged
> docker container, I would like to prevent mountpoints created at runtime
> from being unmounted during the container shutdown process. I understand
> that systemd creates "<blah>.mount" units dynamically for
> these mountpoints as they show up in /proc/pid/mountinfo, but after reading
> the docs + code, I don't see a way to avoid these unmounts during the
> shutdown.target execution.

Yeah, it would be great if we could automatically determine "foreign
owned" mounts, and then step away from them. But there's really no way
for us to figure that out, at least to my knowledge. Ideally
/proc/self/mountinfo would tell us about this in some field, but it
really doesn't afaik.

> Interestingly, I see that there's code
> <https://github.com/systemd/systemd/blob/main/src/shutdown/shutdown.c#L398>
> that
> skips the unmounting cycle attending to the ConditionVirtualization /
> containerized settings, which is what I need, but I'm not able to see
> that code being called during the container shutdown -- probably I'm not
> understanding systemd's fsm unwinding logic well enough ...

There are two phases of shutdown: the regular phase where we follow
mount unit deps, and stuff is umounted via /sbin/umount. i.e. where
the shutdown is handled by the usual unit logic.

And then there's the second phase which shutdown.c implements: it's a
separate binary that PID 1 invokes via execve() (so that it becomes
new PID 1) and then pretty robustly just tries to
umount/detach/disassemble/… whatever might be left over, without any
understanding of dependencies.

The first phase hence is the "clean" shutdown logic and the second
phase is the "dirty" fallback logic that tries really hard to sync/put
file systems into a clean state if the first phase fails (maybe
because of some misplaced deps).

The second phase is skipped in containers, the first one is not. The
second phase is unnecessary in containers since the container manager
and namespace cleanup take care of this anyway, and even if it didn't,
the host's shutdown logic can take responsibility for all this.

Now, if the kernel would provide us with the info we'd generate the
deps for .mount units synthesized from /proc/self/mountinfo in a way
that "foreign owned" mounts won't get unmounted in phase 1, but we
simply can't do that automatically since we can't distinguish
them. :-(

You could manually define .mount units for all mounts you know are
owned by the outside container manager, but that is nasty and
fragile. The mount units would have to carefully have the right deps
(or better: should miss the right deps) to ensure things are clean
when shutting down.

So yeah, I'd love to fix this properly, generically, but this requires
some kernel work first, and that's not just a technical difficulty but
given the maintainer of said interfaces also a political one.

Lennart

--
Lennart Poettering, Berlin


_______________________________________________
systemd-devel mailing list
systemd-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/systemd-devel
