Hello Karel and all,
I'd like to ask your advice regarding proper usage of unshare+nsenter
to create persistent containers. I understand unshare(1) is rather
low-level, but I would still like to understand how to use
it.
Apologies in advance for the long email, but I hope it will
result in better documentation (or at least a better understanding for
me).
There are many bits and pieces of information
around (man pages and blogs and stack-overflow, etc.),
but I haven't been able to find an authoritative example
of using it to create a contained re-entrant persistent environment.
(If I missed it, please do point me to it).
Step 1: preparations
--------------------
All my testing was done on stock Debian 8.7,
with kernel 3.16.39-1+deb8u1,
and util-linux 2.29.2 compiled from source.
All commands run as 'root'.
Extrapolating from unshare's man page about creating
a persistent environment:
basedir=/var/namespaces/ns1
mkdir -p $basedir
mount --bind $basedir $basedir
mount --make-private $basedir
for i in uts mnt pid net ipc user ;
do
    touch $basedir/$i
done
Are these correct?
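
For experimenting, the preparation steps above can be wrapped in a small
function. This is only a sketch: the names prepare_nsdir, run and DRY_RUN are
hypothetical helpers, not part of util-linux. The mountpoint(1) check is an
assumption about how one might avoid stacking a second bind mount when the
function is run twice.

```shell
#!/bin/sh
# Sketch: prepare a private bind mount that will hold the namespace files.
# DRY_RUN=1 prints the commands instead of running them (hypothetical hook).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

prepare_nsdir() {
    basedir=$1
    run mkdir -p "$basedir"
    # Bind-mount only if the directory is not already a mount point.
    if ! mountpoint -q "$basedir" 2>/dev/null; then
        run mount --bind "$basedir" "$basedir"
        run mount --make-private "$basedir"
    fi
    # One empty file per namespace type, as in the loop above.
    for ns in uts mnt pid net ipc user; do
        run touch "$basedir/$ns"
    done
}
```

Calling "DRY_RUN=1 prepare_nsdir /var/namespaces/ns1" shows what would be
executed; without DRY_RUN the commands need root.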
Step 2: creating shared namespace
---------------------------------
(for now, I'm ignoring user-namespace, as it brings
its own complications.)
Starting a new environment using the following:
unshare --uts=$basedir/uts \
    --mount=$basedir/mnt \
    --ipc=$basedir/ipc \
    --pid=$basedir/pid \
    --net=$basedir/net \
    --mount-proc \
    --fork \
    sh -c 'hostname foobar ; exec /bin/bash -il'
And indeed I get a prompt inside the container:
root@foobar# ps ax
PID TTY STAT TIME COMMAND
1 pts/2 S 0:00 /bin/bash -il
8 pts/2 R+ 0:00 ps ax
root@foobar# ifconfig -a
lo Link encap:Local Loopback
LOOPBACK MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
On the outside host, I see the mounts and the namespaces:
# findmnt -o TARGET
[...]
└─/var/namespaces/ns1
├─/var/namespaces/ns1/ipc
├─/var/namespaces/ns1/uts
├─/var/namespaces/ns1/net
├─/var/namespaces/ns1/pid
└─/var/namespaces/ns1/mnt
# lsns
NS TYPE NPROCS PID USER COMMAND
[...]
4026532329 mnt 2 19221 root unshare --uts=..
4026532330 uts 2 19221 root unshare --uts=..
4026532331 ipc 2 19221 root unshare --uts=..
4026532332 pid 1 19223 root /bin/bash -il
4026532334 net 2 19221 root unshare --uts=..
Step 3: Re-entering
-------------------
Trying to enter based on PID works:
# nsenter -t 19223 -m -u -i -n -p \
    sh -c 'hostname ; echo ; ps ax ; echo ; ifconfig -a'
foobar
PID TTY STAT TIME COMMAND
1 pts/2 S+ 0:00 /bin/bash -il
15 pts/1 S+ 0:00 sh -c hostname ; ps ax
17 pts/1 R+ 0:00 ps ax
lo Link encap:Local Loopback
LOOPBACK MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
However trying to enter by the persistent mounts does not
re-enter the pid/net namespace:
# nsenter --uts=$basedir/uts \
    --mount=$basedir/mnt \
    --ipc=$basedir/ipc \
    --pid=$basedir/pid \
    --net=$basedir/net \
    sh -c 'hostname ; echo ; ps ax ; echo ; ifconfig -a'
foobar
Error, do this: mount -t proc proc /proc
Warning: cannot open /proc/net/dev (No such file or directory).
Limited output.
Listing /proc inside the container shows it only lists PID 1
(the running '/bin/bash' from the original 'unshare' invocation).
Based on a naive reading of the unshare(1) man page (with the example of
a persistent UTS namespace at the bottom), I assumed the above two examples,
by PID and by persistent mount points, should be equivalent.
Is this a kernel limitation?
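
One way to narrow this down is to check whether a persistent file and a
running process refer to the same namespace at all. The sketch below compares
the inode numbers behind two namespace files; these are the same numbers that
lsns prints in its NS column. The function names ns_inode and same_ns are
hypothetical, and "stat -c" assumes GNU coreutils.

```shell
#!/bin/sh
# Print the namespace inode number behind a namespace file.
# -L follows the /proc/PID/ns/* links; it is harmless for bind-mounted files.
ns_inode() {
    stat -L -c %i "$1"
}

# Succeed if both paths refer to the same namespace.
same_ns() {
    [ "$(ns_inode "$1")" = "$(ns_inode "$2")" ]
}
```

For example, with the PID from the lsns output above:
same_ns "$basedir/pid" /proc/19223/ns/pid && echo same || echo different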
Step 4: PID namespace is never persistent?
------------------------------------------
IIUC, this is a kernel limitation:
If the program which is PID1 inside the container
terminates, there is no way to re-enter the PID namespace
(http://man7.org/linux/man-pages/man7/pid_namespaces.7.html).
Is that correct?
If so, perhaps it would be helpful to add a caveat in the
unshare/nsenter man pages, saying the PID namespace will
not persist if its init process (PID 1) terminates?
And if this is the case, would the following
work to create a re-entrant persistent namespace:
unshare --uts=$basedir/uts \
    --mount=$basedir/mnt \
    --ipc=$basedir/ipc \
    --pid=$basedir/pid \
    --net=$basedir/net \
    --mount-proc \
    --fork \
    sleep inf
Obviously sleep(1) is not a good PID1, but is it a conceptually correct
way to ensure the PID namespace is persistent?
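
A rough way to test whether a PID namespace can still accept new processes is
to try to run a trivial command in it. This sketch relies on my reading of
pid_namespaces(7), which says that fork(2) into a PID namespace fails with
ENOMEM once its init process has terminated; nsenter forks by default when
--pid is given, so that fork is the step expected to fail. The names
pidns_usable, report_pidns and the NSENTER override are hypothetical.

```shell
#!/bin/sh
# Sketch: probe whether a PID namespace file still accepts new processes.
# NSENTER is a hypothetical override hook so the probe can be exercised
# without root; by default the real nsenter(1) is used.
pidns_usable() {
    nsfile=$1
    # Run a no-op command inside the namespace and report its exit status.
    "${NSENTER:-nsenter}" --pid="$nsfile" true 2>/dev/null
}

report_pidns() {
    if pidns_usable "$1"; then
        echo "usable: $1"
    else
        echo "not usable: $1"
    fi
}
```

Usage (as root): report_pidns "$basedir/pid"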
There are already some examples of minimal 'init' for containers:
https://github.com/Yelp/dumb-init
https://github.com/krallin/tini
and most minimal: https://gist.github.com/rofl0r/6168719
I wonder if you would be willing to consider a patch to add
something like 'unshare --do-nothing-init', which
would simply create a process that does nothing except handle signals
and never terminates, to facilitate truly persistent namespaces with
unshare(1)? (If so, I'm happy to try to write it.)
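
To illustrate the idea only (this is not the proposed patch, and
'--do-nothing-init' does not exist), a stand-in can be sketched as a shell
function. The name do_nothing_init is hypothetical. It handles SIGTERM and
SIGINT and otherwise sleeps forever; it makes no attempt to match what
dumb-init or tini do, for example around reaping or signal forwarding.

```shell
#!/bin/sh
# Sketch of a "do nothing" init written as a shell function.
do_nothing_init() {
    child=
    # On SIGTERM/SIGINT, stop the helper sleep and exit cleanly.
    trap 'kill "$child" 2>/dev/null; exit 0' TERM INT
    while :; do
        # Sleep in the background so that wait can be interrupted by a signal.
        sleep 3600 &
        child=$!
        # "|| :" keeps the loop alive if wait returns non-zero.
        wait "$child" || :
    done
}
```

Saved to a script that ends by calling do_nothing_init, it could take the
place of "sleep inf" in the unshare command above.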
Thank you for reading so far.
regards,
- assaf
P.S.
I have more questions about proper usage of user-namespace and
switch_root/pivot_root, but I'll save them for later :)
P.P.S.
The download URL in the 2.29.2 announcement was http://ftp.kernel.org/
and it seems broken:
$ host ftp.kernel.org
Host ftp.kernel.org not found: 3(NXDOMAIN)
The working URL seems to be 'www.kernel.org' (www. instead of ftp.):
https://www.kernel.org/pub/linux/utils/util-linux/