menglong8.dong@xxxxxxxxx writes:

> From: Menglong Dong <dong.menglong@xxxxxxxxxx>
>
> If using container platforms such as Docker, upon initialization it
> wants to use pivot_root() so that currently mounted devices do not
> propagate to containers. An example of value in this is that
> a USB device connected prior to the creation of a container on the
> host gets disconnected after a container is created; if the
> USB device was mounted on containers, but already removed and
> umounted on the host, the mount point will not go away until all
> containers unmount the USB device.
>
> Another reason for container platforms such as Docker to use pivot_root
> is that upon initialization the net-namespace is mounted under
> /var/run/docker/netns/ on the host by dockerd. Without pivot_root
> Docker must either wait to create the network namespace prior to
> the creation of containers or simply deal with leaking this to each
> container.
>
> pivot_root is supported if the rootfs is an initrd or block device, but
> it's not supported if the rootfs uses an initramfs (tmpfs). This means
> container platforms today must resort to using block devices if
> they want to pivot_root from the rootfs. A workaround to use chroot()
> is not a clean viable option given every container will have a
> duplicate of every mount point on the host.
>
> In order to support using container platforms such as Docker on
> all the supported rootfs types we must extend Linux to support
> pivot_root on initramfs as well. This patch does the work to do
> just that.
>
> pivot_root will unmount the mount of the rootfs from its parent mount
> and mount the new root to it. However, when it comes to initramfs, it
> doesn't work, because the root filesystem has no parent mount, which
> makes initramfs not supported by pivot_root.
>
> In order to support pivot_root on initramfs we introduce a second
> "user_root" mount which is created before we do the cpio unpacking.
> The filesystem of the "user_root" mount is the same as the rootfs.
>
> While mounting the 'user_root', 'rootflags' is passed to it, and it means
> that we can set options for the mount of rootfs in boot cmd now.
> For example, the size of tmpfs can be set with 'rootflags=size=1024M'.

What is the flow where docker uses an initramfs?

Just thinking about this, I am not able to connect the dots.

The way I imagine the world is that an initramfs will be used either
when a linux system boots for the first time, or an initramfs would
come from the distribution you are running inside a container. In
neither case do I see docker being in a position to add functionality
to the initramfs, as docker is not responsible for it.

Is docker doing something creative, like running a container in a VM,
and running some code directly out of the initramfs, and wanting that
code to exactly match the non-VM case?

If that is the case, I think the easy solution would be to use an
actual ramdisk, where pivot_root works.

I really don't see why it makes sense for docker to be a special
snowflake and require kernel features that no other distribution does.

It might make sense to create a completely empty filesystem underneath
an initramfs, and use that new rootfs as the unchanging root of the
mount tree, if it can be done with a trivial amount of code, and
generally make everything cleaner.

As this change sits, it looks like a lot of code to handle a problem
in the implementation of docker, which quite frankly will be a pain
to maintain if this is not a clean general feature that other
people can also use.

Eric