On Fri, May 21, 2021 at 11:50 PM Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote: > > > That's a solution, but I don't think it is feasible. Users may create many > > containers, and you can't make docker create all the containers first > > and create network namespace later, as you don't know if there are any > > containers to create later. > > It doesn't seem impossible, but worth noting why inside the commit log > this was not a preferred option. > In fact, the network namespace is just a case for the problem that the 'mount leak' caused. And this kind modification is not friendly to current docker users, it makes great changes to the usage of docker. > > We still have: > > start_kernel() --> vfs_caches_init() --> mnt_init() --> > > mnt_init() > { > ... > shmem_init(); > init_rootfs(); > init_mount_tree(); > } > > You've now modified init_rootfs() to essentially just set the new user_root, > and that's it. But we stil call init_mount_tree() even if we did set the > rootfs to use tmpfs. The variate of 'is_tmpfs' is only used in 'rootfs_init_fs_context'. I used ramfs_init_fs_context directly for rootfs, so it is not needed any more and I just removed it in init_rootfs(). The initialization of 'user_root' in init_rootfs() is used in: do_populate_rootfs -> mount_user_root, which set the file system( ramfs or tmpfs) of the second mount. Seems it's not suitable to place it in init_rootfs()...... > > In do_populate_ro > > tmpfs, and that's the real rootfs for initramfs. And I call this root > > as 'user_root', > > because it is created for user space. > > > > int __init mount_user_root(void) > > { > > return do_mount_root(user_root->dev_name, > > user_root->fs_name, > > root_mountflags, > > root_mount_data); > > } > > > > In other words, I moved the realization of 'rootfs_fs_type' here to > > do_populate_rootfs(), and fixed this 'rootfs_fs_type' with > > ramfs_init_fs_context, as it is a fake root now. > > do_populate_rootfs() is called from populate_rootfs() and that in turn > is a: > > rootfs_initcall(populate_rootfs); > > In fact the latest changes have made this to schedule asynchronously as > well. And so indeed, init_mount_tree() always kicks off first. So its > still unclear to me why the first mount now always has a fs context of > ramfs_init_fs_context, even if we did not care for a ramdisk. > > Are you suggesting it can be arbitrary now? With the existence of the new user_root, the first mount is not directly used any more, so the filesystem type of it doesn't matter. So it makes no sense to make the file system of the first mount selectable. To simplify the code here, I make it ramfs_init_fs_context directly. In fact, it's fine to make it shmen_init_fs_context here too. > > Now, the rootfs that user space used is separated with the init_task, > > and that's exactly what a block root file system does. > > The secondary effort is a bit clearer, its the earlier part that is not > so clear yet to me at least. > > Regardless, to help make the changes easier to review, I wonder if it > makes sense to split up your work into a few patches. First do what you > have done for init_rootfs() and the structure you added to replace the > is_tmpfs bool, and let initialization use it and the context. And then > as a second patch introduce the second mount effort. I thinks it is a good idea. I will reorganize the code and split it into several patches. Thanks! Menglong Dong