On Sat, May 22, 2021 at 12:09:30PM +0800, Menglong Dong wrote: > On Fri, May 21, 2021 at 11:50 PM Luis Chamberlain <mcgrof@xxxxxxxxxx> wrote: > > > > > That's a solution, but I don't think it is feasible. Users may create many > > > containers, and you can't make docker create all the containers first > > > and create network namespace later, as you don't know if there are any > > > containers to create later. > > > > It doesn't seem impossible, but worth noting why inside the commit log > > this was not a preferred option. > > > > In fact, the network namespace is just a case for the problem that the > 'mount leak' caused. And this kind modification is not friendly to > current docker users, it makes great changes to the usage of docker. You mean an upgrade of docker? If so... that does not seem like a definitive reason to do something new in the kernel *always*. However, if you introduce it as a kconfig option so that users who want to use this new feature can enable it, and then use it, the its sold as a new feature. Should this always be enabled, or done this way? Should we never have the option to revert back to the old behaviour? If not, why not? > > We still have: > > > > start_kernel() --> vfs_caches_init() --> mnt_init() --> > > > > mnt_init() > > { > > ... > > shmem_init(); > > init_rootfs(); > > init_mount_tree(); > > } > > > > You've now modified init_rootfs() to essentially just set the new user_root, > > and that's it. But we stil call init_mount_tree() even if we did set the > > rootfs to use tmpfs. > > The variate of 'is_tmpfs' is only used in 'rootfs_init_fs_context'. I used > ramfs_init_fs_context directly for rootfs, I don't see you using any context directly, where are you specifying the context directly? > so it is not needed any more > and I just removed it in init_rootfs(). > > The initialization of 'user_root' in init_rootfs() is used in: > do_populate_rootfs -> mount_user_root, which set the file system( > ramfs or tmpfs) of the second mount. > > Seems it's not suitable to place it in init_rootfs()...... OK I think I just need to understand how you added the context of the first mount explicitly now and where, as I don't see it. > > > In do_populate_ro > > > tmpfs, and that's the real rootfs for initramfs. And I call this root > > > as 'user_root', > > > because it is created for user space. > > > > > > int __init mount_user_root(void) > > > { > > > return do_mount_root(user_root->dev_name, > > > user_root->fs_name, > > > root_mountflags, > > > root_mount_data); > > > } > > > > > > In other words, I moved the realization of 'rootfs_fs_type' here to > > > do_populate_rootfs(), and fixed this 'rootfs_fs_type' with > > > ramfs_init_fs_context, as it is a fake root now. > > > > do_populate_rootfs() is called from populate_rootfs() and that in turn > > is a: > > > > rootfs_initcall(populate_rootfs); > > > > In fact the latest changes have made this to schedule asynchronously as > > well. And so indeed, init_mount_tree() always kicks off first. So its > > still unclear to me why the first mount now always has a fs context of > > ramfs_init_fs_context, even if we did not care for a ramdisk. > > > > Are you suggesting it can be arbitrary now? > > With the existence of the new user_root, the first mount is not directly used > any more, so the filesystem type of it doesn't matter. What do you mean? init_mount_tree() is always called, and it has statically: static void __init init_mount_tree(void) { struct vfsmount *mnt; ... mnt = vfs_kern_mount(&rootfs_fs_type, 0, "rootfs", NULL); ... } And as I noted, this is *always* called earlier than do_populate_rootfs(). Your changes did not remove the init_mount_tree() or modify it, and so why would the context of the above call always be OK to be used now with a ramfs context now? > So it makes no sense to make the file system of the first mount selectable. Why? I don't see why, nor is it explained, we're always caling vfs_kern_mount(&rootfs_fs_type, ...) and you have not changed that either. > To simplify the code here, I make it ramfs_init_fs_context directly. In fact, > it's fine to make it shmen_init_fs_context here too. So indeed you're suggesting its arbitrary now.... I don't see why. Luis