Oleg Nesterov <oleg@xxxxxxxxxx> writes: > To all: sorry for noise, I can't comment this patch. > > > But Eric, could you please help me to understand? I am totally confused. > > So, afaics, initially (even after MS_KERNMOUNT) fs_fully_visible("proc") > should return false. > > After the normal "mout -t proc none /proc/" it becomes true. > > And it is still true after, say, "mount -t ramfs none /proc/sys" because > "ls -ld /proc/sys" shows ->i_nlink == 1. > > However, say, "mount -t ramfs none /proc/tty/" should make > fs_fully_visible() == F, because in this case ->i_nlink == 4. > > Correct? > > If yes, could you explain what this "!CAP_SYS_ADMIN && !fs_fully_visible" > check actually tries to prevent and why? Right now a more accurate name would be fs_visible. The primary goal is to restrict the user namespace root to being able to mount proc and sysfs if nothing new is revealed, preserving the restrictions in jail type environments where proc and sysfs are not mounted at all. I would have dropped the ability for the userns root to create a fresh mount of proc and syfs, and required a bind mount instead except because of the namespace intertwining we need fresh mounts of proc and sys after new namespaces are created. What I really would love to have implemented would be a check that nothing is mounted on top of proc of sysfs but that fails because we have /sys/fs/.. and /proc/sys/fs_binfmt_misc. So I had to implement an empty directory check. And when I first implemented that empty directory check I had a complete and total brain fart. So what is really implemented at the moment would be more accurately called fs_visible(). For the real concern about jail environments where proc and sysfs are not mounted at all a fs_visible check is all that is really required, and if you look at 3.11? I think it was that is what was shipped. Unfortunately I cleaned things up and totally had a massive brain fart and failed to implement the is this an empty directory check properly and somehow failed to test mounting /proc in a real world environment and so caused a significant regression in 3.12 leading to a case where proc is not mountable. This patch just fixes that stupid regression. As the overmount protection in general is not needed as to my knolwedge no one overmounts proc or sysfs in a meaningful way. Now it would be very good to go back and fix fs_fully_visible to either be fs_visible and not care or to read overmounted directories and verify that they are actually empty. Reading directories at this point requires opening a file and probably playing with locks. Something I am not prepared to do for a trivial bug fix. Especially since getting it right only closes theoretical holes at the moment. Does that clarify things? Eric _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers