Ram Pai <linuxram@xxxxxxxxxx> writes: > On Mon, May 15, 2017 at 03:10:38PM -0500, Eric W. Biederman wrote: >> >> It was observed that in some pathlogical cases that the current code >> does not unmount everything it should. After investigation it was >> determined that the issue is that mnt_change_mntpoint can can change >> which mounts are available to be unmounted during mount propagation >> which is wrong. >> >> The trivial reproducer is: >> $ cat ./pathological.sh >> >> mount -t tmpfs test-base /mnt >> cd /mnt >> mkdir 1 2 1/1 >> mount --bind 1 1 >> mount --make-shared 1 >> mount --bind 1 2 >> mount --bind 1/1 1/1 >> mount --bind 1/1 1/1 >> echo >> grep test-base /proc/self/mountinfo >> umount 1/1 >> echo >> grep test-base /proc/self/mountinfo >> >> $ unshare -Urm ./pathological.sh >> >> The expected output looks like: >> 46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000 >> 47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> 48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> 49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> 50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> 51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> 54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> 53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> 52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> >> 46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000 >> 47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> 48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> >> The output without the fix looks like: >> 46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000 >> 47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> 48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> 49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> 50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> 51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> 54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> 53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> 52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> >> 46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000 >> 47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> 48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> 52 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 >> >> That last mount in the output was in the propgation tree to be unmounted but >> was missed because the mnt_change_mountpoint changed it's parent before the walk >> through the mount propagation tree observed it. >> > > Looks patch correct to me. > Reviewed-by: Ram Pai <linuxram@xxxxxxxxxx> > > BTW: The logic of find_topper() looks not-so-accurate to me. Why dont we > explicitly flag tucked mounts with MNT_TUCKED, and use that information > to determine if the child is really a topper? Currently we determine > the topper if it entirely is covering. How do we diambiguate that from an > entirely-covering-mount that is explicitly mounted by the administrator? > A topper situation is applicable only when tucked, right? In the current code explictly does not care about the difference. The code just restricts untucking mounts of any kind to umount propagation. This is where we have previously disagreed. A short summary of our previous discussions: Eric Biederman: find_topper makes tucked mounts ordinary mounts and is simple. Eric Biederman: I don't see a compelling case for a MNT_TUCKED flag Eric Biederman: I think the change is a nice behavioral improvement Ram Pai: a MNT_TUCKED flag would perfectly preserve existing behavior Ram Pai: find_topper while not perfect is better than the previous very special case for side/shadow mounts With respect to backwards compatibility the set of bugs I am fixing shows that it is possible to have some very egregious bugs in this area and in practice no one cares. Without a MNT_TUCKED flag I can readily tell what the following code should do by simply inspection of the of the mount propgation information in /proc/self/mountinfo: $ mount -t tmpfs test-base /mnt $ cd /mnt $ mkdir -p 1 2 1/1 $ mount --bind 1 1 $ mount --make-shared 1 $ mount --bind 1 2 $ mount --bind 1/1 1/1 $ mount --bind 1/1 1/1 $ umount 1/1 Before the umount /proc/self/mountinfo shows: 46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=502,gid=502 47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=502,gid=502 48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=502,gid=502 49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=502,gid=502 50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=502,gid=502 51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=502,gid=502 54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=502,gid=502 53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=502,gid=502 52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=502,gid=502 So it is clear to me that umount /mnt/1/1 should just leave: /mnt /mnt/1 /mnt/2 I would argue that is what the code should always have done. I believe the code with a MNT_TUCKED flag would leave: /mnt /mnt/1 /mnt/1/1 /mnt/2 But in truth it makes my head hurt to think about it. I don't see that MNT_TUCKED adds anything except aditional code complexity. I don't actually see what the value is in keeping mounts that you can not use (because they are overmounted) around. If the scenarios we were talking about were all limited to perfoming a mount and then undoing that mount I could almost see some value in a MNT_TUCKED flag. Given that one of the justications for tucking mounts in the first place is what happens when you umount something on a slave mount I really don't like it. As now I get the question what happens on a slave mount where a mount has been propagated and tucked, and then the topper is unmounted and a new topper is added. Should unmount on the parent untuck the propagated mount or leave it there? It was propagated it was tucked but it wasn't tucked under what is currenty on top. I much prefer the current semantics where we just say mount propagation can tuck and untuck things, and the history of how the mount tree got into its current shape is not important. Given how difficult it has been to make this code performant and correct I am not particularly eager to add complexity for unnecessary bug compatibility. But if it creates a breaking regression for something (other than a regression test) I am willing to add MNT_TUCKED. Eric