On Sun, Oct 15, 2017 at 05:14:41PM -0500, Eric W. Biederman wrote: > > Looking at the code it appears ext4, f2fs, and xfs shutdown path > implements revoking a bdev from a filesystem. Further if the ext4 > implementation is anything to go by it looks like something we could > generalize into the vfs. There are two things which the current file system shutdown paths do. The first is that they prevent the file system from attempting to write to the bdev. That's all very file system specific, and can't be generalized into the VFS. The second thing they do is they cause system calls which might modify the file system to return an error. Currently operations that might result in _reads_ are not shutdown, so it's not a true revoke(2) functionality ala *BSD. I assume that's what you are talking about generalizing into the VFS. Personally, I would prefer to see us generalize something like vhangup() but which works on a file descriptor, not just a TTY. That it is, it disconnects the file descriptor entirely from the hardware / file system so in the case of the tty, it can be used by other login session, and in the case of the file descriptor belonging to a file system, it stops the file system from being unmounted. The reason why I want to see something which actually does disconnect the file descriptor and replaces the CWD for processes is because the shutdown path is inherently a fairly violent procedure. It forcibly disconnects the file system from the block device, which means we can't do a clean unmount. If the block device has disappeared, we don't have a choice, and that's the only thing we can do. But if the USB thumb drive hasn't been yanked out yet, it would be nice to simply disconnect everything at the file descriptor level, then do a normal umount in all of the namespaces. This will allow the file system to be cleanly unmounted, which means it will work for those file systems which don't have journals. (The ext4/btrfs/xfs shutdown paths rely on the file system's journalling capability because it causes the block device to look like the system had been powered off, and so you have to replay the journal when file system is remounted. That's fine, but it's not going to work on file systems like VFAT, or worse, on NTFS, where we don't correctly support the native file system's logging capability, so power-fail or a forced shutdown may end up corrupting the file system enough that you have boot into Windows to run CHKDSK.EXE in order clean it up.) > I won't call this a security regression with user namespaces, or a mount > namespace specific issue as it appears any process that can open a file > has always been able to trigger this. That said it does appear this has > gone from a theoretical issue to an actuall problem so it is most > definitely time to address and fix this. The reason why this has started become a really painful issue is that mount namespaces are extremely painful to debug. You can't iterate over all of the possible namespaces, without having to iterate over every process and every thread's /proc/*/mounts. So if there are 10 mount namespaces, but 10,000 threads, that's 10,000 /proc/*/mounts file that you have to examine individually, just to *find* that needle in the haystack which is keeping the mount pinned. And there is at least one system daemon where when it starts, it opens a new mount namespace. Which means if you happen to have a USB thumbdrive mounted when you restart that system daemon, it is not at all obvious what you need to do to unmount the USB thumb drive. And yes, any process can open a file, but people at least knew how to find it by using lsof. So perhaps this could be solved partially by better tooling, but I would argue that having userspace do a brute force search through all of /proc/*/fd/* actually wasn't particularly clean, either. We've lived with it, true. But perhaps it's time to do something better? And in the case where the system daemon only accidentally captured the USB thumb drive by forking a mount namespace, it would be *definitely* nice if we could simply make the mount disappear, instead of what we have to today, which is kill the system daemon, unmount the USB thumb drive, and then restart the system daemon. (Once, that is, you have figured what is going on first. Often, you end up reporting an ext4 bug to the ext4 maintainer first, because you think it's an ext4 bug that ext4 is refusing to unmount the file system. After all, you are an experienced system administrator, so you *know* that *nothing* could *possibly* be keeping the file system busy. lsof said so. :-) > Ted, Aleksa would either of you be interested in generalizing what ext4, > f2fs, and xfs does now and working to put a good interface on it? I can > help especially with review but for the short term I am rather booked. Unfortunately, I have way too much travel coming up in the short term, so I probably won't have to take on a new project until at least mid-to-late-November at the earliest. Aleska, do you have time? I can consult on a design, but I have zero coding time for the next couple of weeks. - Ted -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>