On Tue, 14 Mar 2023 at 10:45, Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote: > > On Tue, 14 Mar 2023 at 03:26, Theodore Ts'o <tytso@xxxxxxx> wrote: > > > > On Mon, Mar 13, 2023 at 03:53:57PM +0100, Dmitry Vyukov wrote: > > > > Long-term we are moving ext4 in a direction where we can disallow block > > > > device modifications while the fs is mounted but we are not there yet. I've > > > > discussed some shorter-term solution to avoid such known problems with syzbot > > > > developers and what seems plausible would be a kconfig option to disallow > > > > writing to a block device when it is exclusively open by someone else. > > > > But so far I didn't get to trying whether this would reasonably work. Would > > > > you be interested in having a look into this? > > > > > > Does this affect only the loop device or also USB storage devices? > > > Say, if the USB device returns different contents during mount and on > > > subsequent reads? > > > > Modifying the block device while the file system is mounted is > > something that we have to allow for now because tune2fs uses it to > > modify the superblock. It has historically also been used (rarely) by > > people who know what they are doing to do surgery on a mounted file > > system. If we create a way for tune2fs to be able to update the > > superblock via some kind of ioctl, we could disallow modifying the > > block device while the file system is mounted. Of course, it would > > require waiting at least 5-6 years since sometimes people will update > > the kernel without updating userspace. We'd also need to check to > > make sure there aren't boot loader installer (such as grub-install) > > that depend on being able to modify the block device while the root > > file system is mounted, at least in some rare cases. > > > > The "how" to exclude mounted file systems is relatively easy. The > > kernel already knows when the file system is mounted, and it is > > already a supported feature that a userspace application that wants to > > be careful can open a block device with O_EXCL, and if it is in use by > > the kernel --- mounted by a file system, being used by dm-thin, et. al > > -- the open(2) system call will fail. From the open(2) man page. > > > > In general, the behavior of O_EXCL is undefined if it is used without > > O_CREAT. There is one exception: on Linux 2.6 and later, O_EXCL can > > be used without O_CREAT if pathname refers to a block device. If the > > block device is in use by the system (e.g., mounted), open() fails > > with the error EBUSY. > > > > Something which the syzbot could to do today is to simply use O_EXCL > > whenever trying to open a block device. This would avoid a class of > > syzbot false positives, since normally it requires root privileges > > and/or an experienced sysadmin to try to modify a block device while > > it is mounted and/or in use by LVM. > > > > - Ted > > > > P.S. Trivia note: Aproximately month after I started work at VA Linux > > Systems, a sysadmin intern which was given the root password to > > sourceforge.net, while trying to fix a disk-to-disk backup, ran > > mkfs.ext3 on /dev/hdXX, which was also being used as one-half of a > > RAID 0 setup on which open source code critical to the community > > (including, for example, OpenGL) was mounted and serving. The intern > > got about 50% the way through zeroing the inode table on /dev/hdXX > > before the file system noticed and threw an error, at which point > > wiser heads stopped what the intern was doing and tried to clean up > > the mess. Of course, there were no backups, since that was what the > > intern was trying to fix! > > > > There are a couple of things that we could learn from this incident. > > One was that giving the root password to an untrained intern not > > familiar with the setup on the serving system was... an unfortunate > > choice. Another was that adding the above-mentioned O_EXCL feature > > and teaching mkfs to use it was an obvious post-mortem action item to > > prevent this kind of problem in the future... > > I am struggling to make my mind re how to think about this case. > > "root" is very overloaded, but generally it does not mean "randomly > corrupting memory". Normally it gives access to system-wide changes > but with the same protection/consistency guarantees as for > unprivileged system calls. > > There are, of course, things like /dev/{mem,kmem}. But at the same > time there is also lockdown LSM and more distros today enable it. > > Btw, should this "prohibit writes to mounted device" be part of > LOCKDOWN_INTEGRITY? It looks like it gives capabilities similar to > /dev/{mem,kmem}. > > Disabling in testing something that's enabled in production is > generally not very useful. > So one option is to do nothing about this for now. > If it's a true recognized issue that is in the process of fixing, > syzbot will just show that it's still present. One of the goals of > syzbot is to show the current state of things in an objective manner. > If some kernel developers are aware of an issue, it does not mean that > most distros/users are aware. > > It makes sense to disable in testing things that are also recommended > to be disabled in production settings. > And LOCKDOWN_INTEGRITY may play such a role: we include this > restriction into LOCKDOWN_INTEGRITY and enable it on syzbot. > Though, unfortunately, we still don't enable it because it prohibits > access to debugfs, which is required for fuzzing. Need to ask lockdown > maintainers what they think about > LOCKDOWN_TEST_ONLY_DONT_ENABLE_IN_PROD_INTEGRITY which would whitelist > debugfs. Asked lockdown maintainers about adding this it lockdown and adding special mode for fuzzing: https://lore.kernel.org/all/CACT4Y+Z-9KCgKwkktvdJwNJZxxeA1f74zkP7KD6c=OmKXxXfjw@xxxxxxxxxxxxxx/