On Thu, Feb 29, 2024 at 10:46:48AM +0100, Christian Brauner wrote: > On Wed, Feb 28, 2024 at 11:27:05PM -0800, Linus Torvalds wrote: > > On Wed, 28 Feb 2024 at 23:20, Linus Torvalds > > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > > > > > > > > - take the lock exclusively if O_APPEND or if it *looks* like you > > > might extend the file size. > > > > > > - otherwise, take the shared lock, and THEN RE-CHECK. The file size > > > might have changed, so now you need to double-check that you're really > > > not going to extend the size of the file, and if you are, you need to > > > go back and take the inode lock exclusively after all. > > > > Same goes for the suid/sgid checks. You need to take the inode lock > > shared to even have the i_mode be stable, and at that point you might > > decide "oh, I need to clear suid/sgid after all" and have to go drop > > the lock and retake it for exclusive access after all. > > I agree. And this is how we've done it in xfs as well. The inode lock is > taken shared, then check for IS_NOSEC(inode) and if that's true keep the > shared lock, otherwise upgrade to an exclusive lock. Note that this > whole check also checks filesystem capability xattrs. > > I think you'd also get potential security issues or at least very subtle > behavior. Someone could make a binary setuid and a concurrent write > happens. chmod is taking the exclusive lock and there's a concurrent > write going on changing the contents to something unsafe without being > forced to remove the set*id bits. > > Similar for {g,u}id changes. Someone starts a chown() taking inode lock > while someone issues a concurrent write. The chown changes the group > ownership so that a writer would be forced to drop the sgid bit. But the > writer proceeds keeping that bit. That's the kind of race that's not actually a race - if the syscalls being invoked at the same time, it's fine if the result is as if the write happened first, then the chown. For there to be an actual race one of two things would have to be the case: - the chown/chmod path would have to be dependent on the on the data the write path is changing, or - there'd have to be a problem wnith the write path seeing inconsistent state mid modification But I can't claim 100% the second case isn't a factor, I don't know the security side of things as well, and the XFS way sounds nice and clean.