On Sun, Apr 09, 2023 at 06:12:09PM -0400, Jeff Layton wrote: > On Sun, 2023-04-09 at 17:22 +0200, Christian Brauner wrote: > > On Fri, Apr 07, 2023 at 09:29:29AM -0400, Jeff Layton wrote: > > > > > > > > > > > > > > I would ditch the original proposal in favor of this 2-line patch shown here: > > > > > > > > > > > > > > https://lore.kernel.org/linux-integrity/a95f62ed-8b8a-38e5-e468-ecbde3b221af@xxxxxxxxxxxxx/T/#m3bd047c6e5c8200df1d273c0ad551c645dd43232 > > > > > > > > We should cool it with the quick hacks to fix things. :) > > > > > > > > > > Yeah. It might fix this specific testcase, but I think the way it uses > > > the i_version is "gameable" in other situations. Then again, I don't > > > know a lot about IMA in this regard. > > > > > > When is it expected to remeasure? If it's only expected to remeasure on > > > a close(), then that's one thing. That would be a weird design though. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Ok, I think I get it. IMA is trying to use the i_version from the > > > > > > overlayfs inode. > > > > > > > > > > > > I suspect that the real problem here is that IMA is just doing a bare > > > > > > inode_query_iversion. Really, we ought to make IMA call > > > > > > vfs_getattr_nosec (or something like it) to query the getattr routine in > > > > > > the upper layer. Then overlayfs could just propagate the results from > > > > > > the upper layer in its response. > > > > > > > > > > > > That sort of design may also eventually help IMA work properly with more > > > > > > exotic filesystems, like NFS or Ceph. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Maybe something like this? It builds for me but I haven't tested it. It > > > > > looks like overlayfs already should report the upper layer's i_version > > > > > in getattr, though I haven't tested that either: > > > > > > > > > > -----------------------8<--------------------------- > > > > > > > > > > [PATCH] IMA: use vfs_getattr_nosec to get the i_version > > > > > > > > > > IMA currently accesses the i_version out of the inode directly when it > > > > > does a measurement. This is fine for most simple filesystems, but can be > > > > > problematic with more complex setups (e.g. overlayfs). > > > > > > > > > > Make IMA instead call vfs_getattr_nosec to get this info. This allows > > > > > the filesystem to determine whether and how to report the i_version, and > > > > > should allow IMA to work properly with a broader class of filesystems in > > > > > the future. > > > > > > > > > > Reported-by: Stefan Berger <stefanb@xxxxxxxxxxxxx> > > > > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx> > > > > > --- > > > > > > > > So, I think we want both; we want the ovl_copyattr() and the > > > > vfs_getattr_nosec() change: > > > > > > > > (1) overlayfs should copy up the inode version in ovl_copyattr(). That > > > > is in line what we do with all other inode attributes. IOW, the > > > > overlayfs inode's i_version counter should aim to mirror the > > > > relevant layer's i_version counter. I wouldn't know why that > > > > shouldn't be the case. Asking the other way around there doesn't > > > > seem to be any use for overlayfs inodes to have an i_version that > > > > isn't just mirroring the relevant layer's i_version. > > > > > > It's less than ideal to do this IMO, particularly with an IS_I_VERSION > > > inode. > > > > > > You can't just copy up the value from the upper. You'll need to call > > > inode_query_iversion(upper_inode), which will flag the upper inode for a > > > logged i_version update on the next write. IOW, this could create some > > > (probably minor) metadata write amplification in the upper layer inode > > > with IS_I_VERSION inodes. > > > > I'm likely just missing context and am curious about this so bear with me. Why > > do we need to flag the upper inode for a logged i_version update? Any required > > i_version interactions should've already happened when overlayfs called into > > the upper layer. So all that's left to do is for overlayfs' to mirror the > > i_version value after the upper operation has returned. > > > ovl_copyattr() - which copies the inode attributes - is always called after the > > operation on the upper inode has finished. So the additional query seems odd at > > first glance. But there might well be a good reason for it. In my naive > > approach I would've thought that sm along the lines of: > > > > diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c > > index 923d66d131c1..8b089035b9b3 100644 > > --- a/fs/overlayfs/util.c > > +++ b/fs/overlayfs/util.c > > @@ -1119,4 +1119,5 @@ void ovl_copyattr(struct inode *inode) > > inode->i_mtime = realinode->i_mtime; > > inode->i_ctime = realinode->i_ctime; > > i_size_write(inode, i_size_read(realinode)); > > + inode_set_iversion_raw(inode, inode_peek_iversion_raw(realinode)); > > } > > > > would've been sufficient. > > > > Nope, because then you wouldn't get any updates to i_version after that > point. > > Note that with an IS_I_VERSION inode we only update the i_version when > there has been a query since the last update. What you're doing above is > circumventing that mechanism. You'll get the i_version at the time of of > the ovl_copyattr, but there won't be any updates of it after that point > because the QUERIED bit won't end up being set on realinode. I get all that. But my understanding had been that the i_version value at the time of ovl_copyattr() would be correct. Because when ovl_copyattr() is called the expected i_version change will have been done in the relevant layer includig raising the QUERIED bit. Since the layers are not allowed to be changed outside of the overlayfs mount any change to them can only originate from overlayfs which would necessarily call ovl_copyattr() again. IOW, overlayfs would by virtue of its implementation keep the i_version value in sync. Overlayfs wouldn't even raise SB_I_VERSION. It would indeed just be a cache of i_version of the relevant layer. > > > > Since overlayfs' does explicitly disallow changes to the upper and lower trees > > while overlayfs is mounted it seems intuitive that it should just mirror the > > relevant layer's i_version. > > > > > > If we don't do this, then we should probably document that i_version doesn't > > have a meaning yet for the inodes of stacking filesystems. > > > > Trying to cache the i_version is counterproductive, IMO, at least with > an IS_I_VERSION inode. > > The problem is that a query against the i_version has a side-effect. It > has to (atomically) mark the inode for an update on the next change. > > If you try to cache that value, you'll likely end up doing more queries > than you really need to (because you'll need to keep the cache up to > date) and you'll have an i_version that will necessarily lag the one in > the upper layer inode. > > The whole point of the change attribute is to get the value as it is at > this very moment so we can check whether there have been changes. A > laggy value is not terribly useful. > > Overlayfs should just always call the upper layer's ->getattr to get the > version. I wouldn't even bother copying it up in the first place. Doing > so is just encouraging someone to try use the value in the overlayfs > inode, when they really need to go through ->getattr and get the one > from the upper layer. That seems reasonable to me. I read this as an agreeing with my earlier suggestion to document that i_version doesn't have a meaning for the inodes of stacking filesystems and that we should spell out that vfs_getattr()/->getattr() needs to be used to interact with i_version. We need to explain to subsystems such as IMA somwhere what the correct way to query i_version agnostically is; independent of filesystem implementation details. Looking at IMA, it queries the i_version directly without checking whether it's an IS_I_VERSION() inode first. This might make a difference. Afaict, filesystems that persist i_version to disk automatically raise SB_I_VERSION. I would guess that it be considered a bug if a filesystem would persist i_version to disk and not raise SB_I_VERSION. If so IMA should probably be made to check for IS_I_VERSION() and it will probably get that by switching to vfs_getattr_nosec().