Hey Linus, /* Summary */ Last cycle we introduced support for mounting overlayfs on top of idmapped mounts. While looking into additional testing we realized that posix acls don't really work correctly with stacking filesystems on top of idmapped layers. We already knew what the fix were but it would require work that is more suitable for the merge window so we turned off posix acls for v5.19 for overlayfs on top of idmapped layers with Miklos routing my patch upstream in 72a8e05d4f66 ("Merge tag 'ovl-fixes-5.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs"). As announced this pull request here contains the work to support posix acls for overlayfs on top of idmapped layers. Since the posix acl fixes should use the new vfs{g,u}id_t work the associated branch has been merged in. (We sent a pull request for this earlier.) We've also pulled in Miklos pull request containing my patch to turn of posix acls on top of idmapped layers. This allowed us to avoid rebasing the branch which we didn't like because we were already at rc7 by then. Merging it in allows this branch to first fix posix acls and then to cleanly revert the temporary fix it brought in by commit 4a47c6385bb4 ("ovl: turn of SB_POSIXACL with idmapped layers temporarily"). So with this pull request merged no more follow up patches are needed. Note, the pull request contains the VFS work and the single overlayfs patch that is required. This was easier and clearer and doesn't require splitting this work over two different trees. We've announced this multiple times on the overlayfs mailing list and didn't here any objections. The last patch in this series adds Seth Forshee as a co-maintainer for idmapped mounts. Seth has been integral to all of this work and is also the main architect behind the filesystem idmapping work which ultimately made filesystems such as FUSE and overlayfs available in containers. He continues to be active in both development and review. I'm very happy he decided to help and he has my full trust. This increases the bus factor which is always great for work like this. I'm honestly very excited about this because I think in general we don't do great in the bringing on new maintainers department. ---------- What follows is an optional excursion into posix acls which provides a wider background. But the pull request message can be cut off here if not useful. Last we want to lose a few detailed words on ACL_{GROUP,USER} posix acls and how they violate type safety precautions we usually take and how this is aggravated by stacking filesystems. While the commit messages in this series contain detailed information we've spent some time going over posix acls (and fscaps) and it's worth pointing out a few details. Maybe we should add some of this as kernel documentation in the future. When we enter the kernel to get or set ACL_{GROUP,USER} posix acls the first thing the VFS does is to map the {g,u}ids stored as ACL_{GROUP,USER} posix acls into the caller's idmapping ("caller_idmapping"). This mapping or "translation" is performed by posix_acl_fix_xattr_from_user() for setxattr() and by posix_acl_fix_xattr_to_user() for getxattr(). The underlying helper called by posix_acl_fix_xattr_{from,to}_user() has a pecular implemation that is important to understand in order to understand ACL_{GROUP,USER} posix acls. The posix_acl_fix_xattr_{from,to}_user() helpers always use the initial_idmapping independent of whether or not the underlying filesystem that the posix acls were retrieved from has been mounted with a filesystem idmapping ("fs_idmapping") or not. Usually, when writing {g,u}ids passed in from userspace to the filesystem the kernel translates the {g,u}id_t into a k{g,u}id_t right at the userspace <-> VFS boundary. Then they are kept as k{g,u}id_t right until the moment where they need to be written to the backing store. For example, consider a regular chown() call. Right when entering the VFS the kernel will map the {g,u}id_t values passed in from userspace into a k{g,u}id_t and store it in an appropriate kernel internal struct iattr. Once the inode is changed and the change ultimately written back to disk the kernel will map the k{g,u}id_t back into a {g,u}id_t and write these values to disk taking the fs_idmapping into account. The kernel will take care to never conflate {g,u}id_t and k{g,u}id_t. It will especially not translate k{g,u}id_t back into {g,u}id_t unless at a relevant boundary. But for posix acls this isn't true. Going back to posix_acl_fix_xattr_from_user() the kernel does indeed first map the {g,u}id_t provided in ACL_{GROUP,USER} into k{g,u}id_t in the caller_idmapping. But then it immediately maps the just generated k{g,u}id_t back to a {g,u}id_t using the initial_idmapping. And strangly all that this will achieve is an identity translation that turns the k{g,u}id_t that were just generated back into {g,u}id_t with the samve value as the k{g,u}id_t that was just created. This is odd because the kernel should be adamant about translating between userspace {g,u}id_t and k{g,u}id_t only at either the userspace <-> VFS or the VFS <-> backing store boundary. The reason for this is simple and nasty. The VFS currently abuses the uapi struct posix_acl_xattr_entry to transport k{g,u}id_t from the VFS down to the filesystem and from the filesystem up to the VFS in the form of userspace {g,u}id_t. This has the consequence that the kernel intentionally breaks it's own type-safety requirements. The only reason that the kernel uses the initial_idmapping in posix_acl_fix_xattr_from_user() is to translate the just generated k{g,u}id_t back into {g,u}id_t of the same value so that the target value can be communicated to and from the filesystem without having to introduce an additional kernel internal struct that would contain k{g,u}id_t values and would perserve type safety. In other words, this is a classic hack. This has other consequences. For example, the callchains that posix acls are involved in look extremely weird because the VFS always uses the initial_idmapping when translating between the uapi struct posix_acl_xattr_entry and the kernel internal struct posix_acl even if the filesystem is mounted with an fs_idmapping. The cost of this is that it becomes very hard to understand what is going on especially when lacking proper documentation. It is unclear why the fs_idmapping is completely irrelevant even though the kernel translates between k{g,u}id_t and {g,u}id_t multiple times in a standard posix acl callchain. Just consider xfs for a second and roughly follow the conversion in getxattr(): xfs_get_acl(uid_t/gid_t -> kuid_t/kgid_t) posix_acl_from_xattr(kuid_t/kgid_t -> uid_t/gid_t) posix_acl_fix_xattr_to_user(uid_t/gid_t -> kuid_t/kgid_t -> uid_t/gid_t) when really it should be: xfs_get_acl(uid_t/gid_t -> kuid_t/kgid_t) posix_acl_fix_xattr_to_user(kuid_t/kgid_t -> uid_t/gid_t) This is a whole lot of subtlety and ripe for confusion and bugs and bound to confuse developers and it is certain to wreak havoc when bringing stacking filesystems like overlayfs into the mix. Overlayfs callchains are more complicated because certain functions will be hit twice in an overlayfs callchain: once for the overlayfs layer and once for the lower layer: sys_setxattr() -> vfs_setxattr() /* overlayfs */ -> vfs_setxattr() /* lower fs such as xfs, btrfs, FUSE */ The subtle type confusions are even more problematic now because they have to be reasoned about in scenarios where two filesystems are involved. That's not taking into account that the type confusion has to be kept around for an even longer time and passed into even more complicated and deep callchains. Now when an idmapped mount is in the mix the mnt_idmapping needs to be taken into account when ownership information is needed. This comes down to undoing the fs_idmapping and applying the mnt_idmapping. The idmapped mount helpers (e.g., make_vfs{g,u}id() and from_vfs{g,u}id()) express the dependency between the fs_idmapping and the mnt_idmapping and thus suggest even more strongly that the fs_idmapping must be used whenever they are called. The problem is that the type safety subversion in ACL_{GROUP,USER} now becomes even more problematic. First, because the developer needs to be aware that the fs_idmapping is completely irrelevant in posix_acl_fix_xattr_{from,to}_user() and thus is also irrelevant for idmapped mount helpers. That itself is already a feat because the developer needs to be aware of the callchain pecularities of posix acls. That problem would not exist if the type passed down into or up from the filesystem would stay a k{g,u}id_t or vfs{g,u}id_t in the newer callchains right until the moment where we transition between the VFS <-> userspace or backingstore <-> VFS boundary. Second, the mnt_idmapping is a property of the VFS that is tied to a mount created for a specific filesystem. So the mnt_idmapping cannot be taken into account for ACL_{GROUP,USER} posix acls in some early place of a stacking callchain that overlayfs generates. Since they are a property of each layer in a filesystem stack generated by overlayfs they need to be taken into account at specific places in the callchain. But before this branch we did: sys_setxattr() -> do_setxattr() -> setxattr_convert() -> posix_acl_fix_xattr_from_user() /* translation step */ -> vfs_setxattr() /* overlayfs */ -> vfs_setxattr() /* lower fs such as xfs, btrfs, FUSE */ We can see that the translation happens outside vfs_setxattr() in do_setxattr() including taking the mnt_idmapping into account. But that means the mnt_idmapping of the lower layer cannot be taken into account as the translation happens only in the top layer of the filesystem stack. In other words, the translation happens for the overlayfs filesystem but obviously the overlayfs filesystem cannot be on an idmapped mount as we don't support that and probably never will. So the translation step involving the mnt_idmapping needs to happen in vfs_{g,s}etxattr() so that the VFS can take the mnt_idmapping into account for each filesystem it is called upon. All of this should be addressed cleaning up the posix acl codepaths and we plan to do so. A final note on fs_idmappings. A little while ago we received a few questions about the role of the fs_idmapping for posix acls as we were fixing the overlayfs posix acl handling on top of idmapped layers where we explained that the mnt_idmapping needs to be handled in vfs_{g,s}etxattr(). The question was why posix_acl_fix_xattr_{from,to}_user() couldn't just be moved into vfs_{g,s}etxattr() and simply taking the fs_idmapping into account in these helpers. This questions shows how dangerous the type subversion is. The fact is that the fs_idmapping is entirely irrelevant for posix_acl_fix_xattr_{from,to}_user() as we've seen. The reason why it seems relevant is that we convert between a k{g,u}id_t and {g,u}id_t but that conversion is entirely based on a hack to pass k{g,u}id_t in a uapi struct posix_acl_xattr_entry as {g,u}id_t as explained above. For a stacking filesystem the fs_idmapping of the top layer needs to be transparent with respect to the ownership of things such as inode->i_{g,u}id, ACL_{GROUP,USER} and posix acls and filesystem capabilities. The general idea is that right at the userspace <-> VFS boundary any {g,u}id_t passed in from userspace is immediately turned into a k{g,u}id_t. The k{g,u}id_t is kept and only translated into a {g,u}id_t at the VFS <-> backing store boundary, i.e., when a write to a backing store occurs. The same applies for the backing store <-> VFS boundary where the {g,u}id_t stored in the backing store needs to be turned into a k{g,u}id_t and only ever converted back into {g,u}id_t at the VFS <-> userspace boundary. A great example to illustrate the point is a simple chown() operation. In chown_common() the passed in {g,u}id_t is turned into k{g,u}id_t (vfs{g,u}id_t on newer kernels) in the caller_idmapping. The notify_change() helper is then called to determine whether the caller is allowed to perform the ownership change. Part of that is validating that the requested k{g,u}id_t has a mapping in the fs_idmapping. If it does we know we can translate the k{g,u}id_t back into {g,u}id_t when writing to the backing store later on. But nowhere do we subvert the type system by converting that k{g,u}id_t back into {g,u}id_t let alone storing it anwywhere to be operated on later. For overlayfs the chown() operation is similar just that permission checking is performed twice as notify_change() is called once on the overlayfs layer and once on the lower layer. But note that the requested k{g,u}id_t never takes the fs_idmapping or caller_idmapping of the mounter of the overlayfs filesystem into account. Similarly, when we report ownership information of a file the lower layer will fill in struct kstat with the ownership information based on the fs_idmapping and mnt_idmapping of the relevant lower layer. But the fs_idmapping and caller_idmapping of the overlayfs mounter are entirely irrelevant for this. This is crucial to notice as taking the fs_idmapping or caller_idmapping for overlayfs into account would mean that overlayfs changes the ownership of the underlying filesystem objects when mounted inside of a user namespace. That would be a pretty nasty security issue. But this is what moving posix_acl_fix_xattr_{from,to}_user() into vfs_{g,s}etxattr() would amount to. If these helpers were relocated and would be changed to take the fs_idmapping into account then posix acl ownership would be changed by the overlayfs layer causing wrong values for ACL_{GROUP,USER} to be stored in the backing store of the underlying filesystem. Fwiw, fscaps currently work that way by performing the full translation twice for each layer taking the caller_idmapping and the fs_idmapping into account twice. This only works correctly because overlayfs currently guarantees that the caller_idmapping used to map up the fscaps from the lower fs_idmapping and the fs_idmapping of overlayfs itself are identical. So the two translations invert each other. But that's only guaranteed because of the override_creds() call. If that were turned off though (patchsets like this have been floated around) this would all break. Independent of this it is opaque and hard to understand. Seth is currently working on a patchset to fix this. /* Testing */ All patches are based on v5.19-rc5 and have been sitting in linux-next. No build failures or warnings were observed and fstests and selftests have seen no regressions. /* Conflicts */ At the time of creating this PR no merge conflicts were reported from linux-next and no merge conflicts showed up doing a test-merge with current mainline. The following changes since commit 88084a3df1672e131ddc1b4e39eeacfd39864acf: Linux 5.19-rc5 (2022-07-03 15:39:28 -0700) are available in the Git repository at: git@xxxxxxxxxxxxxxxxxxx:pub/scm/linux/kernel/git/brauner/linux tags/fs.idmapped.overlay.acl.v5.20 for you to fetch changes up to ba40a57ff08bf606135866bfe5fddc572089ac16: Add Seth Forshee as co-maintainer for idmapped mounts (2022-07-27 10:13:44 +0200) Please consider pulling these changes from the signed fs.idmapped.overlay.acl.v5.20 tag. Thanks! Christian ---------------------------------------------------------------- fs.idmapped.overlay.acl.v5.20 ---------------------------------------------------------------- Christian Brauner (21): mnt_idmapping: add vfs{g,u}id_t fs: add two type safe mapping helpers fs: use mount types in iattr fs: introduce tiny iattr ownership update helpers fs: port to iattr ownership update helpers quota: port quota helpers mount ids security: pass down mount idmapping to setattr hook attr: port attribute changes to new types attr: fix kernel doc fs: port HAS_UNMAPPED_ID() to vfs{g,u}id_t mnt_idmapping: use new helpers in mapped_fs{g,u}id() mnt_idmapping: align kernel doc and parameter order ovl: turn of SB_POSIXACL with idmapped layers temporarily Merge tag 'ovl-fixes-5.19-rc7' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into fs.idmapped.overlay.acl mnt_idmapping: add vfs[g,u]id_into_k[g,u]id() acl: move idmapped mount fixup into vfs_{g,s}etxattr() acl: port to vfs{g,u}id_t acl: make posix_acl_clone() available to overlayfs ovl: handle idmappings in ovl_get_acl() Revert "ovl: turn of SB_POSIXACL with idmapped layers temporarily" Add Seth Forshee as co-maintainer for idmapped mounts Seth Forshee (1): mnt_idmapping: return false when comparing two invalid ids MAINTAINERS | 1 + fs/attr.c | 74 +++++---- fs/ext2/inode.c | 8 +- fs/ext4/inode.c | 14 +- fs/f2fs/file.c | 22 +-- fs/f2fs/recovery.c | 10 +- fs/fat/file.c | 9 +- fs/jfs/file.c | 4 +- fs/ksmbd/vfs.c | 2 +- fs/ksmbd/vfs.h | 2 +- fs/ocfs2/file.c | 2 +- fs/open.c | 60 ++++++-- fs/overlayfs/copy_up.c | 4 +- fs/overlayfs/inode.c | 87 ++++++++++- fs/overlayfs/overlayfs.h | 15 +- fs/posix_acl.c | 168 ++++++++++++++------- fs/quota/dquot.c | 17 ++- fs/reiserfs/inode.c | 4 +- fs/xattr.c | 25 +++- fs/xfs/xfs_iops.c | 14 +- fs/zonefs/super.c | 2 +- include/linux/evm.h | 6 +- include/linux/fs.h | 140 ++++++++++++++++- include/linux/mnt_idmapping.h | 305 ++++++++++++++++++++++++++++++++++---- include/linux/posix_acl.h | 1 + include/linux/posix_acl_xattr.h | 34 +++-- include/linux/quotaops.h | 15 +- include/linux/security.h | 8 +- include/linux/xattr.h | 2 +- security/integrity/evm/evm_main.c | 12 +- security/security.c | 5 +- 31 files changed, 815 insertions(+), 257 deletions(-)