Hi Christian, This is another step in the multi step plan to reduce the vulnerability of filesystems code to overlayfs backing files whose f_path and f_inode are not on the same sb. History: -------- When overlayfs was first merged, overlayfs files of regular files and directories, the ones that are installed in file table, had a "fake" path, namely, f_path is the overlayfs path and f_inode is the "real" inode on the underlying filesystem. At that time, all filesystem syscalls, generic vfs code and any filesystem code (among the fs that are supported as overlayfs underlying layers) had to be aware of the possibility of a file with "fake" path and had to use the helper file_dentry() to get the "real" dentry and the helper locks_inode() to get to the overlayfs inode. At that time, there was no way to get the "real" path. This situation was fragile because many filesystem developers were not aware of having to deal with file with "fake" path. Admittedly, it wasn't very well documented either. Backing files: -------------- The first big step to reduce the impact of files with fake path was in v4.19 that introduced d1d04ef8572b ("ovl: stack file ops"). Since v4.19, f_inode of the overlayfs files that are installed in file table are overlayfs inodes, so those files are no longer odd. As a result of this change, a lot of syscalls code, which take fd as an argument, is no longer exposed to files with "fake" path and the helper file_locks() was no longer needed. However, since v4.19, overlayfs uses internal backing files with "fake" path, so generic vfs code that overlayfs calls with the backing files and filesystem code still needs to be aware of files with "fake" path. The one place where the internal backing files are "exposed" to users is when users mmap overlayfs files, the backing file (and not the overlayfs file) is stored in vm_file. The reason that overlayfs backing files use a "fake" path is so that /proc/<pid>/maps will disaply the overlayfs file path to users. Recently: --------- In v6.5, we took another small step by introducing of the backing_file container and the file_real_path() helper. This change allowed vfs and filesystem code to get the "real" path of an overlayfs backing file. With this change, we were able to make fsnotify work correctly and report events on the "real" filesystem objects that were accessed via overlayfs. This method works fine, but it still leaves the vfs vulnerable to new code that is not aware of files with fake path. A recent example is commit db1d1e8b9867 ("IMA: use vfs_getattr_nosec to get the i_version"). This commit uses direct referencing to f_path in IMA code that otherwise uses file_inode() and file_dentry() to reference the filesystem objects that it is measuring. Coming up: ---------- This patch set actually implements my initial proposal [1] that was proposed for v6.5 - instead of having filesystem code opt-in to get the "real" path, have generic code opt-in for the "fake" path in the few places that it is needed. It is possible that I missed a few places that need opt-in for "fake" path, but in most likelihood, a missed opt-in to "fake" path would just result in printing the "real" path to user and if %pD is used for printing, that will probably make no difference anyway. Is it far more likely that new filesystems code that does not use the file_dentry() and file_real_path() helpers will end up causing crashes or averting LSM/audit rules if we keep the "fake" path exposed by default. Future: ------- This change already makes file_dentry() moot, but for now we did not change this helper just added a WARN_ON() in ovl_d_real() to catch if we have made any wrong assumptions. After the dust settles on this change, we can make file_dentry() a plain accessor and we can drop the inode argument to ->d_real(). Testing: -------- As usual, I ran fstests with overlayfs over xfs and all the LTS tests that test fsnotify + overlayfs. I have limited testing ability for audit/IMA/LSM modules, so we will have to rely on other people and bots testing of linux-next. Syzbot has been known to be very obsessive about reporting bugs of overlayfs + IMA. Merging: -------- The branch that I tested [2] is based on both the stable vfs.mount.write branch and vfs.misc of the moment. If you agree to stage these patches in vfs tree, they would need to be based on both vfs.mount.write and the file_free_rcu patches. So perhaps split the file_free_rcu patches out of vfs.misc and into a vfs.file topic branch that is based on the stable vfs.mount.write branch and add my patches on top? I don't need vfs.file to be stable, I understand that the file_free_rcu() patches may still change, but since I would like to test overlayfs-next together with the "fake" path patches, it would be nice if the vfs.file branch would be more stable than vfs.misc usually is. Thanks, Amir. Changes since v1: - backing_file (fake_file rebranded) container already merged - fsnotify support for backing files already merged - Take mnt_writers refcount on the "real" path - Rename file_fake_path() => file_user_path() - No need to use file_user_path() for exe_file path access - No need to use file_user_path() in audit/LSM code - Added file_user_path() in some tracing/debugging code [1] https://lore.kernel.org/r/20230609073239.957184-1-amir73il@xxxxxxxxx/ [2] https://github.com/amir73il/linux/commits/ovl_fake_path Amir Goldstein (3): fs: get mnt_writers count for an open backing file's real path fs: create helper file_user_path() for user displayed mapped file path fs: store real path instead of fake path in backing file f_path arch/arc/kernel/troubleshoot.c | 3 +- fs/file_table.c | 12 ++++---- fs/internal.h | 15 ++++++++-- fs/open.c | 55 +++++++++++++++++++++++----------- fs/overlayfs/super.c | 11 ++++++- fs/proc/base.c | 2 +- fs/proc/nommu.c | 2 +- fs/proc/task_mmu.c | 4 +-- fs/proc/task_nommu.c | 2 +- include/linux/fs.h | 27 ++++++++++------- include/linux/fsnotify.h | 3 +- kernel/trace/trace_output.c | 2 +- 12 files changed, 92 insertions(+), 46 deletions(-) -- 2.34.1