On Thu, Apr 4, 2024 at 11:21 AM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote: > > On Thu, Apr 04, 2024 at 09:11:22AM +0100, Al Viro wrote: > > On Thu, Apr 04, 2024 at 09:54:35AM +0300, Amir Goldstein wrote: > > > > > > In the lockdep dependency chain, overlayfs inode lock is taken > > > before kernfs internal of->mutex, where kernfs (sysfs) is the lower > > > layer of overlayfs, which is sane. > > > > > > With /sys/power/resume (and probably other files), sysfs also > > > behaves as a stacking filesystem, calling vfs helpers, such as > > > lookup_bdev() -> kern_path(), which is a behavior of a stacked > > > filesystem, without all the precautions that comes with behaving > > > as a stacked filesystem. > > > > No. This is far worse than anything stacked filesystems do - it's > > an arbitrary pathname resolution while holding a lock. > > It's not local. Just about anything (including automounts, etc.) > > can be happening there and it pushes the lock in question outside > > of *ALL* pathwalk-related locks. Pathname doesn't have to > > resolve to anything on overlayfs - it can just go through > > a symlink on it, or walk into it and traverse a bunch of .. > > afterwards, etc. > > > > Don't confuse that with stacking - it's not even close. > > You can't use that anywhere near overlayfs layers. > > > > Maybe isolate it into a separate filesystem, to be automounted > > on /sys/power. And make anyone playing with overlayfs with > > sysfs as a layer mount the damn thing on top of power/ in your > > overlayfs. But using that thing as a part of layer is > > a non-starter. I don't follow what you are saying. Which code is in non-starter violation? kernfs for calling lookup_bdev() with internal of->mutex held? Overlayfs for allowing sysfs as a lower layer and calling vfs_llseek(lower_sysfs_file,...) during copy up while ovl inode is held for legit reasons (e.g. from ovl_rename())? > > Incidentally, why do you need to lock overlayfs inode to call > vfs_llseek() on the underlying file? It might (or might not) > need to lock the underlying file (for things like ->i_size, > etc.), but that will be done by ->llseek() instance and it > would deal with the inode in the layer, not overlayfs one. We do not (anymore) lock ovl inode in ovl_llseek(), see: b1f9d3858f72 ovl: use ovl_inode_lock in ovl_llseek() but ovl inode is held in operations (e.g. ovl_rename) which trigger copy up and call vfs_llseek() on the lower file. > > Similar question applies to ovl_write_iter() - why do you > need to hold the overlayfs inode locked during the call of > backing_file_write_iter()? > Not sure. This question I need to defer to Miklos. I see in several places the pattern: inode_lock(inode); /* Update mode */ ovl_copyattr(inode); ret = file_remove_privs(file); ... /* Update size */ ovl_file_modified(file); ... inode_unlock(inode); so it could be related to atomic remove privs and update mtime, but possibly we could convert all of those inode_lock() to ovl_inode_lock() (i.e. internal lock below vfs inode lock). [...] > Consider the scenario when unlink() is called on that sucker > during the write() that triggers that pathwalk. We have > > unlink: blocked on overlayfs inode of file, while holding the parent directory. > write: holding the overlayfs inode of file and trying to resolve a pathname > that contains .../power/suspend_stats/../../...; blocked on attempt to lock > parent so we could do a lookup in it. This specifically cannot happen because sysfs is not allowed as an upper layer only as a lower layer, so overlayfs itself will not be writing to /sys/power/resume. > > No llseek involved anywhere, kernfs of->mutex held, but not contended, > deadlock purely on ->i_rwsem of overlayfs inodes. > > Holding overlayfs inode locked during the call of lookup_bdev() is really > no-go. Yes. I see that, but how can this be resolved? If the specific file /sys/power/resume will not have FMODE_LSEEK, then ovl_copy_up_file() will not try to skip_hole on it at all and the llseek lock dependency will be averted. The question is whether opt-out of FMODE_LSEEK for /sys/power/resume will break anything or if we should mark seq files in another way so that overlayfs would not try to seek_hole on any of them categorically. Thanks, Amir.