On Thu, Apr 04, 2024 at 12:33:40PM +0300, Amir Goldstein wrote: > On Thu, Apr 4, 2024 at 11:21 AM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote: > > > > On Thu, Apr 04, 2024 at 09:11:22AM +0100, Al Viro wrote: > > > On Thu, Apr 04, 2024 at 09:54:35AM +0300, Amir Goldstein wrote: > > > > > > > > In the lockdep dependency chain, overlayfs inode lock is taken > > > > before kernfs internal of->mutex, where kernfs (sysfs) is the lower > > > > layer of overlayfs, which is sane. > > > > > > > > With /sys/power/resume (and probably other files), sysfs also > > > > behaves as a stacking filesystem, calling vfs helpers, such as > > > > lookup_bdev() -> kern_path(), which is a behavior of a stacked > > > > filesystem, without all the precautions that comes with behaving > > > > as a stacked filesystem. > > > > > > No. This is far worse than anything stacked filesystems do - it's > > > an arbitrary pathname resolution while holding a lock. > > > It's not local. Just about anything (including automounts, etc.) > > > can be happening there and it pushes the lock in question outside > > > of *ALL* pathwalk-related locks. Pathname doesn't have to > > > resolve to anything on overlayfs - it can just go through > > > a symlink on it, or walk into it and traverse a bunch of .. > > > afterwards, etc. > > > > > > Don't confuse that with stacking - it's not even close. > > > You can't use that anywhere near overlayfs layers. > > > > > > Maybe isolate it into a separate filesystem, to be automounted > > > on /sys/power. And make anyone playing with overlayfs with > > > sysfs as a layer mount the damn thing on top of power/ in your > > > overlayfs. But using that thing as a part of layer is > > > a non-starter. > > I don't follow what you are saying. > Which code is in non-starter violation? > kernfs for calling lookup_bdev() with internal of->mutex held? > Overlayfs for allowing sysfs as a lower layer and calling > vfs_llseek(lower_sysfs_file,...) during copy up while ovl inode is held > for legit reasons (e.g. from ovl_rename())? > > > > > Incidentally, why do you need to lock overlayfs inode to call > > vfs_llseek() on the underlying file? It might (or might not) > > need to lock the underlying file (for things like ->i_size, > > etc.), but that will be done by ->llseek() instance and it > > would deal with the inode in the layer, not overlayfs one. > > We do not (anymore) lock ovl inode in ovl_llseek(), see: > b1f9d3858f72 ovl: use ovl_inode_lock in ovl_llseek() > but ovl inode is held in operations (e.g. ovl_rename) > which trigger copy up and call vfs_llseek() on the lower file. > > > > > Similar question applies to ovl_write_iter() - why do you > > need to hold the overlayfs inode locked during the call of > > backing_file_write_iter()? > > > > Not sure. This question I need to defer to Miklos. > I see in several places the pattern: > inode_lock(inode); > /* Update mode */ > ovl_copyattr(inode); > ret = file_remove_privs(file); > ... > /* Update size */ > ovl_file_modified(file); > ... > inode_unlock(inode); > > so it could be related to atomic remove privs and update mtime, > but possibly we could convert all of those inode_lock() to > ovl_inode_lock() (i.e. internal lock below vfs inode lock). > > [...] > > Consider the scenario when unlink() is called on that sucker > > during the write() that triggers that pathwalk. We have > > > > unlink: blocked on overlayfs inode of file, while holding the parent directory. > > write: holding the overlayfs inode of file and trying to resolve a pathname > > that contains .../power/suspend_stats/../../...; blocked on attempt to lock > > parent so we could do a lookup in it. > > This specifically cannot happen because sysfs is not allowed as an > upper layer only as a lower layer, so overlayfs itself will not be writing to > /sys/power/resume. I don't understand that part. If overlayfs uses /sys/power/ as a lower layer it can open and write to /sys/power/resume, no? Honestly, why don't you just block /sys/power from appearing in any layer in overlayfs? This seems like such a niche use-case that it's so unlikely that this will be used that I would just try and kill it. If you do it like Al suggested and switch it to an automount you get that for free. But I guess you can also just block it without that. (Frankly, I find it weird that sysfs is allowed as a layer in any case. I completely forgot about this. Imho, both procfs and sysfs should not be usable as a lower layer - procfs is, I know - and then only select parts should be like /sys/fs/cgroup or sm where I can see the container people somehow using this to mess with the cgroup tree or something.) > > > > > No llseek involved anywhere, kernfs of->mutex held, but not contended, > > deadlock purely on ->i_rwsem of overlayfs inodes. > > > > Holding overlayfs inode locked during the call of lookup_bdev() is really > > no-go. > > Yes. I see that, but how can this be resolved? > > If the specific file /sys/power/resume will not have FMODE_LSEEK, > then ovl_copy_up_file() will not try to skip_hole on it at all and the > llseek lock dependency will be averted. > > The question is whether opt-out of FMODE_LSEEK for /sys/power/resume > will break anything or if we should mark seq files in another way so that > overlayfs would not try to seek_hole on any of them categorically. > > Thanks, > Amir.