On Fri, Apr 5, 2024 at 1:47 PM Christian Brauner <brauner@xxxxxxxxxx> wrote: > > On Thu, Apr 04, 2024 at 12:33:40PM +0300, Amir Goldstein wrote: > > On Thu, Apr 4, 2024 at 11:21 AM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote: > > > > > > On Thu, Apr 04, 2024 at 09:11:22AM +0100, Al Viro wrote: > > > > On Thu, Apr 04, 2024 at 09:54:35AM +0300, Amir Goldstein wrote: > > > > > > > > > > In the lockdep dependency chain, overlayfs inode lock is taken > > > > > before kernfs internal of->mutex, where kernfs (sysfs) is the lower > > > > > layer of overlayfs, which is sane. > > > > > > > > > > With /sys/power/resume (and probably other files), sysfs also > > > > > behaves as a stacking filesystem, calling vfs helpers, such as > > > > > lookup_bdev() -> kern_path(), which is a behavior of a stacked > > > > > filesystem, without all the precautions that comes with behaving > > > > > as a stacked filesystem. > > > > > > > > No. This is far worse than anything stacked filesystems do - it's > > > > an arbitrary pathname resolution while holding a lock. > > > > It's not local. Just about anything (including automounts, etc.) > > > > can be happening there and it pushes the lock in question outside > > > > of *ALL* pathwalk-related locks. Pathname doesn't have to > > > > resolve to anything on overlayfs - it can just go through > > > > a symlink on it, or walk into it and traverse a bunch of .. > > > > afterwards, etc. > > > > > > > > Don't confuse that with stacking - it's not even close. > > > > You can't use that anywhere near overlayfs layers. > > > > > > > > Maybe isolate it into a separate filesystem, to be automounted > > > > on /sys/power. And make anyone playing with overlayfs with > > > > sysfs as a layer mount the damn thing on top of power/ in your > > > > overlayfs. But using that thing as a part of layer is > > > > a non-starter. > > > > I don't follow what you are saying. > > Which code is in non-starter violation? > > kernfs for calling lookup_bdev() with internal of->mutex held? > > Overlayfs for allowing sysfs as a lower layer and calling > > vfs_llseek(lower_sysfs_file,...) during copy up while ovl inode is held > > for legit reasons (e.g. from ovl_rename())? > > > > > > > > Incidentally, why do you need to lock overlayfs inode to call > > > vfs_llseek() on the underlying file? It might (or might not) > > > need to lock the underlying file (for things like ->i_size, > > > etc.), but that will be done by ->llseek() instance and it > > > would deal with the inode in the layer, not overlayfs one. > > > > We do not (anymore) lock ovl inode in ovl_llseek(), see: > > b1f9d3858f72 ovl: use ovl_inode_lock in ovl_llseek() > > but ovl inode is held in operations (e.g. ovl_rename) > > which trigger copy up and call vfs_llseek() on the lower file. > > > > > > > > Similar question applies to ovl_write_iter() - why do you > > > need to hold the overlayfs inode locked during the call of > > > backing_file_write_iter()? > > > > > > > Not sure. This question I need to defer to Miklos. > > I see in several places the pattern: > > inode_lock(inode); > > /* Update mode */ > > ovl_copyattr(inode); > > ret = file_remove_privs(file); > > ... > > /* Update size */ > > ovl_file_modified(file); > > ... > > inode_unlock(inode); > > > > so it could be related to atomic remove privs and update mtime, > > but possibly we could convert all of those inode_lock() to > > ovl_inode_lock() (i.e. internal lock below vfs inode lock). > > > > [...] > > > Consider the scenario when unlink() is called on that sucker > > > during the write() that triggers that pathwalk. We have > > > > > > unlink: blocked on overlayfs inode of file, while holding the parent directory. > > > write: holding the overlayfs inode of file and trying to resolve a pathname > > > that contains .../power/suspend_stats/../../...; blocked on attempt to lock > > > parent so we could do a lookup in it. > > > > This specifically cannot happen because sysfs is not allowed as an > > upper layer only as a lower layer, so overlayfs itself will not be writing to > > /sys/power/resume. > > I don't understand that part. If overlayfs uses /sys/power/ as a lower > layer it can open and write to /sys/power/resume, no? > > Honestly, why don't you just block /sys/power from appearing in any > layer in overlayfs? This seems like such a niche use-case that it's so > unlikely that this will be used that I would just try and kill it. I do not want to special case /sys/power in overlayfs. > > If you do it like Al suggested and switch it to an automount you get Not important enough IMO to make this change. > that for free. But I guess you can also just block it without that. > > (Frankly, I find it weird that sysfs is allowed as a layer in any case. I > completely forgot about this. Imho, both procfs and sysfs should not be > usable as a lower layer - procfs is, I know - and then only select parts > should be like /sys/fs/cgroup or sm where I can see the container people > somehow using this to mess with the cgroup tree or something.) > I do not know if using sysfs as a lower layer is an important use case, but I have a feeling that people already may do it, so I cannot regress it without a good reason. Al's suggestion to annotate writable kernfs files as a different class from readonly kernfs files seems fine by me to silence lockdep false positive. I will try to feed this solution to syzbot. Thanks, Amir.