On Fri, Apr 05, 2024 at 01:19:35PM +0200, Christian Brauner wrote: > On Fri, Apr 05, 2024 at 08:51:35AM +0200, Christoph Hellwig wrote: > > On Thu, Apr 04, 2024 at 12:33:40PM +0300, Amir Goldstein wrote: > > > I don't follow what you are saying. > > > Which code is in non-starter violation? > > > kernfs for calling lookup_bdev() with internal of->mutex held? > > > > That is a huge problem, and has been causing endless annoying lockdep > > chains in the block layer for us. If we have some way to kill this > > the whole block layer would benefit. > > Why not just try and add a better resume api that forces resume to not > use a path argument neither for resume_file nor for writes to > /sys/power/resume. IOW, extend the protocol what can get written to > /sys/power/resume and then phase out the path argument. It'll take a > while but it's a possibly clean solution. In fact, just looking at this code with a naive set of eyes for a second: * There's early_lookup_bdev() which deals with PARTUUID, PARTLABEL, raw device number, and lookup based on /dev. No actual path lookup involved in that. * So the only interesting case is lookup_bdev() for /sys/power/suspend. That one takes arbitrary paths. But being realistic for a moment... How many people will specify a device path that's _not_ some variant of /dev/...? IOW, how many people will specify a device path that's not on devtmpfs or a symlink on devtmpfs? Probably almost no one. Containers come to mind ofc. But they can't mount devtmpfs so usually what they do is that they create a tmpfs mount at /dev and then bind-mount device nodes they need into there. But unprivileged containers cannot use suspend because that requires init_user_ns capabilities. And privileged containers that are allowed to hibernate and use custom paths seem extremly unlikely as well. So really, _naively_ it seems to me that one could factor out the /dev/* part of the device number parsing logic in early_lookup_bdev() and port resume_store() to use that first and only if that fails fall back to full lookup_bdev(). (Possibly combined with some sort of logging that the user should use /dev/... paths or at least a way to recognize that this arbitrary path stuff is actually used.) And citing from a chat with the hibernation maintainer in systemd: <brauner> So /sys/power/resume does systemd ever write anything other than a /dev/* path in to there? <maintainer> Hmm? You never do that? It only accepts devno. So that takes away one of the main users of this api. So I really suspect that arbitrary device path is unused in practice. Maybe I'm all wrong though.