On Sat, Dec 19, 2020 at 8:53 AM Ian Kent <raven@xxxxxxxxxx> wrote: > > On Fri, 2020-12-18 at 21:20 +0800, Fox Chen wrote: > > On Fri, Dec 18, 2020 at 7:21 PM Ian Kent <raven@xxxxxxxxxx> wrote: > > > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote: > > > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent <raven@xxxxxxxxxx> > > > > wrote: > > > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote: > > > > > > Hello, > > > > > > > > > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote: > > > > > > > > What could be done is to make the kernfs node attr_mutex > > > > > > > > a pointer and dynamically allocate it but even that is > > > > > > > > too > > > > > > > > costly a size addition to the kernfs node structure as > > > > > > > > Tejun has said. > > > > > > > > > > > > > > I guess the question to ask is, is there really a need to > > > > > > > call kernfs_refresh_inode() from functions that are usually > > > > > > > reading/checking functions. > > > > > > > > > > > > > > Would it be sufficient to refresh the inode in the > > > > > > > write/set > > > > > > > operations in (if there's any) places where things like > > > > > > > setattr_copy() is not already called? > > > > > > > > > > > > > > Perhaps GKH or Tejun could comment on this? > > > > > > > > > > > > My memory is a bit hazy but invalidations on reads is how > > > > > > sysfs > > > > > > namespace is > > > > > > implemented, so I don't think there's an easy around that. > > > > > > The > > > > > > only > > > > > > thing I > > > > > > can think of is embedding the lock into attrs and doing xchg > > > > > > dance > > > > > > when > > > > > > attaching it. > > > > > > > > > > Sounds like your saying it would be ok to add a lock to the > > > > > attrs structure, am I correct? > > > > > > > > > > Assuming it is then, to keep things simple, use two locks. > > > > > > > > > > One global lock for the allocation and an attrs lock for all > > > > > the > > > > > attrs field updates including the kernfs_refresh_inode() > > > > > update. > > > > > > > > > > The critical section for the global lock could be reduced and > > > > > it > > > > > changed to a spin lock. > > > > > > > > > > In __kernfs_iattrs() we would have something like: > > > > > > > > > > take the allocation lock > > > > > do the allocated checks > > > > > assign if existing attrs > > > > > release the allocation lock > > > > > return existing if found > > > > > othewise > > > > > release the allocation lock > > > > > > > > > > allocate and initialize attrs > > > > > > > > > > take the allocation lock > > > > > check if someone beat us to it > > > > > free and grab exiting attrs > > > > > otherwise > > > > > assign the new attrs > > > > > release the allocation lock > > > > > return attrs > > > > > > > > > > Add a spinlock to the attrs struct and use it everywhere for > > > > > field updates. > > > > > > > > > > Am I on the right track or can you see problems with this? > > > > > > > > > > Ian > > > > > > > > > > > > > umm, we update the inode in kernfs_refresh_inode, right?? So I > > > > guess > > > > the problem is how can we protect the inode when > > > > kernfs_refresh_inode > > > > is called, not the attrs?? > > > > > > But the attrs (which is what's copied from) were protected by the > > > mutex lock (IIUC) so dealing with the inode attributes implies > > > dealing with the kernfs node attrs too. > > > > > > For example in kernfs_iop_setattr() the call to setattr_copy() > > > copies > > > the node attrs to the inode under the same mutex lock. So, if a > > > read > > > lock is used the copy in kernfs_refresh_inode() is no longer > > > protected, > > > it needs to be protected in a different way. > > > > > > > Ok, I'm actually wondering why the VFS holds exclusive i_rwsem for > > .setattr but > > no lock for .getattr (misdocumented?? sometimes they have as you've > > found out)? > > What does it protect against?? Because .permission does a similar > > thing > > here -- updating inode attributes, the goal is to provide the same > > protection level > > for .permission as for .setattr, am I right??? > > As far as the documentation goes that's probably my misunderstanding > of it. > > It does happen that the VFS makes assumptions about how call backs > are meant to be used. > > Read like call backs, like .getattr() and .permission() are meant to > be used, well, like read like functions so the VFS should be ok to > take locks or not based on the operation context at hand. > > So it's not about the locking for these call backs per se, it's about > the context in which they are called. > > For example, in link_path_walk(), at the beginning of the component > lookup loop (essentially for the containing directory at that point), > may_lookup() is called which leads to a call to .permission() without > any inode lock held at that point. > > But file opens (possibly following a path walk to resolve a path) > are different. > > For example, do_filp_open() calls path_openat() which leads to a > call to open_last_lookups(), which leads to a call to .permission() > along the way. And in this case there are two contexts, an open() > create or one without create, the former needing the exclusive inode > lock and the later able to use the shared lock. > > So it's about the locking needed for the encompassing operation that > is being done not about those functions specifically. > > TBH the VFS is very complex and Al has a much, much better > understanding of it than I do so he would need to be the one to answer > whether it's the file systems responsibility to use these calls in the > way the VFS expects. > > My belief is that if a file system needs to use a call back in a way > that's in conflict with what the VFS expects it's the file systems' > responsibility to deal with the side effects. > Thanks for clarifying. Ian. Yeah, it's complex and confusing and it's very hard to spot lock context by reading VFS code. I put code like this: if (lockdep_is_held_type(&inode->i_rwsem, -1)) { if (lockdep_is_held_type(&inode->i_rwsem, 0)) { pr_warn("kernfs iop_permission inode WRITE lock is held"); } else if (lockdep_is_held_type(&inode->i_rwsem, 1)) { pr_warn("kernfs iop_permission inode READ lock is held"); } } else { pr_warn("kernfs iop_permission inode lock is NOT held"); } in both .permission & .getattr. Then I do some open/read/write/close to /sys, interestingly, all log outputs suggest they are in WRITE lock context. and I put dump_stack() to the lock-is-held if branch, it prints a lot of following context: [ 481.795445] CPU: 0 PID: 1 Comm: systemd Not tainted 5.10.0-dirty #25 [ 481.795446] Hardware name: Parallels Software International Inc. Parallels Virtual Platform/Parallels Virtual Platform, BIOS 15.1.5 (47309) 09/26/2020 [ 481.795446] Call Trace: [ 481.795448] dump_stack (lib/dump_stack.c:120) [ 481.795450] kernfs_iop_permission (fs/kernfs/inode.c:295) [ 481.795452] inode_permission.part.0 (fs/namei.c:398 fs/namei.c:463) [ 481.795454] may_open (fs/namei.c:2875) [ 481.795456] path_openat (fs/namei.c:3250 fs/namei.c:3369) [ 481.795458] ? ___sys_sendmsg (net/socket.c:2411) [ 481.795460] ? preempt_count_add (./include/linux/ftrace.h:821 kernel/sched/core.c:4166 kernel/sched/core.c:4163 kernel/sched/core.c:4191) [ 481.795461] ? sock_poll (net/socket.c:1265) [ 481.795463] do_filp_open (fs/namei.c:3396) [ 481.795466] ? __check_object_size (mm/usercopy.c:240 mm/usercopy.c:286 mm/usercopy.c:256) [ 481.795469] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:102 ./include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183) [ 481.795470] do_sys_openat2 (fs/open.c:1168) [ 481.795472] __x64_sys_openat (fs/open.c:1195) [ 481.795473] do_syscall_64 (arch/x86/entry/common.c:46) [ 481.795475] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:127) [ 481.795476] RIP: 0033:0x7f6b31d69c94 Surprisingly, I didn't see the lock holding code along the path. thanks, fox