On Sat, 2020-12-19 at 15:47 +0800, Fox Chen wrote: > On Sat, Dec 19, 2020 at 8:53 AM Ian Kent <raven@xxxxxxxxxx> wrote: > > On Fri, 2020-12-18 at 21:20 +0800, Fox Chen wrote: > > > On Fri, Dec 18, 2020 at 7:21 PM Ian Kent <raven@xxxxxxxxxx> > > > wrote: > > > > On Fri, 2020-12-18 at 16:01 +0800, Fox Chen wrote: > > > > > On Fri, Dec 18, 2020 at 3:36 PM Ian Kent <raven@xxxxxxxxxx> > > > > > wrote: > > > > > > On Thu, 2020-12-17 at 10:14 -0500, Tejun Heo wrote: > > > > > > > Hello, > > > > > > > > > > > > > > On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote: > > > > > > > > > What could be done is to make the kernfs node > > > > > > > > > attr_mutex > > > > > > > > > a pointer and dynamically allocate it but even that > > > > > > > > > is > > > > > > > > > too > > > > > > > > > costly a size addition to the kernfs node structure > > > > > > > > > as > > > > > > > > > Tejun has said. > > > > > > > > > > > > > > > > I guess the question to ask is, is there really a need > > > > > > > > to > > > > > > > > call kernfs_refresh_inode() from functions that are > > > > > > > > usually > > > > > > > > reading/checking functions. > > > > > > > > > > > > > > > > Would it be sufficient to refresh the inode in the > > > > > > > > write/set > > > > > > > > operations in (if there's any) places where things like > > > > > > > > setattr_copy() is not already called? > > > > > > > > > > > > > > > > Perhaps GKH or Tejun could comment on this? > > > > > > > > > > > > > > My memory is a bit hazy but invalidations on reads is how > > > > > > > sysfs > > > > > > > namespace is > > > > > > > implemented, so I don't think there's an easy around > > > > > > > that. > > > > > > > The > > > > > > > only > > > > > > > thing I > > > > > > > can think of is embedding the lock into attrs and doing > > > > > > > xchg > > > > > > > dance > > > > > > > when > > > > > > > attaching it. > > > > > > > > > > > > Sounds like your saying it would be ok to add a lock to the > > > > > > attrs structure, am I correct? > > > > > > > > > > > > Assuming it is then, to keep things simple, use two locks. > > > > > > > > > > > > One global lock for the allocation and an attrs lock for > > > > > > all > > > > > > the > > > > > > attrs field updates including the kernfs_refresh_inode() > > > > > > update. > > > > > > > > > > > > The critical section for the global lock could be reduced > > > > > > and > > > > > > it > > > > > > changed to a spin lock. > > > > > > > > > > > > In __kernfs_iattrs() we would have something like: > > > > > > > > > > > > take the allocation lock > > > > > > do the allocated checks > > > > > > assign if existing attrs > > > > > > release the allocation lock > > > > > > return existing if found > > > > > > othewise > > > > > > release the allocation lock > > > > > > > > > > > > allocate and initialize attrs > > > > > > > > > > > > take the allocation lock > > > > > > check if someone beat us to it > > > > > > free and grab exiting attrs > > > > > > otherwise > > > > > > assign the new attrs > > > > > > release the allocation lock > > > > > > return attrs > > > > > > > > > > > > Add a spinlock to the attrs struct and use it everywhere > > > > > > for > > > > > > field updates. > > > > > > > > > > > > Am I on the right track or can you see problems with this? > > > > > > > > > > > > Ian > > > > > > > > > > > > > > > > umm, we update the inode in kernfs_refresh_inode, right?? So > > > > > I > > > > > guess > > > > > the problem is how can we protect the inode when > > > > > kernfs_refresh_inode > > > > > is called, not the attrs?? > > > > > > > > But the attrs (which is what's copied from) were protected by > > > > the > > > > mutex lock (IIUC) so dealing with the inode attributes implies > > > > dealing with the kernfs node attrs too. > > > > > > > > For example in kernfs_iop_setattr() the call to setattr_copy() > > > > copies > > > > the node attrs to the inode under the same mutex lock. So, if a > > > > read > > > > lock is used the copy in kernfs_refresh_inode() is no longer > > > > protected, > > > > it needs to be protected in a different way. > > > > > > > > > > Ok, I'm actually wondering why the VFS holds exclusive i_rwsem > > > for > > > .setattr but > > > no lock for .getattr (misdocumented?? sometimes they have as > > > you've > > > found out)? > > > What does it protect against?? Because .permission does a similar > > > thing > > > here -- updating inode attributes, the goal is to provide the > > > same > > > protection level > > > for .permission as for .setattr, am I right??? > > > > As far as the documentation goes that's probably my > > misunderstanding > > of it. > > > > It does happen that the VFS makes assumptions about how call backs > > are meant to be used. > > > > Read like call backs, like .getattr() and .permission() are meant > > to > > be used, well, like read like functions so the VFS should be ok to > > take locks or not based on the operation context at hand. > > > > So it's not about the locking for these call backs per se, it's > > about > > the context in which they are called. > > > > For example, in link_path_walk(), at the beginning of the component > > lookup loop (essentially for the containing directory at that > > point), > > may_lookup() is called which leads to a call to .permission() > > without > > any inode lock held at that point. > > > > But file opens (possibly following a path walk to resolve a path) > > are different. > > > > For example, do_filp_open() calls path_openat() which leads to a > > call to open_last_lookups(), which leads to a call to .permission() > > along the way. And in this case there are two contexts, an open() > > create or one without create, the former needing the exclusive > > inode > > lock and the later able to use the shared lock. > > > > So it's about the locking needed for the encompassing operation > > that > > is being done not about those functions specifically. > > > > TBH the VFS is very complex and Al has a much, much better > > understanding of it than I do so he would need to be the one to > > answer > > whether it's the file systems responsibility to use these calls in > > the > > way the VFS expects. > > > > My belief is that if a file system needs to use a call back in a > > way > > that's in conflict with what the VFS expects it's the file systems' > > responsibility to deal with the side effects. > > > > Thanks for clarifying. Ian. > > Yeah, it's complex and confusing and it's very hard to spot lock > context by reading VFS code. > > I put code like this: > if (lockdep_is_held_type(&inode->i_rwsem, -1)) { > if (lockdep_is_held_type(&inode->i_rwsem, 0)) { > pr_warn("kernfs iop_permission inode WRITE > lock is held"); > } else if (lockdep_is_held_type(&inode->i_rwsem, 1)) > { > pr_warn("kernfs iop_permission inode READ > lock > is held"); > } > } else { > pr_warn("kernfs iop_permission inode lock is NOT > held"); > } > > in both .permission & .getattr. Then I do some open/read/write/close > to /sys, interestingly, all log outputs suggest they are in WRITE > lock > context. The thing is in open_last_lookups() called from path_openat() we have: if (open_flag & O_CREAT) inode_lock(dir->d_inode); else inode_lock_shared(dir->d_inode); and from there it's - lookup_open() and possibly may_o_create() which calls inode_permission() and onto .permission(). So, strictly speaking, it should be taken exclusive because you would expect may_o_create() to be called only on O_CREATE. But all the possible cases would need to be checked and if it is taken shared anywhere we are in trouble. Another example is in link_path_walk() we have a call to may_lookup() which leads to a call to .permission() without the lock being held. So there are a bunch of cases to check and knowing you are the owner of the lock if it is held is at least risky when concurrency is high and there's a possibility the lock can be taken shared. Ian