On Tue, Jul 23, 2024 at 05:19:40PM +0200, Christian Brauner wrote: > On Tue, Jul 23, 2024 at 01:34:44PM GMT, Dave Chinner wrote: > > All accesses to the VFS inode that don't have explicit reference > > counts have to do these checks... > > > > IIUC, at the may_lookup() point, the RCU pathwalk doesn't have a > > fully validate reference count to the dentry or the inode at this > > point, so it seems accessing random objects attached to an inode > > that can be anywhere in the setup or teardown process isn't at all > > safe... > > may_lookup() cannot encounter inodes in random states. It will start > from a well-known struct path and sample sequence counters for rename, > mount, and dentry changes. Each component will be subject to checks > after may_lookup() via these sequence counters to ensure that no change > occurred that would invalidate the lookup just done. To be precise to > ensure that no state could be reached via rcu that couldn't have been > reached via ref walk. > > So may_lookup() may be called on something that's about to be freed > (concurrent unlink on a directory that's empty that we're trying to > create or lookup something nonexistent under) while we're looking at it > but all the machinery is in place so that it will be detected and force > a drop out of rcu and into reference walking mode. Yes, but... > When may_lookup() calls inode_permission() it only calls into the > filesystem itself if the filesystem has a custom i_op->permission() > handler. And if it has to call into the filesystem it passes > MAY_NOT_BLOCK to inform the filesystem about this. And in those cases > the filesystem must ensure any additional data structures can safely be > accessed under rcu_read_lock() (documented in path_lookup.rst). The problem isn't the call into the filesystem - it's may_lookup() passing the inode to security modules where we dereference inode->i_security and assume that it is valid for the life of the object access being made. That's my point - if we have a lookup race and the inode is being destroyed at this point (i.e. I_FREEING is set) inode->i_security *is not valid* and should not be accessed by *anyone*. > Checking inode state flags isn't needed because the VFS already handles > all of this via other means as e.g., sequence counters in various core > structs. I don't think that is sufficient to avoid races with inode teardown. The inode can be destroyed between the initial dentry count sequence sampling and the "done processing, check that the seq count is unchanged" validation to solidify the path. We've seen this before with ->get_link fast path that stores the link name in inode->i_link, and both inode->i_link and ->get_link are accessed during RCU It is documented as such: If the filesystem stores the symlink target in ->i_link, the VFS may use it directly without calling ->get_link(); however, ->get_link() must still be provided. ->i_link must not be freed until after an RCU grace period. Writing to ->i_link post-iget() time requires a 'release' memory barrier. XFS doesn't use RCU mode ->get_link or use ->i_link anymore, because its has a corner case in it's inode life cycle since 2007 where it can recycle inodes before the RCU grace period expires. It took 15 years for this issue to be noticed, but the fix for this specific symptom is to not allow the VFS direct access to the underlying filesystem memory that does not follow VFS inode RCU lifecycle rules. There was another symptom of this issue - ->get_link changed (i.e. went to null) on certain types of inode reuse - and that caused pathwalk to panic on a NULL pointer. Ian posted this fix for the issue: https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@xxxxxxxxxxxxxxxxx/ Which revalidates the dentry sequence number before calling ->get_link(). This clearly demonstrates we cannot rely on the existing pathwalk dentry sequence number checks to catch an inode concurrently moving into I_FREEING state and changing/freeing inode object state concurrently in a way that affects the pathwalk operation. My point here is not about XFS - my point is that every object attached to an inode that is accessed during a RCU pathwalk *must* follow the same rules as memory attached to inode->i_link. The only alternative to that (i.e. if we continue to free RCU pathwalk visible objects in the evict() path) is to prevent pathwalk from tripping over I_FREEING inodes. If we decide that every pathwalk accessed object attached to the inode needs to be RCU freed, then __destroy_inode() is the wrong place to be freeing them - i_callback() (the call_rcu() inode freeing callback) is the place to be freeing these objects. At this point, the subsystems that own the objects don't have to care about RCU at all - the objects have already gone through the necessary grace period that makes them safe to be freed immediately. That's a far better solution than forcing every LSM and FS developer to have to fully understand how pathwalk and inode lifecycles interact to get stuff right... > It also likely wouldn't help because we'd have to take locks to > access i_state or sample i_state before calling into inode_permission() > and then it could still change behind our back. It's also the wrong > layer as we're dealing almost exclusively with dentries as the main data > structure during lookup. Yup, I never said that's how we should fix the problem. All I'm stating is that pathwalk vs I_FREEING is currently not safe and that I_FREEING is detectable from the pathwalk side. This observation could help provide a solution, but it's almost certainly not the solution to the problem... -Dave. -- Dave Chinner david@xxxxxxxxxxxxx