Re: [RFC PATCH] lsm: add the inode_free_security_rcu() LSM implementation hook

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 24 Jul 2024 09:17:29 +1000

On Tue, Jul 23, 2024 at 05:19:40PM +0200, Christian Brauner wrote:
> On Tue, Jul 23, 2024 at 01:34:44PM GMT, Dave Chinner wrote:
> > All accesses to the VFS inode that don't have explicit reference
> > counts have to do these checks...
> > 
> > IIUC, at the may_lookup() point, the RCU pathwalk doesn't have a
> > fully validate reference count to the dentry or the inode at this
> > point, so it seems accessing random objects attached to an inode
> > that can be anywhere in the setup or teardown process isn't at all
> > safe...
> 
> may_lookup() cannot encounter inodes in random states. It will start
> from a well-known struct path and sample sequence counters for rename,
> mount, and dentry changes. Each component will be subject to checks
> after may_lookup() via these sequence counters to ensure that no change
> occurred that would invalidate the lookup just done. To be precise to
> ensure that no state could be reached via rcu that couldn't have been
> reached via ref walk.
> 
> So may_lookup() may be called on something that's about to be freed
> (concurrent unlink on a directory that's empty that we're trying to
> create or lookup something nonexistent under) while we're looking at it
> but all the machinery is in place so that it will be detected and force
> a drop out of rcu and into reference walking mode.

Yes, but...

> When may_lookup() calls inode_permission() it only calls into the
> filesystem itself if the filesystem has a custom i_op->permission()
> handler. And if it has to call into the filesystem it passes
> MAY_NOT_BLOCK to inform the filesystem about this. And in those cases
> the filesystem must ensure any additional data structures can safely be
> accessed under rcu_read_lock() (documented in path_lookup.rst).

The problem isn't the call into the filesystem - it's may_lookup()
passing the inode to security modules where we dereference
inode->i_security and assume that it is valid for the life of the
object access being made.

That's my point - if we have a lookup race and the inode is being
destroyed at this point (i.e. I_FREEING is set) inode->i_security
*is not valid* and should not be accessed by *anyone*.

> Checking inode state flags isn't needed because the VFS already handles
> all of this via other means as e.g., sequence counters in various core
> structs.

I don't think that is sufficient to avoid races with inode teardown.
The inode can be destroyed between the initial dentry count sequence
sampling and the "done processing, check that the seq count is
unchanged" validation to solidify the path.

We've seen this before with ->get_link fast path that stores the
link name in inode->i_link, and both inode->i_link and ->get_link
are accessed during RCU It is documented as such:

	If the filesystem stores the symlink target in ->i_link, the
        VFS may use it directly without calling ->get_link(); however,
        ->get_link() must still be provided.  ->i_link must not be
        freed until after an RCU grace period.  Writing to ->i_link
        post-iget() time requires a 'release' memory barrier.

XFS doesn't use RCU mode ->get_link or use ->i_link anymore, because
its has a corner case in it's inode life cycle since 2007 where it
can recycle inodes before the RCU grace period expires. It took 15
years for this issue to be noticed, but the fix for this specific
symptom is to not allow the VFS direct access to the underlying
filesystem memory that does not follow VFS inode RCU lifecycle
rules.

There was another symptom of this issue - ->get_link changed (i.e.
went to null) on certain types of inode reuse - and that caused
pathwalk to panic on a NULL pointer. Ian posted this fix for the
issue:

https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@xxxxxxxxxxxxxxxxx/

Which revalidates the dentry sequence number before calling
->get_link(). This clearly demonstrates we cannot rely on the
existing pathwalk dentry sequence number checks to catch an inode
concurrently moving into I_FREEING state and changing/freeing inode
object state concurrently in a way that affects the pathwalk
operation.

My point here is not about XFS - my point is that every object
attached to an inode that is accessed during a RCU pathwalk *must*
follow the same rules as memory attached to inode->i_link. The only
alternative to that (i.e. if we continue to free RCU pathwalk
visible objects in the evict() path) is to prevent pathwalk from
tripping over I_FREEING inodes.

If we decide that every pathwalk accessed object attached to the
inode needs to be RCU freed, then __destroy_inode() is the wrong
place to be freeing them - i_callback() (the call_rcu() inode
freeing callback) is the place to be freeing these objects. At this
point, the subsystems that own the objects don't have to care about
RCU at all - the objects have already gone through the necessary
grace period that makes them safe to be freed immediately.

That's a far better solution than forcing every LSM and FS developer
to have to fully understand how pathwalk and inode lifecycles
interact to get stuff right...

> It also likely wouldn't help because we'd have to take locks to
> access i_state or sample i_state before calling into inode_permission()
> and then it could still change behind our back. It's also the wrong
> layer as we're dealing almost exclusively with dentries as the main data
> structure during lookup.

Yup, I never said that's how we should fix the problem.  All I'm
stating is that pathwalk vs I_FREEING is currently not safe and that
I_FREEING is detectable from the pathwalk side. This observation
could help provide a solution, but it's almost certainly not the
solution to the problem...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx