Re: [RFC PATCH] lsm: add the inode_free_security_rcu() LSM implementation hook

Christian Brauner <brauner@xxxxxxxxxx> · Tue, 23 Jul 2024 17:19:40 +0200

On Tue, Jul 23, 2024 at 01:34:44PM GMT, Dave Chinner wrote:
> On Mon, Jul 22, 2024 at 03:46:36PM -0400, Paul Moore wrote:
> > On Mon, Jul 22, 2024 at 8:30 AM Matus Jokay <matus.jokay@xxxxxxxx> wrote:
> > > On 10. 7. 2024 12:40, Mickaël Salaün wrote:
> > > > On Tue, Jul 09, 2024 at 10:40:30PM -0400, Paul Moore wrote:
> > > >> The LSM framework has an existing inode_free_security() hook which
> > > >> is used by LSMs that manage state associated with an inode, but
> > > >> due to the use of RCU to protect the inode, special care must be
> > > >> taken to ensure that the LSMs do not fully release the inode state
> > > >> until it is safe from a RCU perspective.
> > > >>
> > > >> This patch implements a new inode_free_security_rcu() implementation
> > > >> hook which is called when it is safe to free the LSM's internal inode
> > > >> state.  Unfortunately, this new hook does not have access to the inode
> > > >> itself as it may already be released, so the existing
> > > >> inode_free_security() hook is retained for those LSMs which require
> > > >> access to the inode.
> > > >>
> > > >> Signed-off-by: Paul Moore <paul@xxxxxxxxxxxxxx>
> > > >
> > > > I like this new hook.  It is definitely safer than the current approach.
> > > >
> > > > To make it more consistent, I think we should also rename
> > > > security_inode_free() to security_inode_put() to highlight the fact that
> > > > LSM implementations should not free potential pointers in this blob
> > > > because they could still be dereferenced in a path walk.
> > > >
> > > >> ---
> > > >>  include/linux/lsm_hook_defs.h     |  1 +
> > > >>  security/integrity/ima/ima.h      |  2 +-
> > > >>  security/integrity/ima/ima_iint.c | 20 ++++++++------------
> > > >>  security/integrity/ima/ima_main.c |  2 +-
> > > >>  security/landlock/fs.c            |  9 ++++++---
> > > >>  security/security.c               | 26 +++++++++++++-------------
> > > >>  6 files changed, 30 insertions(+), 30 deletions(-)
> > 
> > ...
> > 
> > > Sorry for the questions, but for several weeks I can't find answers to two things related to this RFC:
> > >
> > > 1) How does this patch close [1]?
> > >    As Mickaël pointed in [2], "It looks like security_inode_free() is called two times on the same inode."
> > >    Indeed, it does not seem from the backtrace that it is a case of race between destroy_inode and inode_permission,
> > >    i.e. referencing the inode in a VFS path walk while destroying it...
> > >    Please, can anyone tell me how this situation could have happened? Maybe folks from VFS... I added them to the copy.
> > 
> > The VFS folks can likely provide a better, or perhaps a more correct
> > answer, but my understanding is that during the path walk the inode is
> > protected by a RCU lock which allows for multiple threads to access
> > the inode simultaneously; this could result in some cases where one
> > thread is destroying the inode while another is accessing it.
> 
> Shouldn't may_lookup() be checking the inode for (I_NEW |
> I_WILLFREE | I_FREE) so that it doesn't access an inode either not
> completely initialised or being evicted during the RCU path walk?

Going from memory since I don't have time to go really into the weeds.

A non-completely initalised inode shouldn't appear in path lookup.
Before the inode is attached to a dentry I_NEW would have been removed
otherwise this is a bug. That can either happen via unlock_new_inode()
and d_splice_alias() or in some cases directly via d_instantiate_new().
Concurrent inode lookup calls on the same inode (e.g., iget_locked() and
friends) will sleep until I_NEW is cleared.

> All accesses to the VFS inode that don't have explicit reference
> counts have to do these checks...
> 
> IIUC, at the may_lookup() point, the RCU pathwalk doesn't have a
> fully validate reference count to the dentry or the inode at this
> point, so it seems accessing random objects attached to an inode
> that can be anywhere in the setup or teardown process isn't at all
> safe...

may_lookup() cannot encounter inodes in random states. It will start
from a well-known struct path and sample sequence counters for rename,
mount, and dentry changes. Each component will be subject to checks
after may_lookup() via these sequence counters to ensure that no change
occurred that would invalidate the lookup just done. To be precise to
ensure that no state could be reached via rcu that couldn't have been
reached via ref walk.

So may_lookup() may be called on something that's about to be freed
(concurrent unlink on a directory that's empty that we're trying to
create or lookup something nonexistent under) while we're looking at it
but all the machinery is in place so that it will be detected and force
a drop out of rcu and into reference walking mode.

When may_lookup() calls inode_permission() it only calls into the
filesystem itself if the filesystem has a custom i_op->permission()
handler. And if it has to call into the filesystem it passes
MAY_NOT_BLOCK to inform the filesystem about this. And in those cases
the filesystem must ensure any additional data structures can safely be
accessed under rcu_read_lock() (documented in path_lookup.rst).

If the filesystem detects that it cannot safely handle this or detects
that something is invalid it can return -ECHILD causing the VFS to drop
out of rcu and into ref walking mode to redo the lookup. That may happen
directly in may_lookup() it unconditionally happens in walk_component()
when it's verified that the parent had no changes while we were looking
at it.

The same logic extends to security modules. Both selinux and smack
handle MAY_NOT_BLOCK calls from security_inode_permission() with e.g.,
selinux returning -ECHILD in case the inode security context isn't
properly initialized causing the VFS to drop into ref walking mode and
allowing selinux to redo the initialization.

Checking inode state flags isn't needed because the VFS already handles
all of this via other means as e.g., sequence counters in various core
structs. It also likely wouldn't help because we'd have to take locks to
access i_state or sample i_state before calling into inode_permission()
and then it could still change behind our back. It's also the wrong
layer as we're dealing almost exclusively with dentries as the main data
structure during lookup.

Fwiw, a bunch of this is documented in path_lookup.rst, vfs.rst, and
porting.rst.

(I'm running out of time with other stuff so I want to point out that I
can't really spend a lot more time on this thread tbh.)