audit watch and kernfs

Amir Goldstein <amir73il@xxxxxxxxx> · Fri, 27 Aug 2021 12:36:23 +0300

[Fork new thread from:
https://lore.kernel.org/linux-fsdevel/CAHC9VhT9SE6+kLYBh2d7CW5N6RCr=_ryK+ncGvqYJ51B7_egPA@xxxxxxxxxxxxxx/
and shrink CC list]

> > > >> Hi Amir,
> > > >>
> > > >> I think this is actually necessary.  I could identify at least one event
> > > >> (FS_CREATE | FS_ISDIR) where fsnotify is invoked with a NULL data field.
> > > >> In that case, fsnotify_dirent is called with a negative dentry from
> > > >> vfs_mkdir().  I'm not sure why exactly the dentry is negative after the
> > > >
> > > > That doesn't sound right at all.
> > > > Are you sure about this?
> > > > Which filesystem was this mkdir called on?
> > >
> > > You should be able to reproduce it on top of mainline if you pick only this
> > > patch and do the change you suggested:
> > >
> > >  -       sb = inode->i_sb;
> > >  +       sb = fsnotify_data_sb(data, data_type);
> > >
> > > And then boot a Debian stable with systemd.  The notification happens on
> > > the cgroup pseudo-filesystem (/sys/fs/cgroup), which is being monitored
> > > by systemd itself.  The event that arrives with a NULL data is telling the
> > > directory /sys/fs/cgroup/*/ about the creation of directory
> > > `init.scope`.
> > >
> > > The change above triggers the following null dereference of struct
> > > super_block, since data is NULL.
> > >
> > > I will keep looking but you might be able to answer it immediately...
> >
> > Yes, I see what is going on.
> >
> > cgroupfs is a sort of kernfs and kernfs_iop_mkdir() does not instantiate
> > the negative dentry. Instead, kernfs_dop_revalidate() always invalidates
> > negative dentries to force re-lookup to find the inode.
> >
> > Documentation/filesystems/vfs.rst says on create() and friends:
> > "...you will probably call d_instantiate() with the dentry and the
> >   newly created inode..."
> >
> > So this behavior seems legit.
> > Meaning that we have made a wrong assumption in fsnotify_create()
> > and fsnotify_mkdir().
> > Please note the comment above fsnotify_link() which anticipates
> > negative dentries.
> >
> > I've audited the fsnotify backends and it seems that the
> > WARN_ON(!inode) in kernel/audit_* is the only immediate implication
> > of negative dentry with FS_CREATE.
> > I am the one who added these WARN_ON(), so I will remove them.
> > I think that missing inode in an FS_CREATE event really breaks
> > audit on kernfs, but not sure if that is a valid use case (Paul?).
>
> While it is tempting to ignore kernfs from an audit filesystem watch
> perspective, I can see admins potentially wanting to watch
> kernfs/cgroupfs/other-config-pseudofs to detect who is potentially
> playing with the system config.  Arguably the most important config
> changes would already be audited if they were security relevant, but I
> could also see an admin wanting to watch for *any* changes so it's
> probably best not to rule out a kernfs based watch right now.
>
> I'm sure I'm missing some details, but from what I gather from the
> portion of the thread that I'm seeing, it looks like the audit issue
> lies in audit_mark_handle_event() and audit_watch_handle_event().  In
> both cases it looks like the functions are at least safe with a NULL
> inode pointer, even with the WARN_ON() removed; the problem being that

Correct. They are safe.

> the mark and watch will not be updated with the device and inode
> number which means the audit filters based on those marks/watches will
> not trigger.  Is that about right or did I read the thread and code a
> bit too quickly?
>

That is also my understanding of the code although I must admit
I did not try to test this setup.

> Working under the assumption that the above is close enough to
> correct, that is a bit of a problem as it means audit can't
> effectively watch kernfs based filesystems, although it sounds like it
> wasn't really working properly to begin with, yes?  Before I start

Correct. It seems it was always like that (I did not check history of kernfs)
but do note that users can set audit rules on specific kernfs directories or
files, it's only recursive rules on subtree may not work as expected

> thinking too hard about this, does anyone already have a great idea to
> fix this that they want to share?
>

One idea I had, it may not be a great idea, but I'll share it anyway :)

Introduce an event FS_INSTANTIATE that will only be exposed to
internal kernel fsnotify backends. It triggers when an inode is linked
into the dentry cache as a result of lookup() or mkdir() or whatever.

So for example, in case of audit watch on kernfs,
audit_watch_handle_event() will miss the FS_CREATE event, but
on the first user access to the new created directory, audit will get
the FS_INSTANTIATE event with child inode and be able to setup
the recursive watch rule.

I did not check if it is possible or easy to d_instantiate() in kernfs
mkdir() etc like other filesystems do and I do not know if it would be
possible to enforce that as a strict vfs API rather than a recommendation.

Thanks,
Amir.