Re: [PATCH v2] fs: try an opportunistic lookup for O_CREAT opens too

Jeff Layton <jlayton@xxxxxxxxxx> · Thu, 08 Aug 2024 20:33:24 -0400

On Thu, 2024-08-08 at 20:28 -0400, Paul Moore wrote:
> On Thu, Aug 8, 2024 at 7:43 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > On Thu, 2024-08-08 at 17:12 -0400, Paul Moore wrote:
> > > On Thu, Aug 8, 2024 at 1:11 PM Jan Kara <jack@xxxxxxx> wrote:
> > > > On Thu 08-08-24 12:36:07, Christian Brauner wrote:
> > > > > On Wed, Aug 07, 2024 at 10:36:58AM GMT, Jeff Layton wrote:
> > > > > > On Wed, 2024-08-07 at 16:26 +0200, Christian Brauner wrote:
> > > > > > > > +static struct dentry *lookup_fast_for_open(struct nameidata *nd, int open_flag)
> > > > > > > > +{
> > > > > > > > +       struct dentry *dentry;
> > > > > > > > +
> > > > > > > > +       if (open_flag & O_CREAT) {
> > > > > > > > +               /* Don't bother on an O_EXCL create */
> > > > > > > > +               if (open_flag & O_EXCL)
> > > > > > > > +                       return NULL;
> > > > > > > > +
> > > > > > > > +               /*
> > > > > > > > +                * FIXME: If auditing is enabled, then we'll have to unlazy to
> > > > > > > > +                * use the dentry. For now, don't do this, since it shifts
> > > > > > > > +                * contention from parent's i_rwsem to its d_lockref spinlock.
> > > > > > > > +                * Reconsider this once dentry refcounting handles heavy
> > > > > > > > +                * contention better.
> > > > > > > > +                */
> > > > > > > > +               if ((nd->flags & LOOKUP_RCU) && !audit_dummy_context())
> > > > > > > > +                       return NULL;
> > > > > > > 
> > > > > > > Hm, the audit_inode() on the parent is done independent of whether the
> > > > > > > file was actually created or not. But the audit_inode() on the file
> > > > > > > itself is only done when it was actually created. Imho, there's no need
> > > > > > > to do audit_inode() on the parent when we immediately find that file
> > > > > > > already existed. If we accept that then this makes the change a lot
> > > > > > > simpler.
> > > > > > > 
> > > > > > > The inconsistency would partially remain though. When the file doesn't
> > > > > > > exist audit_inode() on the parent is called but by the time we've
> > > > > > > grabbed the inode lock someone else might already have created the file
> > > > > > > and then again we wouldn't audit_inode() on the file but we would have
> > > > > > > on the parent.
> > > > > > > 
> > > > > > > I think that's fine. But if that's bothersome the more aggressive thing
> > > > > > > to do would be to pull that audit_inode() on the parent further down
> > > > > > > after we created the file. Imho, that should be fine?...
> > > > > > > 
> > > > > > > See https://gitlab.com/brauner/linux/-/commits/vfs.misc.jeff/?ref_type=heads
> > > > > > > for a completely untested draft of what I mean.
> > > > > > 
> > > > > > Yeah, that's a lot simpler. That said, my experience when I've worked
> > > > > > with audit in the past is that people who are using it are _very_
> > > > > > sensitive to changes of when records get emitted or not. I don't like
> > > > > > this, because I think the rules here are ad-hoc and somewhat arbitrary,
> > > > > > but keeping everything working exactly the same has been my MO whenever
> > > > > > I have to work in there.
> > > > > > 
> > > > > > If a certain access pattern suddenly generates a different set of
> > > > > > records (or some are missing, as would be in this case), we might get
> > > > > > bug reports about this. I'm ok with simplifying this code in the way
> > > > > > you suggest, but we may want to do it in a patch on top of mine, to
> > > > > > make it simple to revert later if that becomes necessary.
> > > > > 
> > > > > Fwiw, even with the rearranged checks in v3 of the patch audit records
> > > > > will be dropped because we may find a positive dentry but the path may
> > > > > have trailing slashes. At that point we just return without audit
> > > > > whereas before we always would've done that audit.
> > > > > 
> > > > > Honestly, we should move that audit event as right now it's just really
> > > > > weird and see if that works. Otherwise the change is somewhat horrible
> > > > > complicating the already convoluted logic even more.
> > > > > 
> > > > > So I'm appending the patches that I have on top of your patch in
> > > > > vfs.misc. Can you (other as well ofc) take a look and tell me whether
> > > > > that's not breaking anything completely other than later audit events?
> > > > 
> > > > The changes look good as far as I'm concerned but let me CC audit guys if
> > > > they have some thoughts regarding the change in generating audit event for
> > > > the parent. Paul, does it matter if open(O_CREAT) doesn't generate audit
> > > > event for the parent when we are failing open due to trailing slashes in
> > > > the pathname? Essentially we are speaking about moving:
> > > > 
> > > >         audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
> > > > 
> > > > from open_last_lookups() into lookup_open().
> > > 
> > > Thanks for adding the audit mailing list to the CC, Jan.  I would ask
> > > for others to do the same when discussing changes that could impact
> > > audit (similar requests for the LSM framework, SELinux, etc.).
> > > 
> > > The inode/path logging in audit is ... something.  I have a
> > > longstanding todo item to go revisit the audit inode logging, both to
> > > fix some known bugs, and see what we can improve (I'm guessing quite a
> > > bit).  Unfortunately, there is always something else which is burning
> > > a little bit hotter and I haven't been able to get to it yet.
> > > 
> > 
> > It is "something" alright. The audit logging just happens at strange
> > and inconvenient times vs. what else we're trying to do wrt pathwalking
> > and such. In particular here, the fact __audit_inode can block is what
> > really sucks.
> > 
> > Since we're discussing it...
> > 
> > ISTM that the inode/path logging here is something like a tracepoint.
> > In particular, we're looking to record a specific set of information at
> > specific points in the code. One of the big differences between them
> > however is that tracepoints don't block.  The catch is that we can't
> > just drop messages if we run out of audit logging space, so that would
> > have to be handled reasonably.
> 
> Yes, the buffer allocation is the tricky bit.  Audit does preallocate
> some structs for tracking names which ideally should handle the vast
> majority of the cases, but yes, we need something to handle all of the
> corner cases too without having to resort to audit_panic().
> 
> > I wonder if we could leverage the tracepoint infrastructure to help us
> > record the necessary info somehow? Copy the records into a specific
> > ring buffer, and then copy them out to the audit infrastructure in
> > task_work?
> 
> I believe using task_work will cause a number of challenges for the
> audit subsystem as we try to bring everything together into a single
> audit event.  We've had a lot of problems with io_uring doing similar
> things, some of which are still unresolved.
> 
> > I don't have any concrete ideas here, but the path/inode audit code has
> > been a burden for a while now and it'd be good to think about how we
> > could do this better.
> 
> I've got some grand ideas on how to cut down on a lot of our
> allocations and string generation in the critical path, not just with
> the inodes, but with audit records in general.  Sadly I just haven't
> had the time to get to any of it.
> 
> > > The general idea with audit is that you want to record the information
> > > both on success and failure.  It's easy to understand the success
> > > case, as it is a record of what actually happened on the system, but
> > > you also want to record the failure case as it can provide some
> > > insight on what a process/user is attempting to do, and that can be
> > > very important for certain classes of users.  I haven't dug into the
> > > patches in Christian's tree, but in general I think Jeff's guidance
> > > about not changing what is recorded in the audit log is probably good
> > > advice (there will surely be exceptions to that, but it's still good
> > > guidance).
> > > 
> > 
> > In this particular case, the question is:
> > 
> > Do we need to emit a AUDIT_INODE_PARENT record when opening an existing
> > file, just because O_CREAT was set? We don't emit such a record when
> > opening without O_CREAT set.
> 
> I'm not as current on the third-party security requirements as I used
> to be, but I do know that oftentimes when a file is created the parent
> directory is an important bit of information to have in the audit log.
> 

Right. We'd still have that here since we have to unlazy to actually
create the file.

The question here is about the case where O_CREAT is set, but the file
already exists. Nothing is being created in that case, so do we need to
emit an audit record for the parent?
-- 
Jeff Layton <jlayton@xxxxxxxxxx>