On Thu, Apr 4, 2019 at 5:40 PM Richard Guy Briggs <rgb@xxxxxxxxxx> wrote: > On 2019-04-02 07:31, Neil Horman wrote: > > On Mon, Apr 01, 2019 at 10:50:03AM -0400, Paul Moore wrote: > > > On Fri, Mar 15, 2019 at 2:35 PM Richard Guy Briggs <rgb@xxxxxxxxxx> wrote: > > > > Audit events could happen in a network namespace outside of a task > > > > context due to packets received from the net that trigger an auditing > > > > rule prior to being associated with a running task. The network > > > > namespace could be in use by multiple containers by association to the > > > > tasks in that network namespace. We still want a way to attribute > > > > these events to any potential containers. Keep a list per network > > > > namespace to track these audit container identifiiers. > > > > > > > > Add/increment the audit container identifier on: > > > > - initial setting of the audit container identifier via /proc > > > > - clone/fork call that inherits an audit container identifier > > > > - unshare call that inherits an audit container identifier > > > > - setns call that inherits an audit container identifier > > > > Delete/decrement the audit container identifier on: > > > > - an inherited audit container identifier dropped when child set > > > > - process exit > > > > - unshare call that drops a net namespace > > > > - setns call that drops a net namespace > > > > > > > > See: https://github.com/linux-audit/audit-kernel/issues/92 > > > > See: https://github.com/linux-audit/audit-testsuite/issues/64 > > > > See: https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID > > > > Signed-off-by: Richard Guy Briggs <rgb@xxxxxxxxxx> > > > > --- > > > > include/linux/audit.h | 19 ++++++++++++ > > > > kernel/audit.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++++-- > > > > kernel/nsproxy.c | 4 +++ > > > > 3 files changed, 106 insertions(+), 3 deletions(-) ... > > > > + list_for_each_entry_rcu(cont, contid_list, list) > > > > + if (cont->id == contid) { > > > > + refcount_inc(&cont->refcount); > > > > + goto out; > > > > + } > > > > + cont = kmalloc(sizeof(struct audit_contid), GFP_ATOMIC); > > > > > > If you had to guess, what do you think is going to be more common: > > > bumping the refcount of an existing entry in the list, or adding a new > > > entry? I'm asking because I always get a little nervous when doing > > > allocations while holding a spinlock. Yes, you are doing it with > > > GFP_ATOMIC, but it still seems like something to try and avoid if this > > > is going to approach 50%. However, if the new entry is rare then the > > > extra work of always doing the allocation before taking the lock and > > > then freeing it afterwards might be a bad tradeoff. > > > > > I think this is another way of asking, will multiple processes exist in the same > > network namespace? That is to say, will we expect a process to create a net > > namespace, and then have other processes enter that namespace (thereby > > triggering multiple calls to audit_netns_contid_add with the same net pointer > > and cont id). Given that the kubernetes notion of a pod is almost by definition > > multiple containers sharing a network namespace, I think the answer is that we > > will be strongly biased towards the refcount_inc case, rather than the kmalloc > > case. I could be wrong, but I think this is likely already in the optimized > > order. > > I had a stab at doing a GFP_KERNEL alloc before the spinlock and releasing it > after if it wasn't needed (in Patch#1 below). I also went one step further and > converted it to kmem_cache (in Patch#2 below). > > > > My gut feeling says we might do about as many allocations as refcount > > > bumps, but I could be thinking about this wrong. > > > > > > Moving the allocation outside the spinlock might also open the door to > > > doing this as GFP_KERNEL, which is a good thing, but I haven't looked > > > at the callers to see if that is possible (it may not be). That's an > > > exercise left to the patch author (if he hasn't done that already). > > Both appear to work, but after successfully running both the contid test and > audit_testsuite, once I start to push that test system a bit harder I end up > with a deadlock warning. > > I am having trouble understanding why since it happens both without and with > the kmem_cache options, so it must be another part of the code that is > triggering it. The task_lock is being held at this point in > audit_set_contid(). I haven't tried changing this alloc over to a GFP_ATOMIC > to see if that prevents it, just as a debugging check... > At this point, I'm inclined to leave it as it was without these two patches > since it works and there doesn't seem to be an obvious best way given the > uncertainty of the potential workloads. I'm definitely not a mm expert, but I wonder if you might be able to get this working using GFP_NOFS, but I see just now that they recommend you *not* use GFP_NOFS; even if this did solve the problem, it's probably not the right approach. I'm okay with leaving it as-is with GFP_ATOMIC. My original response to this was more of a question about optimizing for a given use case, having something that works is far more important. -- paul moore www.paul-moore.com