Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog

"Dr. Greg" <greg@xxxxxxxxxxxx> · Wed, 20 Nov 2024 10:54:25 -0600

On Tue, Nov 19, 2024 at 10:14:29AM -0800, Casey Schaufler wrote:

Good morning, I hope the day is goning well for everyone.

> On 11/19/2024 4:27 AM, Dr. Greg wrote:
> > On Sun, Nov 17, 2024 at 10:59:18PM +0000, Song Liu wrote:
> >
> >> Hi Christian, James and Jan, 
> > Good morning, I hope the day is starting well for everyone.
> >
> >>> On Nov 14, 2024, at 1:49???PM, James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> >> [...]
> >>
> >>>> We can address this with something like following:
> >>>>
> >>>> #ifdef CONFIG_SECURITY
> >>>>         void                    *i_security;
> >>>> #elif CONFIG_BPF_SYSCALL
> >>>>         struct bpf_local_storage __rcu *i_bpf_storage;
> >>>> #endif
> >>>>
> >>>> This will help catch all misuse of the i_bpf_storage at compile
> >>>> time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y. 
> >>>>
> >>>> Does this make sense?
> >>> Got to say I'm with Casey here, this will generate horrible and failure
> >>> prone code.
> >>>
> >>> Since effectively you're making i_security always present anyway,
> >>> simply do that and also pull the allocation code out of security.c in a
> >>> way that it's always available?  That way you don't have to special
> >>> case the code depending on whether CONFIG_SECURITY is defined. 
> >>> Effectively this would give everyone a generic way to attach some
> >>> memory area to an inode.  I know it's more complex than this because
> >>> there are LSM hooks that run from security_inode_alloc() but if you can
> >>> make it work generically, I'm sure everyone will benefit.
> >> On a second thought, I think making i_security generic is not 
> >> the right solution for "BPF inode storage in tracing use cases". 
> >>
> >> This is because i_security serves a very specific use case: it 
> >> points to a piece of memory whose size is calculated at system 
> >> boot time. If some of the supported LSMs is not enabled by the 
> >> lsm= kernel arg, the kernel will not allocate memory in 
> >> i_security for them. The only way to change lsm= is to reboot 
> >> the system. BPF LSM programs can be disabled at the boot time, 
> >> which fits well in i_security. However, BPF tracing programs 
> >> cannot be disabled at boot time (even we change the code to 
> >> make it possible, we are not likely to disable BPF tracing). 
> >> IOW, as long as CONFIG_BPF_SYSCALL is enabled, we expect some 
> >> BPF tracing programs to load at some point of time, and these 
> >> programs may use BPF inode storage. 
> >>
> >> Therefore, with CONFIG_BPF_SYSCALL enabled, some extra memory 
> >> always will be attached to i_security (maybe under a different 
> >> name, say, i_generic) of every inode. In this case, we should 
> >> really add i_bpf_storage directly to the inode, because another 
> >> pointer jump via i_generic gives nothing but overhead. 
> >>
> >> Does this make sense? Or did I misunderstand the suggestion?
> > There is a colloquialism that seems relevant here: "Pick your poison".
> >
> > In the greater interests of the kernel, it seems that a generic
> > mechanism for attaching per inode information is the only realistic
> > path forward, unless Christian changes his position on expanding
> > the size of struct inode.
> >
> > There are two pathways forward.
> >
> > 1.) Attach a constant size 'blob' of storage to each inode.
> >
> > This is a similar approach to what the LSM uses where each blob is
> > sized as follows:
> >
> > S = U * sizeof(void *)
> >
> > Where U is the number of sub-systems that have a desire to use inode
> > specific storage.

> I can't tell for sure, but it looks like you don't understand how
> LSM i_security blobs are used. It is *not* the case that each LSM
> gets a pointer in the i_security blob. Each LSM that wants storage
> tells the infrastructure at initialization time how much space it
> wants in the blob. That can be a pointer, but usually it's a struct
> with flags, pointers and even lists.

I can state unequivocably for everyone's benefit, that as a team, we
have an intimate understanding of how LSM i_security blobs are used.

It was 0500 in the morning when I wrote the reply and I had personally
been working for 22 hours straight, so my apologies for being
imprecise.

I should not have specified sizeof(void *), I should have written
sizeof(allocation), for lack of a better syntax.

Also for the record, in a universal allocation scheme, when I say
sub-system I mean any implementation that would make use of per inode
information.  So the LSM, bpf tracing et.al., could all be considered
sub-systems that would register at boot time for a section of the
arena.

> > Each sub-system uses it's pointer slot to manage any additional
> > storage that it desires to attach to the inode.

> Again, an LSM may choose to do it that way, but most don't.  SELinux
> and Smack need data on every inode. It makes much more sense to put
> it directly in the blob than to allocate a separate chunk for every
> inode.

See my correction above.

> > This has the obvious advantage of O(1) cost complexity for any
> > sub-system that wants to access its inode specific storage.
> >
> > The disadvantage, as you note, is that it wastes memory if a
> > sub-system does not elect to attach per inode information, for example
> > the tracing infrastructure.

> To be clear, that disadvantage only comes up if the sub-system uses
> inode data on an occasional basis. If it never uses inode data there
> is no need to have a pointer to it.

I think we all agree on that, therein lies the rub with a common arena
architecture, which is why I indicated in my earlier e-mail that this
comes down to engineering trade-off decisions.

That is why there would be a probable assumption that such sub-systems
would only request a pointer per arena slot and use that to reference
a dynamically allocated structure.  If, as a group, we are really
concerned about inode memory consumption the assumption would be that
the maintainers would have to whine about sparse consumers requesting
a structure sized allocation rather than a pointer sized allocation.

> > This disadvantage is parried by the fact that it reduces the size of
> > the inode proper by 24 bytes (4 pointers down to 1) and allows future
> > extensibility without colliding with the interests and desires of the
> > VFS maintainers.

> You're adding a level of indirection. Even I would object based on
> the performance impact.

I'm not sure that a thorough and complete analysis of the costs
associated with a sub-system touching inode local storage would
support the notion of a tangible performance hit.

The pointer in an arena slot would presumably be a pointer to a data
structure that a sub-system allocates at inode creation time.  After
computing the arena slot address in classic style (i_arena + offset)
the sub-system uses the pointer at that location to dereference
its structure elements or to find subordinate members in its arena.

If we take SMACK as an example, the smack inode contains three
pointers and a scalar.  So, if there is a need to access storage behind
one of those pointers, there is an additional indirection hit.

The three pointers are each to a structure (smack_known) that has
three list pointers and a mutex lock inside of it.

The SeLinux inode has a back pointer to the sponsoring inode, a list
head, a spinlock and some scalars.

So there is lots of potential indirection and locking going on with
access to inode local storage.

To extend further, for everyone thinking about this from an
engineering perspective.

A common arena model where everyone asks for a structure sized blob is
inherently cache pessimal.  Unless you are the first blob in the arena
you are going to need to hit another cache-line in order to start the
indirection process for your structure.

A pointer based arena architecture would allow up to eight sub-systems
to get their inode storage pointer for the cost of a single cache-line
fetch.

Let me offer another line of thinking on this drawn from the
discussion above.

A further optimization in the single pointer arena model is for the
LSM to place a pointer to a standard LSM sized memory blob in its
pointer slot on behalf of all the individual LSM's.  All of the
individual participating LSM's take that pointer and do the offset
calculation into the LSM arena for that inode as they normally would.

So there would seem to be a lot of engineering issues to consider that
are beyond the simple predicate that indirection is bad.

See, I do understand how the LSM arena model works.

> > 2.) Implement key/value mapping for inode specific storage.
> >
> > The key would be a sub-system specific numeric value that returns a
> > pointer the sub-system uses to manage its inode specific memory for a
> > particular inode.
> >
> > A participating sub-system in turn uses its identifier to register an
> > inode specific pointer for its sub-system.
> >
> > This strategy loses O(1) lookup complexity but reduces total memory
> > consumption and only imposes memory costs for inodes when a sub-system
> > desires to use inode specific storage.

> SELinux and Smack use an inode blob for every inode. The performance
> regression boggles the mind. Not to mention the additional
> complexity of managing the memory.

I guess we would have to measure the performance impacts to understand
their level of mind boggliness.

My first thought is that we hear a huge amount of fanfare about BPF
being a game changer for tracing and network monitoring.  Given
current networking speeds, if its ability to manage storage needed for
it purposes are truely abysmal the industry wouldn't be finding the
technology useful.

Beyond that.

As I noted above, the LSM could be an independent subscriber.  The
pointer to register would come from the the kmem_cache allocator as it
does now, so that cost is idempotent with the current implementation.
The pointer registration would also be a single instance cost.

So the primary cost differential over the common arena model will be
the complexity costs associated with lookups in a red/black tree, if
we used the old IMA integrity cache as an example implementation.

As I noted above, these per inode local storage structures are complex
in of themselves, including lists and locks.  If touching an inode
involves locking and walking lists and the like it would seem that
those performance impacts would quickly swamp an r/b lookup cost.

> > Approach 2 requires the introduction of generic infrastructure that
> > allows an inode's key/value mappings to be located, presumably based
> > on the inode's pointer value.  We could probably just resurrect the
> > old IMA iint code for this purpose.
> >
> > In the end it comes down to a rather standard trade-off in this
> > business, memory vs. execution cost.
> >
> > We would posit that option 2 is the only viable scheme if the design
> > metric is overall good for the Linux kernel eco-system.

> No. Really, no. You need look no further than secmarks to understand
> how a key based blob allocation scheme leads to tears. Keys are fine
> in the case where use of data is sparse. They have no place when data
> use is the norm.

Then it would seem that we need to get everyone to agree that we can
get by with using two pointers in struct inode.  One for uses best
served by common arena allocation and one for a key/pointer mapping,
and then convert the sub-systems accordingly.

Or alternately, getting everyone to agree that allocating a mininum of
eight additional bytes for every subscriber to private inode data
isn't the end of the world, even if use of the resource is sparse.

Of course, experience would suggest, that getting everyone in this
community to agree on something is roughly akin to throwing a hand
grenade into a chicken coop with an expectation that all of the
chickens will fly out in a uniform flock formation.

As always,
Dr. Greg

The Quixote Project - Flailing at the Travails of Cybersecurity
              https://github.com/Quixote-Project