> On Nov 19, 2024, at 10:14 AM, Casey Schaufler <casey@xxxxxxxxxxxxxxxx> wrote:
> 
> On 11/19/2024 4:27 AM, Dr. Greg wrote:
>> On Sun, Nov 17, 2024 at 10:59:18PM +0000, Song Liu wrote:
>> 
>>> Hi Christian, James and Jan,
>> Good morning, I hope the day is starting well for everyone.
>> 
>>>> On Nov 14, 2024, at 1:49 PM, James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>> [...]
>>> 
>>>>> We can address this with something like the following:
>>>>> 
>>>>> #ifdef CONFIG_SECURITY
>>>>>         void *i_security;
>>>>> #elif CONFIG_BPF_SYSCALL
>>>>>         struct bpf_local_storage __rcu *i_bpf_storage;
>>>>> #endif
>>>>> 
>>>>> This will help catch all misuse of i_bpf_storage at compile
>>>>> time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y.
>>>>> 
>>>>> Does this make sense?
>>>> Got to say I'm with Casey here, this will generate horrible and
>>>> failure-prone code.
>>>> 
>>>> Since effectively you're making i_security always present anyway,
>>>> simply do that and also pull the allocation code out of security.c
>>>> in a way that it's always available? That way you don't have to
>>>> special-case the code depending on whether CONFIG_SECURITY is
>>>> defined. Effectively this would give everyone a generic way to
>>>> attach some memory area to an inode. I know it's more complex than
>>>> this because there are LSM hooks that run from
>>>> security_inode_alloc(), but if you can make it work generically,
>>>> I'm sure everyone will benefit.
>>> On second thought, I think making i_security generic is not the
>>> right solution for "BPF inode storage in tracing use cases".
>>> 
>>> This is because i_security serves a very specific use case: it
>>> points to a piece of memory whose size is calculated at system
>>> boot time. If some of the supported LSMs are not enabled by the
>>> lsm= kernel arg, the kernel will not allocate memory in i_security
>>> for them. The only way to change lsm= is to reboot the system. BPF
>>> LSM programs can be disabled at boot time, which fits well with
>>> i_security. However, BPF tracing programs cannot be disabled at
>>> boot time (even if we changed the code to make that possible, we
>>> are not likely to disable BPF tracing). IOW, as long as
>>> CONFIG_BPF_SYSCALL is enabled, we expect some BPF tracing programs
>>> to load at some point, and these programs may use BPF inode
>>> storage.
>>> 
>>> Therefore, with CONFIG_BPF_SYSCALL enabled, some extra memory will
>>> always be attached to i_security (maybe under a different name,
>>> say, i_generic) of every inode. In this case, we should really add
>>> i_bpf_storage directly to the inode, because another pointer jump
>>> via i_generic gives nothing but overhead.
>>> 
>>> Does this make sense? Or did I misunderstand the suggestion?
>> There is a colloquialism that seems relevant here: "Pick your
>> poison".
>> 
>> In the greater interests of the kernel, it seems that a generic
>> mechanism for attaching per-inode information is the only realistic
>> path forward, unless Christian changes his position on expanding
>> the size of struct inode.
>> 
>> There are two pathways forward.
>> 
>> 1.) Attach a constant-size 'blob' of storage to each inode.
>> 
>> This is a similar approach to what the LSM uses, where each blob is
>> sized as follows:
>> 
>> S = U * sizeof(void *)
>> 
>> Where U is the number of sub-systems that have a desire to use
>> inode-specific storage.
> 
> I can't tell for sure, but it looks like you don't understand how
> LSM i_security blobs are used. It is *not* the case that each LSM
> gets a pointer in the i_security blob. Each LSM that wants storage
> tells the infrastructure at initialization time how much space it
> wants in the blob. That can be a pointer, but usually it's a struct
> with flags, pointers and even lists.
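
For reference, here is roughly what that mechanism looks like today,
using Smack as the example (a simplified sketch of the code in
security/security.c and security/smack/, with details elided):

/* Each LSM declares how much blob space it needs per object. */
struct lsm_blob_sizes smack_blob_sizes __ro_after_init = {
        .lbs_inode = sizeof(struct inode_smack),
        /* ... */
};

DEFINE_LSM(smack) = {
        .name = "smack",
        .blobs = &smack_blob_sizes,
        /* ... */
};

/* At boot, the infrastructure sums the sizes requested by all
 * enabled LSMs and rewrites each lbs_* field to that LSM's offset
 * into the blob. An LSM then finds its data at a fixed offset from
 * i_security: */
static inline struct inode_smack *smack_inode(const struct inode *inode)
{
        return inode->i_security + smack_blob_sizes.lbs_inode;
}

So each enabled LSM gets a fixed-offset region inside one per-inode
allocation, not a pointer slot.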
>> Each sub-system uses its pointer slot to manage any additional
>> storage that it desires to attach to the inode.
> 
> Again, an LSM may choose to do it that way, but most don't.
> SELinux and Smack need data on every inode. It makes much more sense
> to put it directly in the blob than to allocate a separate chunk
> for every inode.

AFAICT, i_security is unique in that its size is calculated at boot
time. I guess we will just keep most LSM users behind i_security.

> 
>> This has the obvious advantage of O(1) cost complexity for any
>> sub-system that wants to access its inode-specific storage.
>> 
>> The disadvantage, as you note, is that it wastes memory if a
>> sub-system does not elect to attach per-inode information, for
>> example the tracing infrastructure.
> 
> To be clear, that disadvantage only comes up if the sub-system uses
> inode data on an occasional basis. If it never uses inode data there
> is no need to have a pointer to it.
> 
>> This disadvantage is parried by the fact that it reduces the size
>> of the inode proper by 24 bytes (4 pointers down to 1) and allows
>> future extensibility without colliding with the interests and
>> desires of the VFS maintainers.
> 
> You're adding a level of indirection. Even I would object based on
> the performance impact.
> 
>> 2.) Implement key/value mapping for inode-specific storage.
>> 
>> The key would be a sub-system-specific numeric value; looking it up
>> returns a pointer that the sub-system uses to manage its
>> inode-specific memory for a particular inode.
>> 
>> A participating sub-system in turn uses its identifier to register
>> an inode-specific pointer.
>> 
>> This strategy loses O(1) lookup complexity but reduces total memory
>> consumption and only imposes memory costs for inodes when a
>> sub-system desires to use inode-specific storage.
> 
> SELinux and Smack use an inode blob for every inode. The performance
> regression boggles the mind. Not to mention the additional
> complexity of managing the memory.
> 
>> Approach 2 requires the introduction of generic infrastructure that
>> allows an inode's key/value mappings to be located, presumably
>> based on the inode's pointer value. We could probably just
>> resurrect the old IMA iint code for this purpose.
>> 
>> In the end it comes down to a rather standard trade-off in this
>> business: memory vs. execution cost.
>> 
>> We would posit that option 2 is the only viable scheme if the
>> design metric is overall good for the Linux kernel ecosystem.
> 
> No. Really, no. You need look no further than secmarks to understand
> how a key-based blob allocation scheme leads to tears. Keys are fine
> in the case where use of data is sparse. They have no place when
> data use is the norm.

OTOH, I think some on-demand key-value storage makes sense for many
other use cases, such as BPF (LSM and tracing), file locks, fanotify,
etc.

Overall, I think we have 3 types of storage attached to an inode:

1. Embedded in struct inode, gated by CONFIG_*.
2. Behind i_security (or maybe call it a different name if we can
   find other uses for it). The size is calculated at boot time.
3. Behind a key-value storage.

To evaluate these categories, we have:

  Speed:       1 > 2 > 3
  Flexibility: 3 > 2 > 1

We don't really have 3 right now; I think the direction is to create
it. BPF inode storage is a key-value store. If we can get another
user for 3, in addition to BPF inode storage, it should be a net win.
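
To make the comparison concrete, this is what 3 already looks like
from the BPF side (a minimal sketch of a BPF LSM program; the map and
program names are made up for illustration):

/* Count opens per inode using BPF inode storage. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
        __uint(type, BPF_MAP_TYPE_INODE_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, __u64);
} open_count SEC(".maps");

SEC("lsm/file_open")
int BPF_PROG(count_open, struct file *file)
{
        __u64 *cnt;

        /* Look up this inode's slot, creating it on first use. */
        cnt = bpf_inode_storage_get(&open_count, file->f_inode, 0,
                                    BPF_LOCAL_STORAGE_GET_F_CREATE);
        if (cnt)
                __sync_fetch_and_add(cnt, 1);
        return 0;
}

char LICENSE[] SEC("license") = "GPL";

Today this map type is only usable from LSM programs, with the
storage hanging off the bpf LSM's share of i_security; the open
question in this thread is where the equivalent pointer should live
for tracing programs.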
Does this sound like a viable path forward?

Thanks,
Song