Re: [PATCH v3 4/6] mm/slab: Introduce kmem_buckets_create() and family

Vlastimil Babka <vbabka@xxxxxxx> · Fri, 24 May 2024 15:43:33 +0200

On 4/24/24 11:41 PM, Kees Cook wrote:
> Dedicated caches are available for fixed size allocations via
> kmem_cache_alloc(), but for dynamically sized allocations there is only
> the global kmalloc API's set of buckets available. This means it isn't
> possible to separate specific sets of dynamically sized allocations into
> a separate collection of caches.
> 
> This leads to a use-after-free exploitation weakness in the Linux
> kernel since many heap memory spraying/grooming attacks depend on using
> userspace-controllable dynamically sized allocations to collide with
> fixed size allocations that end up in same cache.
> 
> While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
> against these kinds of "type confusion" attacks, including for fixed
> same-size heap objects, we can create a complementary deterministic
> defense for dynamically sized allocations that are directly user
> controlled. Addressing these cases is limited in scope, so isolation these
> kinds of interfaces will not become an unbounded game of whack-a-mole. For
> example, pass through memdup_user(), making isolation there very
> effective.
> 
> In order to isolate user-controllable sized allocations from system
> allocations, introduce kmem_buckets_create(), which behaves like
> kmem_cache_create(). Introduce kmem_buckets_alloc(), which behaves like
> kmem_cache_alloc(). Introduce kmem_buckets_alloc_track_caller() for
> where caller tracking is needed. Introduce kmem_buckets_valloc() for
> cases where vmalloc callback is needed.
> 
> Allows for confining allocations to a dedicated set of sized caches
> (which have the same layout as the kmalloc caches).
> 
> This can also be used in the future to extend codetag allocation
> annotations to implement per-caller allocation cache isolation[1] even
> for dynamic allocations.
> 
> Memory allocation pinning[2] is still needed to plug the Use-After-Free
> cross-allocator weakness, but that is an existing and separate issue
> which is complementary to this improvement. Development continues for
> that feature via the SLAB_VIRTUAL[3] series (which could also provide
> guard pages -- another complementary improvement).
> 
> Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
> Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
> Link: https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@xxxxxxxxxx/ [3]
> Signed-off-by: Kees Cook <keescook@xxxxxxxxxxxx>

So this seems like it's all unconditional and not depending on a config
option? I'd rather see one, as has been done for all similar functionality
before, as not everyone will want this trade-off.

Also AFAIU every new user (patches 5, 6) will add new bunch of lines to
/proc/slabinfo? And when you integrate alloc tagging, do you expect every
callsite will do that as well? Any idea how many there would be and what
kind of memory overhead it will have in the end?