On Fri, Nov 1, 2024 at 10:58 PM Vlastimil Babka <vbabka@xxxxxxx> wrote: > > On 11/1/24 14:15, Hyeonggon Yoo wrote: > > On Fri, Nov 1, 2024 at 10:08 PM Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx> wrote: > >> > >> When kobject_init_and_add() fails during cache creation, > >> kobj->name can be leaked because SLUB does not call kobject_put(), > >> which should be invoked per the kobject API documentation. > >> This has a bit of historical context, though; SLUB does not call > >> kobject_put() to avoid double-free for struct kmem_cache because > >> 1) simply calling it would free all resources related to the cache, and > >> 2) struct kmem_cache descriptor is always freed by cache_cache()'s > >> error handling path, causing struct kmem_cache to be freed twice. > >> > >> This issue can be reproduced by creating new slab caches while applying > >> failslab for kernfs_node_cache. This makes kobject_add_varg() succeed, > >> but causes kobject_add_internal() to fail in kobject_init_and_add() > >> during cache creation. > >> > >> Historically, this issue has attracted developers' attention several times. > >> Each time a fix addressed either the leak or the double-free, > >> it caused the other issue. Let's summarize a bit of history here: > >> > >> The leak has existed since the early days of SLUB. > >> > >> Commit 54b6a731025f ("slub: fix leak of 'name' in sysfs_slab_add") > >> introduced a double-free bug while fixing the leak. > >> > >> Commit 80da026a8e5d ("mm/slub: fix slab double-free in case of duplicate > >> sysfs filename") re-introduced the leak while fixing the double-free > >> error. > >> > >> Commit dde3c6b72a16 ("mm/slub: fix a memory leak in sysfs_slab_add()") > >> fixed the memory leak, but it was later reverted by commit 757fed1d0898 > >> ("Revert "mm/slub: fix a memory leak in sysfs_slab_add()"") to avoid > >> the double-free error. > >> > >> This is where we are now: we've chosen a memory leak over a double-free. > >> > >> To resolve this memory leak, skip creating sysfs files if it fails > >> and continue with cache creation regardless (as suggested by Christoph). > >> This resolves the memory leak because both the cache and the kobject > >> remain alive on kobject_init_and_add() failure. > >> > >> If SLUB tries to create an alias for a cache without sysfs files, > >> its symbolic link will not be generated. > >> > >> Since a slab cache might not have associated sysfs files, call kobject_del() > >> only if such files exist. > >> > >> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@xxxxxxxxx> > > Thanks, added to slab/for-next > > > > > [+Cc Jinjie, Liu] > > > >> --- > >> > >> RFC -> v1: Make sysfs optional instead of destroying the cache on sysfs > >> errors. (Suggested by Christoph) > > > > RFC version for reference: > > https://lore.kernel.org/linux-mm/20241021091413.154775-1-42.hyeyoo@xxxxxxxxx/ > > > >> mm/slub.c | 21 +++++++++++++++++---- > >> 1 file changed, 17 insertions(+), 4 deletions(-) > >> > >> diff --git a/mm/slub.c b/mm/slub.c > >> index 151a987dc3a0..b4b211468f77 100644 > >> --- a/mm/slub.c > >> +++ b/mm/slub.c > >> @@ -6116,7 +6116,8 @@ __kmem_cache_alias(const char *name, unsigned int size, unsigned int align, > >> s = find_mergeable(size, align, flags, name, ctor); > >> if (s) { > >> if (sysfs_slab_alias(s, name)) > >> - return NULL; > >> + pr_err("SLUB: Unable to add slab alias %s to sysfs\n", > > I've changed "slab" to "cache". > > >> + name); > >> > >> s->refcount++; > >> > >> @@ -6204,9 +6205,15 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name, > >> goto out; > >> } > >> > >> + /* > >> + * Failing to create sysfs files is not critical to SLUB functionality. > >> + * If it fails, proceed with cache creation without these files. > >> + */ > >> err = sysfs_slab_add(s); > >> - if (err) > >> - goto out; > >> + if (err) { > >> + err = 0; > >> + pr_err("SLUB: Unable to add slab %s to sysfs\n", s->name); > > Also here, and simplified to "if (sysfs_slab_add(s)) ... " to avoid dealing > with err. Oh no. err is initialized to -EINVAL, so that will not work as intended. It is causing the following list corruption. [ 0.607833] __kmem_cache_create_args(fscrypt_inode_info) failed with error -22 [ 0.608518] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.0-rc2+ #63 [ 0.609181] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014 [ 0.610000] Call Trace: [ 0.610233] <TASK> [ 0.610433] dump_stack_lvl+0x64/0x80 [ 0.610806] __kmem_cache_create_args+0x1eb/0x280 [ 0.611253] ? __pfx_fscrypt_init+0x10/0x10 [ 0.611647] fscrypt_init+0x88/0xf0 [ 0.611980] ? __pfx_fscrypt_init+0x10/0x10 [ 0.612373] do_one_initcall+0x5b/0x320 [ 0.612736] kernel_init_freeable+0x351/0x510 [ 0.613150] ? __pfx_kernel_init+0x10/0x10 [ 0.613536] kernel_init+0x1a/0x1d0 [ 0.613865] ret_from_fork+0x34/0x50 [ 0.614207] ? __pfx_kernel_init+0x10/0x10 [ 0.614591] ret_from_fork_asm+0x1a/0x30 [ 0.614968] </TASK> [ 0.615203] list_add corruption. prev->next should be next (ffff986bc2bb9aa0), but was ffff986bc2b6. [ 0.616308] ------------[ cut here ]------------ [ 0.616746] kernel BUG at lib/list_debug.c:32! [ 0.617173] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 0.617709] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12.0-rc2+ #63 [ 0.618372] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014 [ 0.619192] RIP: 0010:__list_add_valid_or_report+0x78/0xa0 [ 0.619714] Code: 8b ff 0f 0b 48 89 c1 48 c7 c7 50 ca c0 9a e8 2f f4 8b ff 0f 0b 48 89 d1 48 89 c6 b [ 0.621473] RSP: 0018:ffffa47380013c98 EFLAGS: 00010246 [ 0.621969] RAX: 0000000000000075 RBX: ffff986bc2b63238 RCX: ffffffff9b5646a8 [ 0.622638] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001 [ 0.623313] RBP: ffff986bc2bb9ab8 R08: 0000000000000000 R09: 205d333032353136 [ 0.623996] R10: 74707572726f6320 R11: 6464615f7473696c R12: ffff986bc2bb9aa0 [ 0.624673] R13: ffff986bc2b63240 R14: ffff986bc2b63240 R15: ffff986bc2b631c0 [ 0.625355] FS: 0000000000000000(0000) GS:ffff986bdf400000(0000) knlGS:0000000000000000 [ 0.626111] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.626653] CR2: ffff986bd3801000 CR3: 000000001242a000 CR4: 00000000000006f0 [ 0.627325] Call Trace: [ 0.627559] <TASK> [ 0.627760] ? die+0x36/0x90 [ 0.628039] ? do_trap+0xdd/0x100 [ 0.628354] ? __list_add_valid_or_report+0x78/0xa0 [ 0.628814] ? do_error_trap+0x6a/0x90 [ 0.629172] ? __list_add_valid_or_report+0x78/0xa0 [ 0.629633] ? exc_invalid_op+0x50/0x70 [ 0.630001] ? __list_add_valid_or_report+0x78/0xa0 [ 0.630463] ? asm_exc_invalid_op+0x1a/0x20 [ 0.630861] ? __list_add_valid_or_report+0x78/0xa0 [ 0.631324] ? __list_add_valid_or_report+0x78/0xa0 [ 0.631784] kobject_add_internal+0x78/0x2a0 [ 0.632192] kobject_init_and_add+0x8c/0xd0 [ 0.632589] ? kernfs_find_ns+0x35/0xc0 [ 0.632957] sysfs_slab_add+0x193/0x1e0 [ 0.633318] do_kmem_cache_create+0x455/0x630 [ 0.633727] __kmem_cache_create_args+0x157/0x280 [ 0.634176] ? __pfx_fsverity_init+0x10/0x10 [ 0.634586] fsverity_init_info_cache+0x66/0x90 [ 0.635022] fsverity_init+0x13/0x40 [ 0.635365] do_one_initcall+0x5b/0x320 [ 0.635734] kernel_init_freeable+0x351/0x510 [ 0.636154] ? __pfx_kernel_init+0x10/0x10 [ 0.636547] kernel_init+0x1a/0x1d0 [ 0.636881] ret_from_fork+0x34/0x50 [ 0.637227] ? __pfx_kernel_init+0x10/0x10 [ 0.637619] ret_from_fork_asm+0x1a/0x30 [ 0.637998] </TASK> [ 0.638211] Modules linked in: [ 0.638512] ---[ end trace 0000000000000000 ]--- [ 0.638962] RIP: 0010:__list_add_valid_or_report+0x78/0xa0 [ 0.639483] Code: 8b ff 0f 0b 48 89 c1 48 c7 c7 50 ca c0 9a e8 2f f4 8b ff 0f 0b 48 89 d1 48 89 c6 b [ 0.641253] RSP: 0018:ffffa47380013c98 EFLAGS: 00010246 [ 0.641753] RAX: 0000000000000075 RBX: ffff986bc2b63238 RCX: ffffffff9b5646a8 [ 0.642431] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001 [ 0.643118] RBP: ffff986bc2bb9ab8 R08: 0000000000000000 R09: 205d333032353136 [ 0.643795] R10: 74707572726f6320 R11: 6464615f7473696c R12: ffff986bc2bb9aa0 [ 0.644471] R13: ffff986bc2b63240 R14: ffff986bc2b63240 R15: ffff986bc2b631c0 [ 0.645152] FS: 0000000000000000(0000) GS:ffff986bdf400000(0000) knlGS:0000000000000000 [ 0.645923] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 0.646470] CR2: ffff986bd3801000 CR3: 000000001242a000 CR4: 00000000000006f0 [ 0.647156] note: swapper/0[1] exited with preempt_count 1 [ 0.647686] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b [ 0.648446] Kernel Offset: 0x18000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000) [ 0.649466] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]--- > >> + } > >> > >> if (s->flags & SLAB_STORE_USER) > >> debugfs_slab_add(s); > >> @@ -7276,7 +7283,8 @@ static int sysfs_slab_add(struct kmem_cache *s) > >> > >> void sysfs_slab_unlink(struct kmem_cache *s) > >> { > >> - kobject_del(&s->kobj); > >> + if (s->kobj.state_in_sysfs) > >> + kobject_del(&s->kobj); > >> } > >> > >> void sysfs_slab_release(struct kmem_cache *s) > >> @@ -7305,6 +7313,11 @@ static int sysfs_slab_alias(struct kmem_cache *s, const char *name) > >> * If we have a leftover link then remove it. > >> */ > >> sysfs_remove_link(&slab_kset->kobj, name); > >> + /* > >> + * The original cache may have failed to generate sysfs file. > >> + * In that case, sysfs_create_link() returns -ENOENT and > >> + * symbolic link creation is skipped. > >> + */ > >> return sysfs_create_link(&slab_kset->kobj, &s->kobj, name); > >> } > >> > >> -- > >> 2.45.0 > >> >