Re: [PATCH 6.10 000/809] 6.10.3-rc3 review

Vlastimil Babka <vbabka@xxxxxxx> · Tue, 6 Aug 2024 13:02:45 +0200

On 8/6/24 04:40, Linus Torvalds wrote:
> [ Let's drop random people and bring in Vlastimil ]

tglx was reproducing it, so I'm adding him back.

> Vlastimil,
>  it turns out that the "this patch" is entirely a red herring, and the
> problem comes and goes randomly with just some code layout issues. See
> 
>    http://server.roeck-us.net/qemu/parisc64-6.10.3/
> 
> for more detail, particularly you'll see the "log.bad.gz" with the full log.

[    0.000000] BUG kmem_cache_node (Not tainted): objects 21 > max 16
[    0.000000] Slab 0x0000000041ed0000 objects=21 used=5 fp=0x00000000434003d0 flags=0x200(workingset|section=0|zone=0)

The flags tell us this came from the partial list (workingset); there's no head flag, so it's order-0.

Since the error was detected, it basically throws the slab page away and tries another one:

[    0.000000] BUG kmem_cache (Tainted: G    B             ): objects 25 > max 16
[    0.000000] Slab 0x0000000041ed0080 objects=25 used=6 fp=0x0000000043402790 flags=0x240(workingset|head|section=0|zone=0)

This was also from the partial list, but it has the head flag, so it's at least order-1. Two things are weird:
- max=16 is the same as above, even though it should be at least double, as
the slab page's order is larger
- objects=25 also isn't at least twice objects=21
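
To spell out the "at least double" expectation: as far as I recall, order_objects() in mm/slub.c is just the slab's size in bytes divided by the object size. Below is a minimal userspace sketch of that, not the kernel code itself. PAGE_SIZE_ASSUMED and both function names are placeholders of mine, and the 4K page size is an assumption rather than something taken from the failing config.

```c
#include <assert.h>

/* Assumption: 4K base pages; the failing config may differ. */
#define PAGE_SIZE_ASSUMED 4096u

/* Userspace stand-in for order_objects(): how many objects of 'size'
 * bytes fit in a slab made of 2^order pages. */
static unsigned int order_objects_sketch(unsigned int order, unsigned int size)
{
	assert(size != 0);
	return (PAGE_SIZE_ASSUMED << order) / size;
}

/* For a fixed object size, going up one order at least doubles the
 * capacity, because floor(2B / s) >= 2 * floor(B / s). */
static int capacity_at_least_doubles(unsigned int order, unsigned int size)
{
	return order_objects_sketch(order + 1, size) >=
	       2 * order_objects_sketch(order, size);
}
```

With size=256, for example, this gives 16 objects at order-0 and 32 at order-1. Note the doubling only holds for a fixed object size; the two reports above are for different caches (kmem_cache_node and kmem_cache), so their s->size values may well differ.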

All the following are:
[    0.000000] BUG kmem_cache (Tainted: G    B             ): objects 25 > max 16
[    0.000000] Slab 0x0000000041ed0300 objects=25 used=1 fp=0x000000004340c150 flags=0x40(head|section=0|zone=0)

We depleted the partial list, so it's allocating new slab pages, which are
also at least order-1.
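
For reference, the flag reading used for the three reports above can be sketched as below. The bit values are simply read off the names printed in this log (0x200 = workingset, 0x40 = head); page flag positions depend on the kernel config, so the constants and helper names here are illustrative only and not kernel definitions.

```c
#include <stdbool.h>

/* Bit values as printed in this particular log; they are config
 * dependent and only valid for this build. */
#define FLAG_HEAD_SEEN        0x40ul
#define FLAG_WORKINGSET_SEEN  0x200ul

/* Head flag set: compound page, so the slab is at least order-1. */
static bool slab_is_compound(unsigned long flags)
{
	return (flags & FLAG_HEAD_SEEN) != 0;
}

/* Workingset flag set: per the reading above, the slab came from the
 * partial list. */
static bool slab_from_partial_list(unsigned long flags)
{
	return (flags & FLAG_WORKINGSET_SEEN) != 0;
}
```

So 0x200 is an order-0 slab from the partial list, 0x240 is a compound slab from the partial list, and 0x40 is a freshly allocated compound slab.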

It looks like the maxobj calculation is bogus; it would be useful to see what values it's
calculated from. I'm attaching a diff, but maybe it will also hide the issue...

If someone has a /proc/slabinfo from a working boot with an otherwise identical config,
it might also be enough to guess what values should be expected there,
at least the s->size.
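
As a rough guide to what to look for, the printed max can be inverted to get the range of object sizes that would produce it. This is only a sketch: it assumes maxobj is the slab's byte size divided by s->size and that pages are 4K, and the struct and function names are made up for illustration.

```c
#include <assert.h>

/* Assumption: 4K base pages; adjust if the failing config differs. */
#define PAGE_SIZE_ASSUMED 4096u

struct size_range {
	unsigned int lo;	/* smallest size giving exactly 'maxobj' */
	unsigned int hi;	/* largest size giving exactly 'maxobj' */
};

/* Invert maxobj = (PAGE_SIZE << order) / size for integer sizes:
 * floor(B / s) == m  <=>  B / (m + 1) < s <= B / m.
 * If lo > hi, no integer size yields that maxobj. */
static struct size_range sizes_for_maxobj(unsigned int order, unsigned int maxobj)
{
	unsigned int bytes = PAGE_SIZE_ASSUMED << order;
	struct size_range r;

	assert(maxobj != 0);
	r.lo = bytes / (maxobj + 1) + 1;
	r.hi = bytes / maxobj;
	return r;
}
```

Under the 4K assumption, max 16 corresponds to a size of 241..256 at order-0 and 482..512 at order-1, which is the kind of value to compare against a slabinfo from a good boot.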

objects=21 vs 25 also seem odd though

used=5 and used=6 in the first two reports also suggest we already passed this code
successfully while creating a number of kmalloc caches, and only then did it start
failing, which is also weird.

> See also
> 
>    https://lore.kernel.org/all/87y15a4p4h.ffs@tglx/
> 
> for this thread.
> 
> I don't think this is really a slub issue, since it only happens on
> parisc, but maybe you can see what would make parisc different, and
> what could possibly make it all timing- or layout-dependent.
> 
>                  Linus
> 
> On Sun, 4 Aug 2024 at 11:36, Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
>>
>> With this patch in v6.10.3, all my parisc64 qemu tests get stuck with repeated error messages
>>
>> [    0.000000] =============================================================================
>> [    0.000000] BUG kmem_cache_node (Not tainted): objects 21 > max 16
>> [    0.000000] -----------------------------------------------------------------------------
>>
>> This never stops until the emulation aborts.

diff --git a/mm/slub.c b/mm/slub.c
index 4927edec6a8c..ec4ed5215f2f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1386,8 +1386,8 @@ static int check_slab(struct kmem_cache *s, struct slab *slab)
 
 	maxobj = order_objects(slab_order(slab), s->size);
 	if (slab->objects > maxobj) {
-		slab_err(s, slab, "objects %u > max %u",
-			slab->objects, maxobj);
+		slab_err(s, slab, "objects %u > max %u (order %d size %u)",
+			slab->objects, maxobj, slab_order(slab), s->size);
 		return 0;
 	}
 	if (slab->inuse > slab->objects) {