On Fri, 27 Dec 2024, Magnus Lindholm wrote: > The code generated by gcc from smp.c reserves 96 bytes of stack space > but places csd_stack struct on $sp+79. Since sizeof csd_stack is 32 > bytes, it seems to me that [($sp+79) & NOT(0x1f) + > sizeof(call_single_data_t)] might be greater than "96+$sp" if say, bit > 3 and 4 are set in $sp? or am I missing something here? Umm, no. The psABI guarantees 16-byte alignment for the stack pointer, and under this condition (((x - 17) & ~31) + 32 <= x) is guaranteed to be true (except for the overflow case, of course, which does not apply here). > --------------------------- > lda sp,-96(sp) > ... > lda s0,79(sp) > ... > andnot s0,0x1f,s0 > ... > stq zero,8(s0) > stq zero,0(s0) > stq zero,16(s0) > stq zero,24(s0) > stl t0,8(s0) [.node = CSD_FLAG_LOCK | CSD_TYPE_SYNC] So `csd_stack' isn't at ($old_sp + 79) but rather (($old_sp + 79) & ~31), which is just a way to guarantee 32-byte alignment for the data object where the psABI only guarantees 16-byte alignment for stack allocations. > 2) > Using cacheline_aligned_in_smp when declaring csd_stack in > smp_call_function_single will actually reserve less stack space (80 > bytes in stead of 96), csd_stack is referenced directly using $sp. > Maybe alignment is just a way to simplify things for gcc and avoid > hitting compiler bugs? For the Alpha port generic definitions are used, so I wouldn't draw such conclusions. It's just how things work out by default, and objects of the `call_single_data_t' type are aligned to their size, which is 32, while ____cacheline_aligned_in_smp requests alignment to either 32 or 64 bytes, depending on the kernel configuration... > -------------------------- > lda sp,-80(sp) > stq zero,48(sp) > stq zero,64(sp) > stq zero,72(sp) > stl t0,56(sp) [.node = CSD_FLAG_LOCK | CSD_TYPE_SYNC] ... so I don't really know why it causes the alignment to be decreased here back to the psABI value of 16 bytes. > I've made numerous attempts with different versions of GCC, including > the most recent git version (with and without the patches from Maciej) > and they give similar results, even though the exact amount of > stackspace reserved, registers used, and placement of csd_stack struct > will differ somewhat. (GCC) 15.0.0 20241225, with Maciej patches > applied, will produce the code below: > > lda sp,-112(sp) > lda t1,47(sp) > andnot t1,0x1f,t1 > ... > > Which boots and lets me load/unload my scsi kernel moduel, but just > adding some debug print statement to smp_call_function_single will > again give a kernel null pointer exception. Printing the value of &csd > seems to allocate space for csd on the stack instead of keeping it in > registers which will later trigger a null pointer excepting when > accessed. To me it seems like this just moves around the stack > clobbering problem? A code generation bug cannot be ruled out of course. But it might be an ISO C compliance bug too that just happens to trigger for this particular scenario. > CPU 1 > rmmod(1444): Oops 1 > pc = [<fffffc000078e818>] ra = [<fffffc00003dd0f8>] ps = 0000 Not tainted > pc is at llist_add_batch+0x8/0x50 > ra is at __smp_call_single_queue+0x38/0xa0 > v0 = 0000000000000000 t0 = fffffc0000e2b100 t1 = fffffc0000ec4048 > t2 = 0000000000000000 t3 = fffffc0000ec4048 t4 = 0000000000000000 > t5 = 0000000000000001 t6 = ffffffffffffffec t7 = fffffc0005d4c000 > s0 = 0000000000000000 s1 = 0000000000000001 s2 = 0000000000000001 > s3 = 0000000000000001 s4 = fffffc0000cd0330 s5 = fffffc000020ee80 > s6 = 00000200010422a0 > a0 = 0000000000000000 a1 = 0000000000000000 a2 = fffffc000020f100 > a3 = fffffc0005d4fa28 a4 = ffff1020ffffff00 a5 = 0000000000000000 > t8 = 0000000000000001 t9 = 0000000000000001 t10= 0000000000000000 > t11= 0000000000000000 pv = fffffc000078e810 at = 0000000000000000 > gp = fffffc0000e9c980 sp = 00000000905861a6 > Disabling lock debugging due to kernel taint > Trace: > [<fffffc00003dd1bc>] generic_exec_single+0x5c/0x150 > [<fffffc00003dd3ec>] smp_call_function_single+0x13c/0x220 > [<fffffc000082ceec>] device_release+0x3c/0xf0 > [<fffffc00003ae178>] rcu_barrier+0x1b8/0x4d0 > [<fffffc00003aaa30>] rcu_barrier_handler+0x0/0x120 > [<fffffc00003aaa30>] rcu_barrier_handler+0x0/0x120 > [<fffffc0000858418>] scsi_host_dev_release+0x58/0x170 > [<fffffc000082cf04>] device_release+0x54/0xf0 > [<fffffc0000b501f0>] kobject_put+0x90/0x1b0 > [<fffffc000082d0fc>] put_device+0x1c/0x30 > [<fffffc00008583ac>] scsi_host_put+0x1c/0x30 > [<fffffc00007b9694>] pci_device_remove+0x34/0x90 > [<fffffc0000838284>] device_remove+0x64/0xb0 > [<fffffc0000839d24>] device_release_driver_internal+0x284/0x370 > [<fffffc0000839ecc>] driver_detach+0x7c/0x110 > [<fffffc00008377e8>] bus_remove_driver+0x98/0x160 > [<fffffc000083a754>] driver_unregister+0x44/0xa0 > [<fffffc00007b94e8>] pci_unregister_driver+0x38/0xd0 > [<fffffc00003be264>] sys_delete_module+0x174/0x2f0 > [<fffffc000031095c>] entMM+0x9c/0xc0 > [<fffffc0000310d04>] entSys+0xa4/0xc0 > [<fffffc0000310d04>] entSys+0xa4/0xc0 > > Code: > f43ffffb > 6bfa8001 > 47ff041f > 2ffe0000 > a4120000 <--- ldq v0,0(a2) (*first = READ_ONCE(head->first);) > 60004000 <--- mb > <b4110000> <--- stq v0,0(a1) (new_last->next = first;) > 60004000 So `__smp_call_single_queue' is called with NULL `node'. Would you be able to trace it back further, e.g. by adding BUG_ON(!node) to `__smp_call_single_queue' and so on if required, to see where this NULL pointer comes from originally? I do hope such a minimal probe won't disturb code generation enough for this to become a heisenbug. Maciej