Re: Kernel Oops on alpha with kernel version >=6.9.x


 



>  Umm, no.  The psABI guarantees 16-byte alignment for the stack pointer,
> and under this condition (((x - 17) & ~31) + 32 <= x) is guaranteed to be
> true (except for the overflow case, of course, which does not apply here).

Aha, that explains it! Thanks. Is the psABI available somewhere?
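
As a sanity check on the quoted inequality, here is a minimal sketch that evaluates it for stack pointer values. The function names are made up for illustration, and the code only checks the arithmetic; it says nothing about what code the compiler actually emits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The expression from the quoted paragraph, for a stack pointer value x. */
static uint64_t rounded(uint64_t x)
{
    return ((x - 17) & ~(uint64_t)31) + 32;
}

/* True if the quoted inequality ((x - 17) & ~31) + 32 <= x holds for x. */
static bool inequality_holds(uint64_t x)
{
    return rounded(x) <= x;
}

/*
 * Check every 16-byte-aligned x in [start, start + span).
 * start must itself be 16-byte aligned; overflow is not considered.
 */
static bool holds_for_all_16_aligned(uint64_t start, uint64_t span)
{
    assert((start & 15) == 0);
    for (uint64_t x = start; x < start + span; x += 16)
        if (!inequality_holds(x))
            return false;
    return true;
}
```

With x a multiple of 16 the check passes for the whole range. A value that is only 8-byte aligned, such as one ending in 0x98 (x mod 32 == 24), makes the inequality false, so the guarantee really does depend on the 16-byte alignment.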

>
>  Would you be able to trace it back further, e.g. by adding BUG_ON(!node)
> to `__smp_call_single_queue' and so on if required, to see where this NULL
> pointer comes from originally?  I do hope such a minimal probe won't
> disturb code generation enough for this to become a heisenbug.
>
Hi, below are some additional tests that I've made.

It seems to me that part of the stack is overwritten with the values
of other local variables. Previously this affected the return address
on the stack, causing a kernel Oops on function return (see previous
mail in thread). In this run it seems that when the csd pointer in
smp_call_function_single is stored on the stack, it gets overwritten
by writes to csd_stack.info. The difference here is that I use GCC
15.0.0 20241225 (experimental) instead of gcc (Gentoo 14.2.1_p20241116
p3) 14.2.1. To me, this looks like the same problem, but the clobbering
just hits a different part of the stack. Below is some debug output,
where I've added some print statements to the code in
smp_call_function_single in kernel/smp.c. In this first case csd_stack
is declared as
"struct ____cacheline_aligned_in_smp __call_single_data csd_stack",
and the code works:

unloading the scsi module:
--------------------------------------------------
smp:
&csd_stack.info=fffffc000493fd90
&csd=fffffc000493fd98
smp:
&csd_stack.info=fffffc000493fd90
&csd=fffffc000493fd98
sd 6:0:1:0: [sdb] Synchronizing SCSI cache
rcu: rcu_barrier: cpu=0
smp:
&csd_stack.info=fffffc000935bc50
&csd=fffffc000935bc58
rcu: rcu_barrier: cpu=1
smp:
&csd_stack.info=fffffc000935bc50
&csd=fffffc000935bc58
rcu: rcu_barrier: cpu=2
smp:
&csd_stack.info=fffffc000935bc50
&csd=fffffc000935bc58
smp: generic_exec_single: csd=fffffc000935bc38 cpu=2 smp_cpu=2



Below is the same debug output when csd_stack is declared as
"call_single_data_t csd_stack" (i.e. no patch applied). For some
reason, in this case, the address of the csd variable is the same as
the address of csd_stack.info. If this is really the case, it is no
wonder that a write to csd_stack.info overwrites the csd pointer. In
this case the code fails as shown below:

unloading the scsi module:
-----------------------------------------
smp:
&csd_stack.info=fffffc000493fd98
&csd=fffffc000493fd98
smp: smp_call_function_single: not wait smp_cpu=1
sd 6:0:1:0: [sdb] Synchronizing SCSI cache
rcu: rcu_barrier: cpu=0
smp:
&csd_stack.info=fffffc0006207c58
&csd=fffffc0006207c58
smp: generic_exec_single: csd=fffffc0006207c40 cpu=0 smp_cpu=0
Unable to handle kernel paging request at virtual address 0000000000000008
CPU 0
rmmod(1443): Oops 0
pc = [<fffffc00003dd564>]  ra = [<fffffc00003dd558>]  ps = 0000    Not tainted
pc is at smp_call_function_single+0x204/0x220
ra is at smp_call_function_single+0x1f8/0x220




Below is yet another test. Here csd_stack is again declared as
"call_single_data_t csd_stack" (i.e. no patch applied), but the code
works because I've added some extra "dummy variables" on the stack,
which seem to shift things around enough. Here it's also clear that
the address of csd does not overlap with the address of
csd_stack.info. test0 and test1 are just the extra local variables
that I've added.

-----------------------------------------
smp:
&csd_stack.info=fffffc000493fd78
&csd=fffffc000493fd90
smp: smp_call_function_single: not wait smp_cpu=1
smp: &test0=fffffc000493fd98
smp: &test1=fffffc000493fd88
sd 6:0:1:0: [sdb] Synchronizing SCSI cache
rcu: rcu_barrier: cpu=0
smp:
&csd_stack.info=fffffc0009e07c38
&csd=fffffc0009e07c50
smp: &test0=fffffc0009e07c58
smp: &test1=fffffc0009e07c48
smp: generic_exec_single: csd=fffffc0009e07c20 cpu=0 smp_cpu=0
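
To make the comparison between the three runs explicit, here is a small sketch that does the address arithmetic on the values printed above. It assumes 8-byte stack slots for both the csd pointer and csd_stack.info, and it assumes that the csd value printed by generic_exec_single is the address of csd_stack; neither is confirmed by the logs themselves. The helper names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define SLOT_SIZE 8u /* assumption: 64-bit pointers, 8-byte stack slots */

/*
 * Distance from the csd value printed by generic_exec_single to
 * &csd_stack.info. If that csd is &csd_stack (an assumption), this is
 * the offset of the info member inside the struct.
 */
static uint64_t info_offset(uint64_t info_addr, uint64_t exec_csd_addr)
{
    return info_addr - exec_csd_addr;
}

/* True if the slot holding the csd pointer overlaps csd_stack.info. */
static bool csd_slot_overlaps_info(uint64_t info_addr, uint64_t csd_slot_addr)
{
    return info_addr < csd_slot_addr + SLOT_SIZE &&
           csd_slot_addr < info_addr + SLOT_SIZE;
}
```

Fed with the logged addresses, the offset comes out as 0x18 in all three runs, and only the unpatched run without the dummy variables reports an overlap between the csd slot and csd_stack.info.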




Patch I used to "fix" kernel/smp.c
----------------------------------------------------
+++ kernel/smp.c        2024-12-19 19:01:20.592819628 +0100
@@ -631,7 +631,7 @@
 int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
                             int wait)
 {
        call_single_data_t *csd;
-       call_single_data_t csd_stack = {
+       struct ____cacheline_aligned_in_smp __call_single_data csd_stack = {
                .node = { .u_flags = CSD_FLAG_LOCK | CSD_TYPE_SYNC, },
        };
        int this_cpu;



/Magnus



