Re: Kernel Oops on alpha with kernel version >=6.9.x

Hi again.

I've run some more tests, this time with an SMP kernel on a system
with just one CPU, which seems to me a somewhat simpler scenario to
analyze. I've added some print statements to smp_call_function_single,
just to see what's really going on:

pr_warn("smp_call_function_single: %llx %llx size=%d\n",
        &csd_stack, &csd, sizeof(call_single_data_t));

The output is shown below:
smp: smp_call_function_single: fffffc000493fc40 fffffc000493fc58 size=32
So the csd_stack struct is 32 bytes in size, but &csd - &csd_stack =
24, i.e. csd starts inside csd_stack. This does not make any sense.
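
To double-check the arithmetic on the two addresses printed above, here
is a small userspace sketch (plain C, not kernel code). The helper names
addr_gap, starts_inside and is_aligned are made up for illustration; only
the two addresses and the size come from the output above.

```c
#include <assert.h>
#include <stdint.h>

/* Byte distance from the start of one object to the start of another. */
static uint64_t addr_gap(uint64_t base, uint64_t other)
{
	return other - base;
}

/* Non-zero if 'other' starts inside the object [base, base + size). */
static int starts_inside(uint64_t base, uint64_t size, uint64_t other)
{
	return other >= base && other - base < size;
}

/* Non-zero if addr is a multiple of align (align must be a power of two). */
static int is_aligned(uint64_t addr, uint64_t align)
{
	return (addr & (align - 1)) == 0;
}

/* Re-checks the values reported by the pr_warn() output above. */
static void check_reported_addresses(void)
{
	const uint64_t csd_stack_addr = 0xfffffc000493fc40ULL; /* &csd_stack */
	const uint64_t csd_addr       = 0xfffffc000493fc58ULL; /* &csd */
	const uint64_t csd_stack_size = 32;                    /* sizeof(call_single_data_t) */

	assert(addr_gap(csd_stack_addr, csd_addr) == 24);
	assert(starts_inside(csd_stack_addr, csd_stack_size, csd_addr));
	assert(is_aligned(csd_stack_addr, 32));
}
```

Note that csd_stack itself does land on a 32-byte boundary in this run; the
problem is that the csd pointer has been placed within its last 8 bytes.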


pr_warn("\n&csd_stack.info=%lx\n&csd=%lx\n",&csd_stack.info,&csd);

The output is:
smp:
&csd_stack.info=fffffc000493fc58
&csd=fffffc000493fc58

Here the csd variable has the same address on the stack as csd_stack.info.
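
This matches where .info sits inside the struct. A rough userspace mirror
of the layout is sketched below; struct demo_node and struct demo_csd are
stand-ins I wrote from memory of the 64-bit kernel definitions, so treat
the field list as an assumption rather than a copy of the kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for struct __call_single_node on a 64-bit kernel. */
struct demo_node {
	void *llist_next;       /* struct llist_node holds one pointer */
	unsigned int u_flags;   /* CSD flags */
	uint16_t src, dst;      /* source/destination CPU */
};

/* Assumed stand-in for struct __call_single_data. */
struct demo_csd {
	struct demo_node node;
	void (*func)(void *);
	void *info;
};

/* Offset of .info from the start of the struct. */
static size_t demo_info_offset(void)
{
	return offsetof(struct demo_csd, info);
}

/* Address at which .info would live if the struct started at 'base'. */
static uint64_t demo_info_addr(uint64_t base)
{
	return base + demo_info_offset();
}
```

On an LP64 target this gives an offset of 24 for .info, so with csd_stack
at fffffc000493fc40 the .info member ends up at fffffc000493fc58, which is
the address reported for both &csd_stack.info and &csd.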

Using the above information and looking at the disassembly of
smp_call_function_single in smp.o, I've put together the following
table mapping out the stack of smp_call_function_single:


$sp+0 ra
$sp+8 s0
$sp+16
$sp+24
$sp+32 csd_stack.node 0xfffffc000493fc40
$sp+40 csd_stack.node 0xfffffc000493fc48
$sp+48 csd_stack.func 0xfffffc000493fc50
$sp+56 csd_stack.info 0xfffffc000493fc58
$sp+64 csd 0xfffffc000493fc58
$sp+72
$sp+80 a3
$sp+88 a2
$sp+96 a0
$sp+104 a1
$sp+112 -


When csd_stack is requested to be aligned using
__attribute__((__aligned__(x))), it seems as if the compiler does not
leave enough room above the csd_stack struct. Since the exact location
of csd_stack depends on the actual value of $sp, it is not known at
compile time, and gcc does not seem to take this into account.
The code works fine if I remove the alignment attribute from csd_stack.
As previously mentioned, declaring csd_stack as "struct
____cacheline_aligned_in_smp" also makes it work, but judging from the
disassembly, this declaration has no effect on the alignment of
csd_stack: it is not explicitly aligned to anything, it is simply
placed on the stack, which indirectly makes it just 16-byte aligned
instead of the requested 32-byte alignment.
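
To illustrate what I mean by "enough room", here is a small sketch of the
arithmetic, assuming the object is realigned at run time by rounding a
16-byte aligned stack slot up to the next 32-byte boundary. This is my
mental model only, not something I have confirmed is what gcc emits; the
helper names and the fc30 slot address are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of align (a power of two). */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
	return (addr + align - 1) & ~(align - 1);
}

/* Padding inserted below the object when 'slot' is rounded up. */
static uint64_t padding_for(uint64_t slot, uint64_t align)
{
	return align_up(slot, align) - slot;
}

/*
 * Bytes the frame has to reserve so that an object of 'size' bytes
 * fits at an 'align'-aligned address, when the slot it starts from
 * is only guaranteed to be 'base_align'-aligned.
 */
static uint64_t worst_case_reserve(uint64_t size, uint64_t align,
				   uint64_t base_align)
{
	return align > base_align ? size + (align - base_align) : size;
}
```

With a 16-byte aligned slot and a 32-byte alignment request the padding is
either 0 or 16 bytes, so a 32-byte object needs 48 bytes reserved. If only
32 bytes were reserved, then in the 16-byte padding case the top of the
object would spill into whatever is stored next, which is the kind of
overlap seen in the table above.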

It seems to me that __attribute__((__aligned__(x))) does not work
correctly with gcc/alpha/linux when it is used to align variables that
reside on the stack.
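
A standalone check along the following lines could help narrow this down
outside the kernel. It is only a sketch: struct demo_payload and
aligned_local_ok are made-up names, and a pass here says nothing about the
kernel build flags. It puts an over-aligned local next to a pointer-sized
local, fills the aligned object, and verifies that the two do not overlap
and that the neighbour survives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 32-byte object standing in for csd_stack. */
struct demo_payload {
	unsigned char bytes[32];
};

/* Returns 1 if the aligned local is placed sanely, 0 otherwise. */
static int aligned_local_ok(void)
{
	void *volatile neighbour = NULL;   /* stands in for the csd pointer */
	struct demo_payload obj __attribute__((__aligned__(32)));
	uintptr_t obj_start, obj_end, nb_start, nb_end;

	/* Write to every byte of the aligned object. */
	memset(&obj, 0xff, sizeof(obj));

	obj_start = (uintptr_t)&obj;
	obj_end = obj_start + sizeof(obj);
	nb_start = (uintptr_t)&neighbour;
	nb_end = nb_start + sizeof(neighbour);

	if (obj_start % 32 != 0)
		return 0;                  /* alignment request not honored */
	if (nb_start < obj_end && obj_start < nb_end)
		return 0;                  /* the two locals overlap */
	return neighbour == NULL;          /* neighbour must be untouched */
}
```

Building this natively on alpha with the same gcc versions and comparing
the result with another architecture would show whether the problem is
reproducible outside of smp.c.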


/Magnus

On Tue, Dec 31, 2024 at 11:43 AM Magnus Lindholm <linmag7@xxxxxxxxx> wrote:
>
> >  Umm, no.  The psABI guarantees 16-byte alignment for the stack pointer,
> > and under this condition (((x - 17) & ~31) + 32 <= x) is guaranteed to be
> > true (except for the overflow case, of course, which does not apply here).
>
> aha! that explains it! thanks, is the psABI available somewhere?
>
> >
> >  Would you be able to trace it back further, e.g. by adding BUG_ON(!node)
> > to `__smp_call_single_queue' and so on if required, to see where this NULL
> > pointer comes from originally?  I do hope such a minimal probe won't
> > disturb code generation enough for this to become a heisenbug.
> >
> Hi, below are some additional tests that I've made,
>
> It seems to me that part of the stack is overwritten with the values
> of other local variables. Previously this affected the return address
> on the stack causing a kernel Oops on function return (see previous
> mail in thread). In this run it seems like when the pointer *csd in
> smp_call_function_single is stored on the stack, it gets overwritten
> by writes to csd_stack.info. The difference here is that I use GCC
> 15.0.0 20241225 (experimental) instead of gcc (Gentoo 14.2.1_p20241116
> p3) 14.2.1. To me, this looks like the same problem but the clobbering
> just hits a different part of the stack. Below is some debug-output,
> where I've added some print statements to the code in
> smp_call_function_single of smp.c. When csd_stack is declared as
> "struct ____cacheline_aligned_in_smp __call_single_data csd_stack",
> this first case is the case where the code works:
>
> unloading the scsi module:
> --------------------------------------------------
> smp:
> &csd_stack.info=fffffc000493fd90
> &csd=fffffc000493fd98
> smp:
> &csd_stack.info=fffffc000493fd90
> &csd=fffffc000493fd98
> sd 6:0:1:0: [sdb] Synchronizing SCSI cache
> rcu: rcu_barrier: cpu=0
> smp:
> &csd_stack.info=fffffc000935bc50
> &csd=fffffc000935bc58
> rcu: rcu_barrier: cpu=1
> smp:
> &csd_stack.info=fffffc000935bc50
> &csd=fffffc000935bc58
> rcu: rcu_barrier: cpu=2
> smp:
> &csd_stack.info=fffffc000935bc50
> &csd=fffffc000935bc58
> smp: generic_exec_single: csd=fffffc000935bc38 cpu=2 smp_cpu=2
>
>
>
> Below is the same debug output when csd_stack is declared as
> "call_single_data_t csd_stack" (i.e. no patch applied). For some
> reason, in this case, the address of the csd variable is the same as
> the address of csd_stack.info. If this is really the case, no wonder
> that a write to csd_stack.info will overwrite the csd pointer. In this
> case the code fails according to below:
>
> unloading the scsi module:
> -----------------------------------------
> smp:
> &csd_stack.info=fffffc000493fd98
> &csd=fffffc000493fd98
> smp: smp_call_function_single: not wait smp_cpu=1
> sd 6:0:1:0: [sdb] Synchronizing SCSI cache
> rcu: rcu_barrier: cpu=0
> smp:
> &csd_stack.info=fffffc0006207c58
> &csd=fffffc0006207c58
> smp: generic_exec_single: csd=fffffc0006207c40 cpu=0 smp_cpu=0
> Unable to handle kernel paging request at virtual address 0000000000000008
> CPU 0
> rmmod(1443): Oops 0
> pc = [<fffffc00003dd564>]  ra = [<fffffc00003dd558>]  ps = 0000    Not tainted
> pc is at smp_call_function_single+0x204/0x220
> ra is at smp_call_function_single+0x1f8/0x220
>
>
>
>
> Below is yet another test, here the code works, csd_stack is declared
> as "call_single_data_t csd_stack" (i.e. no patch applied). In this
> example the code works since I've added some extra "dummy variables"
> on the stack which seems to steer things around enough. Here it's also
> clear that the address of csd does not overlap with the address of
> csd_stack.info. test0 and test1 are just the extra local variables
> that I've added.
>
> -----------------------------------------
> smp:
> &csd_stack.info=fffffc000493fd78
> &csd=fffffc000493fd90
> smp: smp_call_function_single: not wait smp_cpu=1
> smp: &test0=fffffc000493fd98
> smp: &test1=fffffc000493fd88
> sd 6:0:1:0: [sdb] Synchronizing SCSI cache
> rcu: rcu_barrier: cpu=0
> smp:
> &csd_stack.info=fffffc0009e07c38
> &csd=fffffc0009e07c50
> smp: &test0=fffffc0009e07c58
> smp: &test1=fffffc0009e07c48
> smp: generic_exec_single: csd=fffffc0009e07c20 cpu=0 smp_cpu=0
>
>
>
>
> Patch I used to "fix" kernel/smp.c
> ----------------------------------------------------
> +++ kernel/smp.c        2024-12-19 19:01:20.592819628 +0100
> @@ -631,7 +631,7 @@
> int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
>                              int wait)
>  {
>         call_single_data_t *csd;
> -       call_single_data_t csd_stack = {
> +       struct ____cacheline_aligned_in_smp __call_single_data csd_stack = {
>                 .node = { .u_flags = CSD_FLAG_LOCK | CSD_TYPE_SYNC, },
>         };
>         int this_cpu;
>
>
>
> /Magnus




