Hi

On Tue, Apr 25, 2023 at 9:40 PM Christophe Leroy
<christophe.leroy@xxxxxxxxxx> wrote:
>
>
>
> On 25/04/2023 at 13:06, Joel Fernandes wrote:
> > On Tue, Apr 25, 2023 at 6:58 AM Zhouyi Zhou <zhouzhouyi@xxxxxxxxx> wrote:
> >>
> >> hi
> >>
> >> On Tue, Apr 25, 2023 at 6:13 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >>>
> >>> On Mon, Apr 24, 2023 at 02:55:11PM -0400, Joel Fernandes wrote:
> >>>> This is amazing debugging Boqun, like a boss! One comment below:
> >>>>
> >>>>>>> Or something simple I haven't thought of? :)
> >>>>>>
> >>>>>> At what points can r13 change? Only when some particular functions are
> >>>>>> called?
> >>>>>>
> >>>>>
> >>>>> r13 is the local paca:
> >>>>>
> >>>>>     register struct paca_struct *local_paca asm("r13");
> >>>>>
> >>>>> , which is a pointer to percpu data.
> >>>>>
> >>>>> So if a task is scheduled from one CPU to another CPU, the value gets
> >>>>> changed.
> >>>>
> >>>> It appears the whole issue, per your analysis, is that the stack
> >>>> checking code in gcc should not cache or alias r13, and must read its
> >>>> most up-to-date value during stack checking, as its value may have
> >>>> changed during a migration to a new CPU.
> >>>>
> >>>> Did I get that right?
> >>>>
> >>>> IMO, even without a reproducer, gcc on PPC should just not do that,
> >>>> that feels terribly broken for the kernel. I wonder what clang does,
> >>>> I'll go poke around with compilerexplorer after lunch.
> >>>>
> >>>> Adding +Peter Zijlstra as well to join the party as I have a feeling
> >>>> he'll be interested. ;-)
> >>>
> >>> I'm a little confused; the way I understand the whole stack protector
> >>> thing to work is that we push a canary on the stack at call and on
> >>> return check it is still valid. Since in general tasks randomly migrate,
> >>> the per-cpu validation canary should be the same on all CPUs.
> >>>
> >>> Additionally, the 'new' __srcu_read_{,un}lock_nmisafe() functions use
> >>> raw_cpu_ptr() to get 'a' percpu sdp, preferably that of the local cpu,
> >>> but no guarantees.
> >>>
> >>> Both cases use r13 (paca) in a racy manner, and in both cases it should
> >>> be safe.
> >> New test results today: both gcc built from git (git clone
> >> git://gcc.gnu.org/git/gcc.git) and Ubuntu 22.04 gcc-12.1.0
> >> are immune to the above issue. We can see the assembly code on
> >> http://140.211.169.189/0425/srcu_gp_start_if_needed-gcc-12.txt
> >>
> >> while both native gcc on PPC vm (gcc version 9.4.0), and gcc cross compiler
> >> on my x86 laptop (gcc version 10.4.0) will reproduce the bug.
> >
> > Do you know what fixes the issue? I would not declare victory yet. My
> > feeling is something changes in timing, or compiler codegen which
> > hides the issue. So the issue is still there but it is just a matter
> > of time before someone else reports it.
> >
> > Out of curiosity for PPC folks, why cannot 64-bit PPC use per-task
> > canary? Michael, is this an optimization? Adding Christophe as well
> > since it came in a few years ago via the following commit:
>
> It uses per-task canary. But unlike PPC32, PPC64 doesn't have a fixed
> register pointing to 'current' at all times, so the canary is copied into
> a per-cpu struct during _switch().
>
> If GCC keeps an old value of the per-cpu struct pointer, it then gets
> the canary from the wrong CPU struct, and so from a different task.

This is a fruitful learning process for me!
Christophe:
Do you think there is still a need to bisect GCC?
If so, I am very glad to continue.

Cheers
Zhouyi
>
> Christophe
>
> >
> > commit 06ec27aea9fc84d9c6d879eb64b5bcf28a8a1eb7
> > Author: Christophe Leroy <christophe.leroy@xxxxxx>
> > Date:   Thu Sep 27 07:05:55 2018 +0000
> >
> >     powerpc/64: add stack protector support
> >
> >     On PPC64, as register r13 points to the paca_struct at all time,
> >     this patch adds a copy of the canary there, which is copied at
> >     task_switch.
> >     That new canary is then used by using the following GCC options:
> >     -mstack-protector-guard=tls
> >     -mstack-protector-guard-reg=r13
> >     -mstack-protector-guard-offset=offsetof(struct paca_struct, canary))
> >
> >     Signed-off-by: Christophe Leroy <christophe.leroy@xxxxxx>
> >     Signed-off-by: Michael Ellerman <mpe@xxxxxxxxxxxxxx>
> >
> > - Joel
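
To make the failure mode Christophe describes concrete, here is a minimal
user-space sketch in C. It is only a model under stated assumptions, not
kernel code: struct fake_paca, switch_in(), run_on(), check_fresh(),
check_cached() and simulate_migration() are hypothetical names made up for
illustration, a plain global pointer stands in for r13, and the "migration"
is just a sequence of assignments. It shows why a check that reuses a
per-cpu pointer cached before a migration can compare against another
task's canary.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 2

/* Hypothetical stand-in for struct paca_struct: one instance per CPU. */
struct fake_paca {
    uint64_t canary;    /* copy of the canary of the task running on this CPU */
};

static struct fake_paca pacas[NR_CPUS];

/* Stand-in for r13: points at the paca of the CPU the task is running on. */
static struct fake_paca *local_paca;

/* Models _switch(): copy the incoming task's canary into that CPU's paca. */
void switch_in(int cpu, uint64_t task_canary)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    pacas[cpu].canary = task_canary;
}

/* Models the task (re)starting on a CPU: "r13" now points at that paca. */
void run_on(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    local_paca = &pacas[cpu];
}

/* Epilogue check that re-reads the per-cpu pointer at check time. */
int check_fresh(uint64_t saved)
{
    return local_paca->canary == saved;
}

/* Epilogue check that reuses a pointer cached before the migration. */
int check_cached(const struct fake_paca *cached, uint64_t saved)
{
    return cached->canary == saved;
}

/* Returns 1 if the canary check passes, 0 if it reports a mismatch. */
int simulate_migration(int use_cached_pointer)
{
    const uint64_t task_a = 0xAAAA, task_b = 0xBBBB;

    /* Task A runs on CPU 0; the prologue saves the canary on its stack. */
    switch_in(0, task_a);
    run_on(0);
    struct fake_paca *cached = local_paca;  /* the stale copy of "r13" */
    uint64_t saved = local_paca->canary;

    /* Task A migrates to CPU 1; task B is then switched in on CPU 0. */
    switch_in(1, task_a);
    run_on(1);
    switch_in(0, task_b);

    return use_cached_pointer ? check_cached(cached, saved)
                              : check_fresh(saved);
}
```

In this model simulate_migration(0) returns 1 (the check passes), while
simulate_migration(1) returns 0: a spurious canary mismatch even though the
stack was never corrupted, because the cached pointer still refers to CPU 0,
whose paca now holds task B's canary.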