On 15-06-21, 08:17, Qian Cai wrote:
> On 6/15/2021 3:50 AM, Viresh Kumar wrote:
> > This is a strange place to get the issue from. And this is a new
> > issue.
>
> Well, it was still the same exercises with CPU online/offline.
>
> >
> >> [ 488.151939][ T670] kthread+0x3ac/0x460
> >> [ 488.155854][ T670] ret_from_fork+0x10/0x18
> >> [ 488.160120][ T670] Code: 911e8000 aa1303e1 910a0000 941b595b (d4210000)
> >> [ 488.166901][ T670] ---[ end trace e637e2d38b2cc087 ]---
> >> [ 488.172206][ T670] Kernel panic - not syncing: Oops - BUG: Fatal exception
> >> [ 488.179182][ T670] SMP: stopping secondary CPUs
> >> [ 489.209347][ T670] SMP: failed to stop secondary CPUs 0-1,10-11,16-17,31
> >> [ 489.216128][ T][ T670] Memoryn ]---
> >
> > Can you give details on what exactly did you try to do, to get this ?
> > Normal boot or something more ?
>
> Basically, it has the cpufreq driver as CPPC and the governor as
> schedutil. Running a few workloads to get CPU scaling up and down.
> Later, try to offline all CPUs until the last one and then online
> all CPUs.

Hmm, okay.

So I basically have very similar setup with 8 cores (1-policy per-cpu), the
only difference is I don't end up reading the performance counters, everything
else remains same. So I should see issues now just like you, in case there are
any.

Since the insmod/rmmod setup is a bit different, this is what I tried today for
around an hour with CONFIG_DEBUG_LIST and RCU debugging options.

while true; do
    for i in `seq 1 7`;
    do
        echo 0 > /sys/devices/system/cpu/cpu$i/online;
    done;

    for i in `seq 1 7`;
    do
        echo 1 > /sys/devices/system/cpu/cpu$i/online;
    done;
done

I don't see any crashes, oops or warnings with latest stuff.

> I am hesitate to try this at the moment because this all feel like
> shooting in the dark.

I understand your point and you aren't completely wrong here. It wasn't
completely in dark but since I am unable to reproduce the issue at my end, I
asked for help.

FWIW, I think one possible cause of the kthread corruption could have been the
race in the topology-related code. I already fixed that in my tree yesterday.

> Ideally, you will be able to get access to one
> of those arm64 servers (Huawei, Ampere, TX2, FJ etc) eventually and
> really try the same exercises yourself with those debugging options
> like list debugging and KASAN on. That way you could fix things way
> efficiently.

Yeah, I thought this work was over, and I am not normally a user of it. I had
to enable it for ARM servers and I took the help of my colleagues (Vincent
Guittot and Ionela) for testing it.

I have also asked Vincent to give it a try again.

> I could share you the .config once you are there. Last
> but not least, once you get better narrow down of the issues, I'd
> hope to see someone else familiar with the code there to get review
> of those patches first (feel free to Cc me once you are ready to
> post) before I'll rerun the whole things again. That way we don't
> waste time on each other backing and forth chasing the shadow.

I did send the stuff up for review, and this last thing (that you reported) was
a different race altogether, so I asked for testing without reviews.

Anyway, I am quite sure my tests have covered such issues now. I will send out
the patches again soon.

Thanks Qian.

-- 
viresh