RE: [Report] v4.19 TREE03 splat (stable kernel with ChromeOS patches)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Dec 13, 2022 at 8:21 AM Zhang, Qiang1 <qiang1.zhang@xxxxxxxxx> wrote:
>
> > On Dec 12, 2022, at 6:45 AM, Zhang, Qiang1 <qiang1.zhang@xxxxxxxxx> wrote:
> >
> > On Fri, Dec 9, 2022 at 11:34 PM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> >>
> >> Hello,
> >>
> >> I see this an hour into the run for TREE03 on v4.19:
> >> tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 32 --duration 60
> >>
> >> Checking to see if Qiang has any thoughts as I saw him comment about a similar
> >> issue in [1].
> >>
> >> [ 3243.844445] ------------[ cut here ]------------^M
> >> [ 3243.847112] WARNING: CPU: 1 PID: 0 at kernel/kthread.c:411 __kthread_bind_mask+0x19/0x60^M
> >> [ 3243.851585] Modules linked in:^M
> >> [ 3243.853295] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.19.267+ #1^M
> >> [ 3243.856699] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014^M
> >> [ 3243.860052] RIP: 0010:__kthread_bind_mask+0x19/0x60^M
> >> [ 3243.861769] Code: 48 89 42 20 c3 66 66 2e 0f 1f 84 00 00 00 00 00 90 41 55 41 54 55 48 89 f5 48 89 d6 53 48 89 fb e8 3c e9 00 00 48 85 c0 75 09 <0f> 0b 5b 5d 41 5c 41 5d c3 4c 8d ab b4 07 00 00 4c 89 ef e8 6f 28^M
> >> [ 3243.867751] RSP: 0000:ffffa54b8014fec0 EFLAGS: 00010246^M
> >> [ 3243.868959] RAX: 0000000000000000 RBX: ffff95555ee3c240 RCX: 0000000000000000^M
> >> [ 3243.870606] RDX: ffff95555f063d80 RSI: 0000000000000246 RDI: 00000000ffffffff^M
> >> [ 3243.872246] RBP: ffffffffb720a010 R08: af82bdcb11f9f4ff R09: 0000000000000000^M
> >> [ 3243.873894] R10: ffffa54b8014fe70 R11: 00000000621ccdc2 R12: 0000000000000000^M
> >> [ 3243.875538] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000^M
> >> [ 3243.877181] FS:  0000000000000000(0000) GS:ffff95555f040000(0000) knlGS:0000000000000000^M
> >> [ 3243.878698] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> >> [ 3243.879747] CR2: 0000000000000000 CR3: 000000001300a000 CR4: 00000000000006e0^M
> >> [ 3243.881050] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
> >> [ 3243.882358] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
> >> [ 3243.883660] Call Trace:^M
> >> [ 3243.884925]  kthread_unpark+0x52/0x60^M
> >> [ 3243.885613]  cpuhp_online_idle+0x31/0x50^M
> >> [ 3243.886341]  cpu_startup_entry+0x62/0x70^M
> >> [ 3243.887063]  start_secondary+0x186/0x1b0^M
> >> [ 3243.887783]  secondary_startup_64+0xa4/0xb0^M
> >>
> >> [1] https://groups.google.com/g/syzkaller-bugs/c/w_LARy6pxvQ/m/dKjQyHAxAQAJ
> >>
> >>
> >> This set of scheduler patches seems to make it go away, however I am
> >> running a long weekend test to collect more data:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/log/?h=rcu/torture/v4.19.fixes.wip.120922
> >
> >
> > Hi Joel
> >
> > I also used the link above to do the TREE03 test, looks like the problem still exists:
> >
> > 5834 [  966.016205]  kthread_unpark+0x50/0x60^M
> > 5835 [  966.016440]  cpuhp_online_idle+0x31/0x50^M
> > 5836 [  966.016678]  cpu_startup_entry+0x65/0x70^M
> > 5837 [  966.016913]  start_secondary+0x18b/0x1c0^M
> > 5838 [  966.017148]  secondary_startup_64+0xa4/0xb0^M
> > 5839 [  966.017407] ---[ end trace 636953f76b8055db ]---^M
> > 5840 [  966.017691] migration/1     R  running task    14952    18      2 0x80000000 last_sleep: 961405359084.  last_runna      ble: 961700816631^M
> > 5841 [  966.018406] Call Trace:^M
> > 5842 [  966.018559]  __schedule+0x75f/0x1320^M
> > 5843 [  966.018775]  ? cpuhp_invoke_callback+0x88/0x600^M
> > 5844 [  966.019042]  ? ___preempt_schedule+0x16/0x18^M
> > 5845 [  966.019302]  ? sort_range+0x20/0x20^M
> > 5846 [  966.019515]  preempt_schedule_common+0x32/0x80^M
> > 5847 [  966.019776]  ___preempt_schedule+0x16/0x18^M
> > 5848 [  966.020018]  _raw_spin_unlock_irq+0x1f/0x20^M
> > 5849 [  966.020275]  smpboot_thread_fn+0x195/0x230^M
> > 5850 [  966.020525]  kthread+0x139/0x160^M
> > 5851 [  966.020717]  ? kthread_create_worker_on_cpu+0x60/0x60^M
> > 5852 [  966.021014]  ret_from_fork+0x35/0x40^M
> >
> > From the calltrace, the state of migration kthreads is not correct,
> > I'm also looking at what's going wrong.
> >
> >Thanks for checking. I also noticed that the state of those threads is running instead of parked, causing the warning.
> >
> >- Joel
>
> Hi Joel
>
> I find the TREE03 enable CONFIG_SCHED_CORE = y, under this condition, if the cpu goes offline,
> we will directly select the idle kthread, which leads to it preempting the migration kthread.
> through calltrace, it can also be found , the  migration/1  kthread is preempted and schedule out.
>
> [  966.017691] migration/1     R  running task    14952
> [  966.018559]  __schedule+0x75f/0x1320^M
> [  966.018775]  ? cpuhp_invoke_callback+0x88/0x600^M
> [  966.019042]  ? ___preempt_schedule+0x16/0x18^M
> [  966.019302]  ? sort_range+0x20/0x20^M
> [  966.019515]  preempt_schedule_common+0x32/0x80^M
> [  966.019776]  ___preempt_schedule+0x16/0x18^M
> [  966.020018]  _raw_spin_unlock_irq+0x1f/0x20^M
> [  966.020275]  smpboot_thread_fn+0x195/0x230^M
> [  966.020525]  kthread+0x139/0x160^M
> [  966.020717]  ? kthread_create_worker_on_cpu+0x60/0x60^M
> [  966.021014]  ret_from_fork+0x35/0x40^M
>
> The following modification can fix this warning(this change comes from the new kernel version).
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0acbc7706d71..1c76226a2c26 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4109,8 +4109,10 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>         bool fi_before = false;
>
>         cpu = cpu_of(rq);
> -       if (cpu_is_offline(cpu))
> -               return idle_sched_class.pick_next_task(rq, prev, rf);
> +       if (cpu_is_offline(cpu)) {
> +               rq->core_pick = NULL;
> +               return __pick_next_task(rq, prev, rf);
> +       }
>
>Wow hats off at that debugging and thank you so much, I will test this
>more and let you know. I think CONFIG_SCHED_CORE might be enabled by
>default on the ChromeOS 4.19 kernel, which is why we see it and the
>regular 4.19 stable kernel does not.
>
>Anyway thank you again and kudos on the awesome debug. I will keep you
>posted on my testing.

No problem, most of my work is also done on older kernel versions.

Thanks
Zqiang

>
>Thanks,
>
> - Joel




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux