On 12.04.2018 at 15:30, Vincent Guittot wrote:
> Heiner, Niklas,
>
> On Thursday 12 Apr 2018 at 13:15:19 (+0200), Niklas Söderlund wrote:
>> Hi Vincent,
>>
>> Thanks for your feedback.
>>
>> On 2018-04-12 12:33:27 +0200, Vincent Guittot wrote:
>>> Hi Niklas,
>>>
>>> On 12 April 2018 at 11:18, Niklas Söderlund
>>> <niklas.soderlund@xxxxxxxxxxxx> wrote:
>>>> Hi Vincent,
>>>>
>>>> I have observed issues running on linus/master from a few days back [1].
>>>> I'm running on a Renesas Koelsch board (arm32) and I can trigger an issue
>>>> by X forwarding the v4l2 test application qv4l2 over ssh and moving the
>>>> cursor around in the GUI (best test case description award...). I'm
>>>> sorry about the really bad way I trigger this, but I can't do it in any
>>>> other way; I'm happy to try other methods if you have some ideas. The
>>>> symptom of the issue is a complete hang of the system for more than 30
>>>> seconds, after which this information is printed on the console:
>>>
>>> Heiner (added to cc) also reported a similar problem with his platform: a
>>> dual-core Celeron.
>>>
>>> Can you confirm that your platform is a dual Cortex-A15? At least that
>>> is what I have seen on the web.
>>> This would confirm that a dual-core system is a key point.
>>
>> I can confirm that my platform is a dual core.
>>
>>>
>>> The ssh connection is also common with Heiner's setup.
>>
>> Interesting, I found Heiner's mail and I can confirm that I too
>> experience ssh session lockups. I ssh into the system, and by repeatedly
>> hitting the return key I can lock up the board; while it is locked up,
>> starting another ssh session unblocks the first. If I don't start another
>> ssh session but keep hitting the return key sporadically in the first one,
>> I can get the trace I reported in my first mail to be printed on the
>> serial console.
>>
>> When locked up, the symptoms are that both the single ssh session and
>> the serial console are dead.
>> But attempting another ssh connection
>> immediately unblocks both the ssh session and the serial console. And if
>> I allow enough time before starting the second ssh connection, I can
>> trigger a trace to be printed on the serial console; it's similar to,
>> but different from, the first one I reported.
>>
>> [ 207.548610] 1-...!: (0 ticks this GP) idle=79a/1/1073741824 softirq=2146/2146 fqs=0
>> [ 207.556442] (detected by 0, t=12645 jiffies, g=333, c=332, q=20)
>> [ 207.562546] rcu_sched kthread starved for 12645 jiffies! g333 c332 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=0
>> [ 207.572548] RCU grace-period kthread stack dump:
>>
>> [ 207.577166] rcu_sched       R running task        0     9      2 0x00000000
>> [ 207.584389] Backtrace:
>> [ 207.586849] [<c0760450>] (__schedule) from [<c0760ba4>] (schedule+0x94/0xb8)
>> [ 207.593901]  r10:e77813c0 r9:e77813c0 r8:ffffffff r7:e709bed4 r6:ffffaa80 r5:00000000
>> [ 207.601732]  r4:ffffe000
>> [ 207.604269] [<c0760b10>] (schedule) from [<c0764560>] (schedule_timeout+0x380/0x3dc)
>> [ 207.612013]  r5:00000000 r4:00000000
>> [ 207.615596] [<c07641e0>] (schedule_timeout) from [<c017b698>] (rcu_gp_kthread+0x668/0xe2c)
>> [ 207.623863]  r10:c0b79018 r9:0000014d r8:0000014c r7:00000001 r6:00000000 r5:c0b10ad0
>> [ 207.631693]  r4:c0b10980
>> [ 207.634230] [<c017b030>] (rcu_gp_kthread) from [<c013ddd8>] (kthread+0x148/0x160)
>> [ 207.641712]  r7:c0b10980
>> [ 207.644249] [<c013dc90>] (kthread) from [<c01010e8>] (ret_from_fork+0x14/0x2c)
>> [ 207.651472] Exception stack(0xe709bfb0 to 0xe709bff8)
>> [ 207.656527] bfa0: 00000000 00000000 00000000 00000000
>> [ 207.664709] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>> [ 207.672890] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000
>> [ 207.679508]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:c013dc90
>> [ 207.687340]  r4:e7026f4
>>
>> Continuing the anecdotal testing, I can't seem to be able to trigger the
>> lockup if I have ever had two ssh sessions open to the
>> system. And about half the time I can't trigger it at all, but after a
>> reset of the system it triggers within 2-5 presses of the return key
>> after opening an ssh session. But please take this part with a grain of
>> salt as it's done by the monkey testing method :-)
>>
>> All tests above have been run based on c18bb396d3d261eb ("Merge
>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").
>>
>>>
>>>>
>
> [snip]
>
>>>>
>>>> I'm a bit lost on how to progress with this issue and would appreciate
>>>> any help you can provide to help me figure this out.
>>>
>>> Can you send me your config?
>>>
>>> I'm going to prepare a debug patch to spy on what's happening when
>>> entering idle.
>
> I'd like to narrow the problem down a bit more with these two patches. Can
> you try them separately on top of c18bb396d3d261eb ("Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net") and check if one
> of them fixes the problem?
>
> (They should apply on linux-next as well.)
>
> The first patch always kicks an ilb instead of doing the ilb on the local
> cpu before entering idle:
>
> ---
>  kernel/sched/fair.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0951d1c..b21925b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9739,8 +9739,7 @@ static void nohz_newidle_balance(struct rq *this_rq)
>  	 * candidate for ilb instead of waking up another idle CPU.
>  	 * Kick an normal ilb if we failed to do the update.
>  	 */
> -	if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
> -		kick_ilb(NOHZ_STATS_KICK);
> +	kick_ilb(NOHZ_STATS_KICK);
>  	raw_spin_lock(&this_rq->lock);
>  }

I tested both patches; with both of them the issue still occurs. However, on
top of linux-next from yesterday I have the impression that it happens less
frequently with the second patch. On top of the commit you mentioned I don't
see a change in system behavior with either patch.
Regards, Heiner