Hi Sebastian,

Were you able to gain any insight from the traces?

If we were to proceed with reverting the kernel/sched/core.c patch in our build of 3.18.29-rt30, would the addition of the WARN_ON_ONCE(p->migrate_disable_atomic <= 0) debug check that you recommended (2016/07/29) be sufficient for detecting imbalances? We would perform extended testing on multiple systems to determine the effects of reverting the patch.

Cheers,
Carol

> -----Original Message-----
> From: Carol Wong
> Sent: Wednesday, August 03, 2016 6:32 PM
> To: 'Sebastian Andrzej Siewior'
> Cc: linux-rt-users@xxxxxxxxxxxxxxx; David Hauck; Preston Hauck
> Subject: RE: v3.18-RT
>
> Hi Sebastian,
>
> I made the suggested change to sched/core.c and verified that
> CONFIG_SCHED_DEBUG=y. I reproduced the crash 3 times and captured the
> attached traces.
>
> Thanks,
> Carol
>
> > -----Original Message-----
> > From: Sebastian Andrzej Siewior [mailto:bigeasy@xxxxxxxxxxxxx]
> > Sent: Friday, July 29, 2016 9:20 AM
> > To: Carol Wong
> > Cc: linux-rt-users@xxxxxxxxxxxxxxx; David Hauck; Preston Hauck
> > Subject: Re: v3.18-RT
> >
> > * Carol Wong | 2016-07-20 20:53:21 [+0000]:
> >
> > >Hi Sebastian,
> > Hi Carol,
> >
> > >We finally traced the boot-up crash to the following patch in
> > >kernel/sched/core.c:
> > >
> > >https://git.kernel.org/cgit/linux/kernel/git/rt/linux-stable-rt.git/commit/?h=v3.18-rt&id=62044e554f14547061afcfef7f0aceda43e28982
> > >
> > >After reverting the two-line patch in 3.18.29-rt30, the crash no
> > >longer occurs on our dual Xeon (2x12 core) system.
> > >
> > >Other observations:
> > >- Does not reproduce on single processor (2 and 4 core) systems
> > >- Reproduces under 3.18.27-rt27 and 3.18.36-rt38 on the dual Xeon
> > >- Does not reproduce on 3.18.27-rt26 and earlier on the dual Xeon
> > >- Reproduces more frequently on .29-rt30 (1 in 20 reboots) compared
> > >  to .27-rt27 (1 in 100 reboots)
> > >
> > >So far we've not observed any side effects after reverting this
> > >patch.
> >
> > This was part of CPU hotplug fixups. Lockdep might be broken without
> > it, but I am not sure whether that is the case most of the time or
> > just during hotplug.
> >
> > >I understand that a high core count system may not be easy to come
> > >by, so if there are diagnostics or patches you would like to try on
> > >the dual Xeon system, we can assist with that.
> >
> > With that patch, migrate_disable() skips the whole preempt-lazy +
> > pin-cpu code if called with IRQs off. Since interrupts are disabled,
> > we can't migrate to another CPU, so it is a possible optimisation.
> > It only makes a difference if migrate_disable() + migrate_enable()
> > calls are not in balance. The commit
> > https://git.kernel.org/cgit/linux/kernel/git/rt/linux-stable-rt.git/commit/?h=v3.18-rt&id=8d51d3a296b6ec4aebd0d6d7e1b7162cd9bf6662
> > is one example where I fixed the imbalance.
> > Do you get additional backtraces with CONFIG_SCHED_DEBUG enabled?
> >
> > There is one thing the debug code does not cover, so could you
> > please add this chunk?
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 140ee06079b6..1f8613f77598 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3229,6 +3229,7 @@ void migrate_enable(void)
> >
> >          if (in_atomic() || irqs_disabled()) {
> >  #ifdef CONFIG_SCHED_DEBUG
> > +                WARN_ON_ONCE(p->migrate_disable_atomic <= 0);
> >                  p->migrate_disable_atomic--;
> >  #endif
> >                  return;
> >
> > >Cheers,
> > >Carol Wong
> > >NetAcquire Corporation
> >
> > Sebastian
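
For reference, the counter logic that the chunk above extends can be modeled in user space. This is only a rough sketch, not kernel code: struct model_task, the warnings field and both function names are hypothetical, and it assumes that the atomic/IRQs-off path of migrate_disable() does nothing except increment migrate_disable_atomic under CONFIG_SCHED_DEBUG. Unlike WARN_ON_ONCE(), the model counts every hit rather than only the first.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the one task_struct field the debug check reads. */
struct model_task {
    int migrate_disable_atomic; /* atomic-context disables not yet re-enabled */
    int warnings;               /* times the modeled check fired (model only) */
};

/* Assumed behaviour of migrate_disable() in atomic context: count and return. */
void model_migrate_disable_atomic(struct model_task *p)
{
    p->migrate_disable_atomic++;
}

/*
 * Models the atomic path of migrate_enable() with the added check.
 * Returns true when the check would fire, i.e. when the enable has no
 * matching atomic-context disable.
 */
bool model_migrate_enable_atomic(struct model_task *p)
{
    bool unbalanced = p->migrate_disable_atomic <= 0;

    if (unbalanced)
        p->warnings++;
    p->migrate_disable_atomic--;
    return unbalanced;
}
```

In this model a disable followed by an enable leaves the counter at 0 without a warning, while a second enable trips the check and drives the counter to -1. A disable that is never followed by an enable leaves the counter positive and does not trip this particular check, so in the model it only catches the "extra enable" direction of an imbalance.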