Hi,

The kernel hangs on the stop_machine call. Please find the MT register dump below.

Another important observation: even though the 2.6.33 kernel + stackframe
patch gets well past the calibration hang, I am still unable to boot into an
initramfs root (I have verified that the ramfs works with VSMP). So it looks
like there is still some issue to fix between 2.6.32 and 2.6.33.

######################## Log ###########################
=== MIPS MT State Dump ===
-- Global State --
   MVPControl Passed: 00000005
   MVPControl Read: 00000004
   MVPConf0 : a8008406
-- per-VPE State --
  VPE 0
   VPEControl : 00008000
   VPEConf0 : 800f0003
   VPE0.Status : 11004201
   VPE0.EPC : 8010dc54 smtc_ipi_replay+0xcc/0x108
   VPE0.Cause : 50804000
   VPE0.Config7 : 00010000
  VPE 1
   VPEControl : 00068006
   VPEConf0 : 80cf0003
   VPE1.Status : 11008301
   VPE1.EPC : 801022a0 r4k_wait+0x20/0x40
   VPE1.Cause : 50800000
   VPE1.Config7 : 00010000
-- per-TC State --
  TC 0 (current TC with VPE EPC above)
   TCStatus : 18102000
   TCBind : 00000000
   TCRestart : 803fa19c printk+0xc/0x30
   TCHalt : 00000000
   TCContext : 00000000
  TC 1
   TCStatus : 18902000
   TCBind : 00200000
   TCRestart : 801022a0 r4k_wait+0x20/0x40
   TCHalt : 00000000
   TCContext : 00140000
  TC 2
   TCStatus : 18902000
   TCBind : 00400000
   TCRestart : 801022a0 r4k_wait+0x20/0x40
   TCHalt : 00000000
   TCContext : 00280000
  TC 3
   TCStatus : 18902000
   TCBind : 00600000
   TCRestart : 801022a0 r4k_wait+0x20/0x40
   TCHalt : 00000000
   TCContext : 003c0000
  TC 4
   TCStatus : 18902000
   TCBind : 00800001
   TCRestart : 8010229c r4k_wait+0x1c/0x40
   TCHalt : 00000000
   TCContext : 00500000
  TC 5
   TCStatus : 18902000
   TCBind : 00a00001
   TCRestart : 8010229c r4k_wait+0x1c/0x40
   TCHalt : 00000000
   TCContext : 00640000
  TC 6
   TCStatus : 18902000
   TCBind : 00c00001
   TCRestart : 8010229c r4k_wait+0x1c/0x40
   TCHalt : 00000000
   TCContext : 00780000
Counter Interrupts taken per CPU (TC)
0: 0  1: 0  2: 0  3: 0  4: 0  5: 0  6: 0  7: 0
Self-IPI invocations:
0: 12  1: 0  2: 0  3: 0  4: 0  5: 5  6: 4  7: 0
IPIQ[0]: head = 0x0, tail = 0x0, depth = 0
IPIQ[1]: head = 0x0, tail = 0x0, depth = 0
IPIQ[2]: head = 0x0, tail = 0x0, depth = 0
IPIQ[3]: head = 0x0, tail = 0x0, depth = 0
IPIQ[4]: head = 0x0, tail = 0x0, depth = 0
IPIQ[5]: head = 0x0, tail = 0x0, depth = 0
IPIQ[6]: head = 0x0, tail = 0x0, depth = 0
IPIQ[7]: head = 0x0, tail = 0x0, depth = 0
0 Recoveries of "stolen" FPU
===========================
################################################################

Thanks
Anoop

On Tue, 2010-12-28 at 00:43 -0800, Kevin D. Kissell wrote:
> I took a quick look last night, and the only thing that looked vaguely
> dangerous in changes since the timer changes I alluded to earlier was
> the global naming cleanup of irq-related function names that David
> Howell submitted. The diff didn't look dangerous in itself, but some of
> the definitions are nested subtly for SMTC to maximize the amount of
> common code, and I could imagine something getting lost in translation
> there. If that were really the problem, it would of course affect much
> more than just the timer subsystem, but early in the boot process,
> timers are pretty much the only interrupts that have to be handled
> correctly.
>
> I'm travelling today, but will take a look at timekeeping_notify()
> tomorrow or the next day...
>
> /K.
>
> On 12/28/10 12:19 AM, Anoop P A wrote:
> > Hi,
> >
> > I had a glance through the code diff without noticing any suspect
> > code.
> > Tracing the hang showed that it is hanging in the timekeeping_notify
> > function.
> >
> > Thanks,
> > Anoop
> >
> > PS: I may not be available until Thursday
> >
> > On Mon, 2010-12-27 at 22:49 +0530, Anoop P A wrote:
> >> Hi Kevin,
> >>
> >> It is very unlikely that the patch you pointed to has any impact on the
> >> hang I am seeing. The patch you mentioned got into the kernel around
> >> the 2.6.32 timeframe. I am able to boot both the 2.6.32 and 2.6.33
> >> kernels (+ stackframe patch).
> >>
> >> Hi Stuart,
> >>
> >> I haven't got much time to spend on this today.
> >>
> >> I got 2.6.36-stable (+ stackframe patch) booting yesterday, and I have
> >> observed the hang with 2.6.37-rc1 (same as rc6 and current git head).
> >>
> >> So probably some patches in the 2.6.37 branch introduced this hang.
> >>
> >> Hopefully I will get a free slot tomorrow so that I can look into the
> >> code diff.
> >>
> >> Thanks
> >> Anoop
> >>
> >> On Mon, 2010-12-27 at 09:49 -0600, STUART VENTERS wrote:
> >>> Kevin,
> >>>
> >>> Outstanding, sometimes it's better to be lucky than good.
> >>>
> >>>
> >>> Anoop,
> >>>
> >>> Maybe we can get lucky again.
> >>>
> >>> If you can isolate the .33 works/.37 works_not bug to a specific pair of versions,
> >>> I'll be happy to do another diff.
> >>>
> >>>
> >>> Hope you'll have had a good Christmas as well.
> >>> We've had snow in Alabama since Christmas eve!
> >>>
> >>>
> >>> Regards,
> >>>
> >>> Stuart
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Kevin D. Kissell [mailto:kevink@xxxxxxxxxxxxx]
> >>> Sent: Friday, December 24, 2010 5:34 PM
> >>> To: Anoop P A
> >>> Cc: STUART VENTERS; Anoop P.A.; linux-mips@xxxxxxxxxxxxxx
> >>> Subject: Re: SMTC support status in latest git head.
> >>>
> >>>
> >>> Ah, well, at least we have a stackframe.h fix that preserves David's
> >>> performance tweak for the deeper pipelined processors. In looking for
> >>> this, I did notice that someone did some modification to the SMTC clock
> >>> tick logic that I was skeptical had ever been tested. If you've still
> >>> got that kernel binary handy, you might check to see if it boots with
> >>> maxtcs=1 maxvpes=1, maxtcs=2 maxvpes=1, and/or maxtcs=2 maxvpes=2.
> >>>
> >>> Oh, yes, and Merry Christmas one and all!
> >>>
> >>> Regards,
> >>>
> >>> Kevin K.
> >>>
> >>> On 12/24/10 8:02 AM, Anoop P A wrote:
> >>>> On Fri, 2010-12-24 at 06:53 -0800, Kevin D. Kissell wrote:
> >>>>> Excellent!
> >>>>> Now, does the attached patch (relative to 2.6.37.11) also
> >>>>> fix things, while preserving the other fixes and performance enhancements?
> >>>>>
> >>>> I have tested that patch with the 2.6.37 branch. It gets well past the
> >>>> calibration loop but hangs after switching to the MIPS clocksource:
> >>>>
> >>>> TC 6 going on-line as CPU 6
> >>>> Brought up 7 CPUs
> >>>> bio: create slab<bio-0> at 0
> >>>> SCSI subsystem initialized
> >>>> Switching to clocksource MIPS
> >>>>
> >>>> I presume this is a different issue, as restoring the older file didn't
> >>>> help to get rid of this hang.
> >>>>
> >>>> diff --git a/arch/mips/include/asm/stackframe.h
> >>>> b/arch/mips/include/asm/stackframe.h
> >>>> index 58730c5..7fc9f10 100644
> >>>> --- a/arch/mips/include/asm/stackframe.h
> >>>> +++ b/arch/mips/include/asm/stackframe.h
> >>>> @@ -195,9 +195,9 @@
> >>>>           * to cover the pipeline delay.
> >>>>           */
> >>>>          .set    mips32
> >>>> -        mfc0    v1, CP0_TCSTATUS
> >>>> +        mfc0    v0, CP0_TCSTATUS
> >>>>          .set    mips0
> >>>> -        LONG_S  v1, PT_TCSTATUS(sp)
> >>>> +        LONG_S  v0, PT_TCSTATUS(sp)
> >>>>  #endif /* CONFIG_MIPS_MT_SMTC */
> >>>>          LONG_S  $4, PT_R4(sp)
> >>>>          LONG_S  $5, PT_R5(sp)
> >>>>
> >>>>
> >>>>> /K.
> >>>>>
> >>>>> On 12/24/10 6:39 AM, Anoop P A wrote:
> >>>>>> Hi Kevin, Stuart,
> >>>>>>
> >>>>>> Woohoo! You guys spotted it!
> >>>>>>
> >>>>>> http://git.linux-mips.org/?p=linux.git;a=commit;h=d5ec6e3c seems to be
> >>>>>> the culprit.
> >>>>>>
> >>>>>> Once I restored the previous version of stackframe.h, 2.6.33-stable
> >>>>>> started booting!
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Anoop
> >>>>>>
> >>>>>> On Fri, 2010-12-24 at 04:32 -0800, Kevin D. Kissell wrote:
> >>>>>>> Thank you, Stuart! I've spotted some definite breakage to SMTC between
> >>>>>>> those versions.
> >>>>>>> In arch/mips/include/asm/stackframe.h, someone moved
> >>>>>>> the store of the Status register value in SAVE_SOME (line 169 or 204,
> >>>>>>> depending on the version) from two instructions after the mfc0 to a
> >>>>>>> point after the #ifdef for SMTC, presumably to get better pipelining of
> >>>>>>> the register access. Unfortunately, the v1 register is also used in the
> >>>>>>> SMTC-specific fragment to save TCStatus, so the Status value gets
> >>>>>>> clobbered before it gets stored. This will eventually result in the
> >>>>>>> Status register getting a TCStatus value, which has some bits in common,
> >>>>>>> but isn't identical and sooner or later Bad Things will happen.
> >>>>>>>
> >>>>>>> I'm a little surprised this wasn't caught by visual inspection of the patch.
> >>>>>>>
> >>>>>>> Possible solutions would include reverting the store of the CP0_STATUS
> >>>>>>> value to the block above the #ifdef, or, to retain whatever performance
> >>>>>>> advantage was obtained by moving the store downward, to use v0/$2
> >>>>>>> instead of v1/$3, as the staging register for the TCStatus value. I'd
> >>>>>>> lean toward the second option, but I'm not in a position to test and
> >>>>>>> submit a patch just now.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>>
> >>>>>>> Kevin K.
> >>>>>>>
> >>>>>>> On 12/23/10 1:09 PM, STUART VENTERS wrote:
> >>>>>>>> Kevin,
> >>>>>>>>
> >>>>>>>> I'm not sure if it's useful,
> >>>>>>>> but finally I got the time to look at the two kernel versions Anoop pointed out.
> >>>>>>>> works      2.6.32-stable with patch 804
> >>>>>>>> works_not  2.6.33-stable
> >>>>>>>>
> >>>>>>>> grepping for files with CONFIG_MIPS_MT_SMTC
> >>>>>>>> and looking for timer interrupt related stuff found the following differences:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> arch/mips/include/asm/irq.h
> >>>>>>>> arch/mips/kernel/irq.c
> >>>>>>>>     do_IRQ
> >>>>>>>>
> >>>>>>>> arch/mips/include/asm/stackframe.h
> >>>>>>>>     SAVE_SOME SAVE_TEMP get/set_saved_sp
> >>>>>>>>
> >>>>>>>> arch/mips/include/asm/time.h
> >>>>>>>>     clocksource_set_clock
> >>>>>>>>
> >>>>>>>> arch/mips/kernel/process.c
> >>>>>>>>     cpu_idle
> >>>>>>>>
> >>>>>>>> arch/mips/kernel/smtc.c
> >>>>>>>>     __irq_entry
> >>>>>>>>     ipi_decode
> >>>>>>>>     SMTC_CLOCK_TICK
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Enclosed are the two subsets of files for a more expert look.
> >>>>>>>>
> >>>>>>>> I'll try to look in more detail after Christmas.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>>
> >>>>>>>> Stuart
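
For anyone reading this thread later: the SAVE_SOME register clobber Kevin describes can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the real macro. The instruction sequences are simplified, the `run` helper and the tuple encoding are hypothetical, and the Status/TCStatus values are simply borrowed from the VPE 0 / TC 0 entries of the dump at the top of the thread for illustration.

```python
# Toy model of the SAVE_SOME ordering problem discussed in this thread.
# Register and stack-slot names follow the thread; the instruction
# sequences are simplified assumptions, not the real stackframe.h macro.

# Illustrative CP0 contents (taken from the VPE 0 / TC 0 dump values).
CP0 = {"CP0_STATUS": 0x11004201, "CP0_TCSTATUS": 0x18102000}


def run(sequence):
    """Execute (op, reg, target) tuples and return the saved stack frame."""
    regs, frame = {}, {}
    for op, reg, target in sequence:
        if op == "mfc0":       # read a CP0 register into a general register
            regs[reg] = CP0[target]
        elif op == "store":    # models LONG_S reg, target(sp)
            frame[target] = regs[reg]
    return frame


# Broken ordering: Status is staged in v1, then v1 is reused for TCStatus
# before the (delayed) store of Status takes place.
broken = [
    ("mfc0", "v1", "CP0_STATUS"),
    ("mfc0", "v1", "CP0_TCSTATUS"),
    ("store", "v1", "PT_TCSTATUS"),
    ("store", "v1", "PT_STATUS"),
]

# Ordering with the fix from the patch in this thread: TCStatus is staged
# in v0, so v1 still holds Status when it is finally stored.
fixed = [
    ("mfc0", "v1", "CP0_STATUS"),
    ("mfc0", "v0", "CP0_TCSTATUS"),
    ("store", "v0", "PT_TCSTATUS"),
    ("store", "v1", "PT_STATUS"),
]

broken_frame = run(broken)
fixed_frame = run(fixed)
print("broken PT_STATUS:", hex(broken_frame["PT_STATUS"]))
print("fixed  PT_STATUS:", hex(fixed_frame["PT_STATUS"]))
```

In the broken ordering the saved Status slot ends up holding the TCStatus value, which is the corruption Kevin describes; with v0 as the staging register both slots get the right value.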