Re: SMTC support status in latest git head.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



At this point the logical thing to do would seem to look at your kernel
image and disassemble smtc_ipi_replay(), which is where the EPC of VPE 0
shows the last exception to have been taken.  That's a critical SMTC
routine that gets called whenever an xxx_irq_restore() enables
interrupts, so that virtual per-TC IPI interrupts that were posted while
the TC had interrupts disabled can be handled deterministically.  As I
mentioned in an earlier message, there was some cleanup work from David
Howell that changed a number of irq management-related function names
and prototypes across all architectures, which went into linux-mips.org
at very roughly the time of the breakage.  The SMTC overlay over the irq
implementation has been pretty robust, but it's written in a perhaps
doomed attempt to be both efficient and using a maximum amount of common
code with the general case.  A mechanical or semi-mechanical change
could conceivably have broken things.

            Regards,

            Kevin K.


On 12/31/2010 4:27 AM, Anoop P A wrote:
> Hi ,
>
> Kernel hangs on stop_machine call. Please find mt reg dump below.
> Another important observation is even though 2.6.33 kernel + stackframe
> patch well passes calibration hang , I am still unable boot in to a
> initramfs root ( verified ramfs working with VSMP). So it looks like
> still some issue to fix between 2.6.32 and 2.6.33 .
> ######################## Log ###########################
>
> === MIPS MT State Dump ===
> -- Global State --
>    MVPControl Passed: 00000005
>    MVPControl Read: 00000004
>    MVPConf0 : a8008406
> -- per-VPE State --
>   VPE 0
>    VPEControl : 00008000
>    VPEConf0 : 800f0003
>    VPE0.Status : 11004201
>    VPE0.EPC : 8010dc54 smtc_ipi_replay+0xcc/0x108
>    VPE0.Cause : 50804000
>    VPE0.Config7 : 00010000
>   VPE 1
>    VPEControl : 00068006
>    VPEConf0 : 80cf0003
>    VPE1.Status : 11008301
>    VPE1.EPC : 801022a0 r4k_wait+0x20/0x40
>    VPE1.Cause : 50800000
>    VPE1.Config7 : 00010000
> -- per-TC State --
>   TC 0 (current TC with VPE EPC above)
>    TCStatus : 18102000
>    TCBind : 00000000
>    TCRestart : 803fa19c printk+0xc/0x30
>    TCHalt : 00000000
>    TCContext : 00000000
>   TC 1
>    TCStatus : 18902000
>    TCBind : 00200000
>    TCRestart : 801022a0 r4k_wait+0x20/0x40
>    TCHalt : 00000000
>    TCContext : 00140000
>   TC 2
>    TCStatus : 18902000
>    TCBind : 00400000
>    TCRestart : 801022a0 r4k_wait+0x20/0x40
>    TCHalt : 00000000
>    TCContext : 00280000
>   TC 3
>    TCStatus : 18902000
>    TCBind : 00600000
>    TCRestart : 801022a0 r4k_wait+0x20/0x40
>    TCHalt : 00000000
>    TCContext : 003c0000
>   TC 4
>    TCStatus : 18902000
>    TCBind : 00800001
>    TCRestart : 8010229c r4k_wait+0x1c/0x40
>    TCHalt : 00000000
>    TCContext : 00500000
>   TC 5
>    TCStatus : 18902000
>    TCBind : 00a00001
>    TCRestart : 8010229c r4k_wait+0x1c/0x40
>    TCHalt : 00000000
>    TCContext : 00640000
>   TC 6
>    TCStatus : 18902000
>    TCBind : 00c00001
>    TCRestart : 8010229c r4k_wait+0x1c/0x40
>    TCHalt : 00000000
>    TCContext : 00780000
> Counter Interrupts taken per CPU (TC)
> 0: 0
> 1: 0
> 2: 0
> 3: 0
> 4: 0
> 5: 0
> 6: 0
> 7: 0
> Self-IPI invocations:
> 0: 12
> 1: 0
> 2: 0
> 3: 0
> 4: 0
> 5: 5
> 6: 4
> 7: 0
> IPIQ[0]: head = 0x0, tail = 0x0, depth = 0
> IPIQ[1]: head = 0x0, tail = 0x0, depth = 0
> IPIQ[2]: head = 0x0, tail = 0x0, depth = 0
> IPIQ[3]: head = 0x0, tail = 0x0, depth = 0
> IPIQ[4]: head = 0x0, tail = 0x0, depth = 0
> IPIQ[5]: head = 0x0, tail = 0x0, depth = 0
> IPIQ[6]: head = 0x0, tail = 0x0, depth = 0
> IPIQ[7]: head = 0x0, tail = 0x0, depth = 0
> 0 Recoveries of "stolen" FPU
> ===========================
>
> ################################################################
>
> Thanks
> Anoop
>
> On Tue, 2010-12-28 at 00:43 -0800, Kevin D. Kissell wrote:
>> I took a quick look last night, and the only thing that looked vaguely 
>> dangerous in changes since the timer changes I alluded to earlier was 
>> the global naming cleanup of irq-related function names that David 
>> Howell submitted.  The diff didn't look dangerous in itself, but some of 
>> the definitions are nested subtly for SMTC to maximize the amount of 
>> common code, and I could imagine something getting lost in translation 
>> there.  If that were really the problem, it would of course affect much 
>> more than just the timer subsystem, but early in the boot process, 
>> timers are pretty much the only interrupts that have to be handled 
>> correctly.
>>
>> I'm travelling today, but will take a look at timekeeping_notify() 
>> tomorrow or the next day...
>>
>> /K.
>>
>> On 12/28/10 12:19 AM, Anoop P A wrote:
>>> Hi,
>>>
>>> I had a glance into the code diff without notice of any suspect-able
>>> code .
>>> Tracing the hang showed that it is getting hanged in timekeeping_notify
>>> function.
>>>
>>> Thanks,
>>> Anoop
>>>
>>> PS: I may not be available until Thursday
>>>
>>> On Mon, 2010-12-27 at 22:49 +0530, Anoop P A wrote:
>>>> Hi Kevin,
>>>>
>>>> It is very unlikely that the patch you pointed has any impact on the the
>>>> hang I am seeing. The patch you have mentioned got into kernel around
>>>> 2.6.32 timeframe. I am able to boot both 2.6.32 and  2.6.33 kernel ( +
>>>> stackframe patch) .
>>>>
>>>> Hi Stuart,
>>>>
>>>> I haven't got much time to spend on this today.
>>>>
>>>> I had got 2.6.36-stable(+ stack frame patch) booting last day and I have
>>>> observed hang issue with 2.6.37-rc1 ( Same as rc6 and current git head)
>>>>
>>>> So probably some patches in 2.6.37 branch introduced this hang.
>>>>
>>>> Hopefully I will get some free slot tomorrow so that I can look into
>>>> code diff .
>>>>
>>>> Thanks
>>>> Anoop
>>>>
>>>> On Mon, 2010-12-27 at 09:49 -0600, STUART VENTERS wrote:
>>>>> Kevin,
>>>>>
>>>>> Outstanding, sometimes it's better to be lucky than good.
>>>>>
>>>>>
>>>>> Anoop,
>>>>>
>>>>> Maybe we can get lucky again.
>>>>>
>>>>> If you can isolate the .33 works/.37 works_not bug to a specific pair of versions,
>>>>>     I'll be happy to do another diff.
>>>>>
>>>>>
>>>>> Hope you'll have had a good Christmas as well.
>>>>>    We've had snow in Alabama since Christmas eve!
>>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Stuart
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Kevin D. Kissell [mailto:kevink@xxxxxxxxxxxxx]
>>>>> Sent: Friday, December 24, 2010 5:34 PM
>>>>> To: Anoop P A
>>>>> Cc: STUART VENTERS; Anoop P.A.; linux-mips@xxxxxxxxxxxxxx
>>>>> Subject: Re: SMTC support status in latest git head.
>>>>>
>>>>>
>>>>> Ah, well, at least we have a stackframe.h fix that preserves David's
>>>>> performance tweak for the deeper pipelined processors.  In looking for
>>>>> this, I did notice that someone did some modification to the SMTC clock
>>>>> tick logic that I was skeptical had ever been tested.  If you've still
>>>>> got that kernel binary handy, you might check to see if it boots with
>>>>> maxtcs=1 maxvpes=1, maxtcs=2 maxvpes=1, and/or maxtcs=2 maxvpes=2.
>>>>>
>>>>> Oh, yes, and Merry Christmas one and all!
>>>>>
>>>>>               Regards,
>>>>>
>>>>>               Kevin K.
>>>>>
>>>>> On 12/24/10 8:02 AM, Anoop P A wrote:
>>>>>> On Fri, 2010-12-24 at 06:53 -0800, Kevin D. Kissell wrote:
>>>>>>> Excellent!  Now, does the attached patch (relative to 2.6.37.11) also
>>>>>>> fix things, while preserving the other fixes and performance enhancements?
>>>>>>>
>>>>>> I have tested that patch with 2.6.37 branch it well passes calibration
>>>>>> loop but hangs after switching to mips closource
>>>>>>
>>>>>> TC 6 going on-line as CPU 6
>>>>>> Brought up 7 CPUs
>>>>>> bio: create slab<bio-0>   at 0
>>>>>> SCSI subsystem initialized
>>>>>> Switching to clocksource MIPS
>>>>>>
>>>>>> I Presume this is a different issue as restoring older file didn't help
>>>>>> much to get rid of this hang.
>>>>>>
>>>>>> diff --git a/arch/mips/include/asm/stackframe.h
>>>>>> b/arch/mips/include/asm/stackframe.h
>>>>>> index 58730c5..7fc9f10 100644
>>>>>> --- a/arch/mips/include/asm/stackframe.h
>>>>>> +++ b/arch/mips/include/asm/stackframe.h
>>>>>> @@ -195,9 +195,9 @@
>>>>>>    		 * to cover the pipeline delay.
>>>>>>    		 */
>>>>>>    		.set	mips32
>>>>>> -		mfc0	v1, CP0_TCSTATUS
>>>>>> +		mfc0	v0, CP0_TCSTATUS
>>>>>>    		.set	mips0
>>>>>> -		LONG_S	v1, PT_TCSTATUS(sp)
>>>>>> +		LONG_S	v0, PT_TCSTATUS(sp)
>>>>>>    #endif /* CONFIG_MIPS_MT_SMTC */
>>>>>>    		LONG_S	$4, PT_R4(sp)
>>>>>>    		LONG_S	$5, PT_R5(sp)
>>>>>>
>>>>>>
>>>>>>> /K.
>>>>>>>
>>>>>>> On 12/24/10 6:39 AM, Anoop P A wrote:
>>>>>>>> Hi Kevin, Stuart ,
>>>>>>>>
>>>>>>>> Woohooo You guys spotted !.
>>>>>>>>
>>>>>>>>     http://git.linux-mips.org/?p=linux.git;a=commit;h=d5ec6e3c seems to be
>>>>>>>> the culprit
>>>>>>>>
>>>>>>>> Once I restored previous version of stackframe.h 2.6.33-stable started
>>>>>>>> booting !.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anoop
>>>>>>>>
>>>>>>>> On Fri, 2010-12-24 at 04:32 -0800, Kevin D. Kissell wrote:
>>>>>>>>> Thank you, Stuart!  I've spotted some definite breakage to SMTC between
>>>>>>>>> those versions.  In arch/mips/include/asm/stackframe.h, someone moved
>>>>>>>>> the store of the Status register value in SAVE_SOME (line 169 or 204,
>>>>>>>>> depending on the version) from two instructions after the mfc0 to a
>>>>>>>>> point after the #ifdef for SMTC, presumably to get better pipelining of
>>>>>>>>> the register access.  Unfortunately, the v1 register is also used in the
>>>>>>>>> SMTC-specific fragment to save TCStatus, so the Status value gets
>>>>>>>>> clobbered before it gets stored.  This will eventually result in the
>>>>>>>>> Status register getting a TCStatus value, which has some bits on common,
>>>>>>>>> but isn't identical and sooner or later Bad Things will happen.
>>>>>>>>>
>>>>>>>>> I'm a little surprised this wasn't caught by visual inspection of the patch.
>>>>>>>>>
>>>>>>>>> Possible solutions would include reverting the store of the CP0_STATUS
>>>>>>>>> value to the block above the #ifdef, or, to retain whatever performance
>>>>>>>>> advantage was obtained by moving the store downward, to use v0/$2
>>>>>>>>> instead of v1/$3, as the staging register for the TCStatus value.  I'd
>>>>>>>>> lean toward the second option, but I'm not in a position to test and
>>>>>>>>> submit a patch just now.
>>>>>>>>>
>>>>>>>>>                 Regards,
>>>>>>>>>
>>>>>>>>>                 Kevin K.
>>>>>>>>>
>>>>>>>>> On 12/23/10 1:09 PM, STUART VENTERS wrote:
>>>>>>>>>> Kevin,
>>>>>>>>>>
>>>>>>>>>> I'm not sure if it's useful,
>>>>>>>>>>        but finally I got the time to look at the two kernel versions Anoop pointed out.
>>>>>>>>>>         works   2.6.32-stable with patch 804
>>>>>>>>>>         works_not 2.6.33-stable
>>>>>>>>>>
>>>>>>>>>> greping for files with CONFIG_MIPS_MT_SMTC
>>>>>>>>>>        and looking for timer interrupt related stuff found the following differences:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> arch/mips/include/asm/irq.h
>>>>>>>>>> arch/mips/kernel/irq.c
>>>>>>>>>>       do_IRQ
>>>>>>>>>>
>>>>>>>>>> arch/mips/include/asm/stackframe.h
>>>>>>>>>>       SAVE_SOME SAVE_TEMP get/set_saved_sp
>>>>>>>>>>
>>>>>>>>>> arch/mips/include/asm/time.h
>>>>>>>>>>       clocksource_set_clock
>>>>>>>>>>
>>>>>>>>>> arch/mips/kernel/process.c
>>>>>>>>>>       cpu_idle
>>>>>>>>>>
>>>>>>>>>> arch/mips/kernel/smtc.c
>>>>>>>>>>       __irq_entry
>>>>>>>>>>       ipi_decode
>>>>>>>>>>           SMTC_CLOCK_TICK
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Enclosed are the two subsets of files for a more expert look.
>>>>>>>>>>
>>>>>>>>>> I'll try to look in more detail after Christmas.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Stuart
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>




[Index of Archives]     [Linux MIPS Home]     [LKML Archive]     [Linux ARM Kernel]     [Linux ARM]     [Linux]     [Git]     [Yosemite News]     [Linux SCSI]     [Linux Hams]

  Powered by Linux