Re: [PATCH 2/2] PCI: fix system hang issue of Marvell SATA host controller

Myron Stowe <myron.stowe@xxxxxxxxx> · Mon, 11 Mar 2013 15:19:36 -0600

On Mon, Mar 11, 2013 at 3:15 AM, Xiangliang Yu <yuxiangl@xxxxxxxxxxx> wrote:
> Hi, Myron
>
>> >>> >> > Fix system hang issue: if first accessed resource file of BAR0 ~
>> >>> >> > BAR4, system will hang after executing lspci command
>> >
>> > Any question? Thanks!
>>
>> Googling and looking at the PCI IDs data base I see that the Marvell
>> 9125 device has been around since sometime around 2010 and that there
>> even seem to be a number of follow-on iterations of the chip (i.e.
>> 9128, 9120, ...).  It seems incredibly unlikely that Marvell made a
>> device that has been shipping for 2+ years with five I/O BARs that do
>> not work and we are only now finding out such.
> Only the 9125 has the issue.
>
>> Am I missing something relevant here?  Can you verify that this device
>> is indeed not new and has been successfully used in recent
>> platforms?
> The device can be used in recent platforms.

Could you please be a little more explicit (and I'll try to be more
specific in my questions) as I was not able to get much, if any,
understanding from the responses.

I would like to understand whether the 9125 device has had issues
with accesses to the I/O port space mapped by its BARs from
the very beginning - i.e. have there been no platforms in the last 2+
years that have been able to successfully drive this device through
its I/O BARs?
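
For reference, which BARs of a device are I/O port BARs can be read from
the per-device sysfs "resource" file, where each line holds start, end
and flags in hex and the first six lines correspond to BAR0..BAR5.  The
sketch below is only an illustration: the sample values are made up (they
are not taken from a real 9125), and the helper name classify_bars is
hypothetical.  The flag values are the kernel's IORESOURCE_IO and
IORESOURCE_MEM bits.

```python
# Flag bits as defined in the kernel's include/linux/ioport.h.
IORESOURCE_IO = 0x00000100
IORESOURCE_MEM = 0x00000200

# Made-up sample in the format of /sys/bus/pci/devices/<BDF>/resource.
# These addresses are illustrative only, not from real hardware.
SAMPLE_RESOURCE = """\
0x000000000000e050 0x000000000000e057 0x0000000000040101
0x000000000000e040 0x000000000000e043 0x0000000000040101
0x000000000000e030 0x000000000000e037 0x0000000000040101
0x000000000000e020 0x000000000000e023 0x0000000000040101
0x000000000000e000 0x000000000000e00f 0x0000000000040101
0x00000000f7d10000 0x00000000f7d107ff 0x0000000000040200
"""


def classify_bars(resource_text):
    """Return (bar_index, kind, start, end) for the six standard BARs."""
    bars = []
    for index, line in enumerate(resource_text.splitlines()[:6]):
        start, end, flags = (int(field, 16) for field in line.split())
        if flags & IORESOURCE_IO:
            kind = "io"
        elif flags & IORESOURCE_MEM:
            kind = "mem"
        else:
            kind = "unused"
        bars.append((index, kind, start, end))
    return bars


if __name__ == "__main__":
    for index, kind, start, end in classify_bars(SAMPLE_RESOURCE):
        print(f"BAR{index}: {kind:6s} {start:#x}-{end:#x}")
```

On a real system the text would come from reading the device's
"resource" file rather than from the sample string.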

What seems more likely is that only now, for some new and as yet
unknown reason, are issues with accesses to the I/O port
space mapped by its BARs occurring - perhaps something to do with a
new processor or chipset.

Are you seeing any similar issues when booting Windows on the same platform?

This information could be helpful in tracking down the root cause.

>
>> You just recently responded with  "... I just got the info from HP.
>> ..." so I'm assuming this is an issue that has just been encountered
>> on some type of HP system - is this correct?  If so, do you have
>> access to the system to provide the logs I asked for earlier?  Also,
>> is there anything special or completely new about this platform that
>> would explain away the arguments for why this is probably not a
>> Marvell device issue?
> I can reproduce the issue with the following platform:
> CPU: Intel i7-3770 3.40GHz
> OS: CentOS 6.4

CentOS 6.4 ships a fairly old kernel by now - 2.6.32.  Have you been
able to try an upstream kernel and, if so, what were the results?

>
> Now, the situation is like this:
> I captured the PCIE trace with analyzer and found that 1st BE is 0x1111 when
> accessing IO port space. But 9125 spec has some limitation, and the BE must
> be 0x0100, to access the 2nd byte only. So, the chip will go to bad.

Great, this is new, interesting data.  Is the 9125 spec publicly
accessible and/or could you elaborate on the "some limitation"
comment?

I'm fairly sure that PCI Express supports byte-granular accesses to
I/O port space (I'll try to read up on this some more as I don't
usually work at this low of a level) and it seems unlikely that this
area would be broken in a chipset, especially an Intel one.

A byte enable (BE) of 0x1111 suggests the CPU did a 32-bit I/O port
read.  Does the 9125 device only support one-byte I/O port accesses
and when presented with larger request types it doesn't respond
properly?  I have to admit I don't know what the correct response
would be - perhaps a master abort.  Do you know what the PCI host
controller would return to the CPU so the CPU wouldn't hang in such a
case?
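
To make the byte-enable reasoning above concrete, here is a minimal
sketch of how the 4-bit "1st DW BE" field follows from the port address
and the access width.  It assumes the values quoted from the trace
("0x1111", "0x0100") are bit patterns (1111b, 0100b) rather than hex
numbers, and that bit i of the field enables byte i of the DWORD.  The
function name and the port addresses are hypothetical.

```python
def first_dw_byte_enables(port, size):
    """Return the 4-bit first-DWORD byte-enable mask for one I/O access.

    port: I/O port address of the access
    size: access width in bytes (1, 2 or 4)
    """
    if size not in (1, 2, 4):
        raise ValueError("I/O accesses are 1, 2 or 4 bytes wide")
    offset = port & 0x3  # byte position within the naturally aligned DWORD
    if offset + size > 4:
        # Such an access cannot be expressed in a single DWORD request.
        raise ValueError("access crosses a DWORD boundary")
    return ((1 << size) - 1) << offset


if __name__ == "__main__":
    # Hypothetical port addresses, chosen only to show the offsets.
    for port, size in [(0xE000, 4), (0xE000, 1), (0xE002, 1), (0xE002, 2)]:
        mask = first_dw_byte_enables(port, size)
        print(f"port {port:#06x}, {size}-byte access -> BE {mask:04b}b")
```

Under these assumptions a 4-byte access at a DWORD-aligned port gives
1111b, matching the trace, while 0100b corresponds to a one-byte access
at offset 2 within the DWORD.  Whether that is the "2nd byte" depends on
whether bytes are counted from 0 or from 1, which the 9125 spec would
need to confirm.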

> Can you tell me what I can do to fix the issue? Thanks!

Once we understand the root cause I'm sure we'll be able to come up
with a solution.  Let's keep homing in on the problem for now until we
get to that understanding.