Re: [nForce4] - Repeatable issues with nForce 4

Jacobo Pantoja <jacobopantoja@xxxxxxxxx> · Sun, 30 Nov 2014 12:03:17 +0100

Hello,

It took me a while, but I finally found time to rebuild the kernel with
the libata debug options enabled and reproduce the lockup with the
ultra-verbose output.
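
For anyone who wants to repeat the rebuild: the change Robert suggested
amounts to flipping two #undef lines in include/linux/libata.h. A minimal
sketch of that step, assuming the command is run from the top of the
kernel source tree (enable_libata_debug is just an illustrative helper
name, not something that exists in the tree):

```shell
# Sketch: turn "#undef ATA_DEBUG" and "#undef ATA_VERBOSE_DEBUG" into
# "#define ..." in libata.h. enable_libata_debug is a hypothetical helper;
# the default path is the header named in the quoted advice below.
enable_libata_debug() {
    header="${1:-include/linux/libata.h}"
    # Match only the two debug switches; keep any trailing comment intact.
    # Similar names such as ATA_NDEBUG are left untouched.
    sed -E 's/^#undef([[:space:]]+ATA_(VERBOSE_)?DEBUG)([[:space:]]|$)/#define\1\3/' \
        "$header" > "$header.tmp" && mv "$header.tmp" "$header"
}
```

After running enable_libata_debug the kernel still has to be rebuilt and
installed as usual; expect a very large amount of console output.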

Three out of the four lockups (1, 2 and 4) look identical, but number 3
looks different. The trigger was the same each time: connect through
ssh (the verbose console output made it impossible to work locally),
start dd'ing from the disk to /dev/null over an area with some bad
sectors, and wait for the lockup.
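
The read side of the test can be sketched roughly as below. This is a
minimal sketch, not the exact command line I typed: scan_range is a
made-up helper name, and the device and sector numbers in the usage line
are placeholders rather than the values from my disk.

```shell
# Sketch: read a sector range in chunks with dd and report chunks that fail.
# scan_range is a hypothetical helper. Arguments: device, first sector,
# number of sectors, and (optionally) sectors per chunk. Sector size is
# assumed to be 512 bytes.
scan_range() {
    dev="$1"; start="$2"; count="$3"; chunk="${4:-2048}"
    pos="$start"
    end=$((start + count))
    while [ "$pos" -lt "$end" ]; do
        n="$chunk"
        # Shorten the last chunk so we never read past the requested range.
        if [ $((pos + n)) -gt "$end" ]; then
            n=$((end - pos))
        fi
        # dd exits non-zero when the read fails (e.g. on an uncorrectable sector).
        if ! dd if="$dev" of=/dev/null bs=512 skip="$pos" count="$n" 2>/dev/null; then
            echo "read error in sectors $pos..$((pos + n - 1))"
        fi
        pos=$((pos + n))
    done
}
```

Usage would be something like "scan_range /dev/sdX 1000000 200000" as
root. With GNU dd, adding iflag=direct to the dd line bypasses the page
cache so every read really reaches the drive. With ADMA enabled, my
machine locks up during the read instead of reporting the error.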

It is 100% reproducible, at least for the moment.

The link with the 4 photos:
https://drive.google.com/folderview?id=0B4EqBXYvV-kTR2daRm1GYVBDbWs&usp=sharing

Any ideas on what to test next?

Best regards,
Jpantoja

On 16 September 2014 at 04:47, Robert Hancock <hancockrwd@xxxxxxxxx> wrote:
> On Mon, Sep 15, 2014 at 6:41 AM, Jacobo Pantoja <jacobopantoja@xxxxxxxxx> wrote:
>> Dear all,
>>
>> Thank you for taking your time to answer. See my comments below.
>>
>> On 14 September 2014 22:03, Robert Hancock <hancockrwd@xxxxxxxxx> wrote:
>>> On Sun, Sep 14, 2014 at 3:37 AM, Tejun Heo <tj@xxxxxxxxxx> wrote:
>>>>
>>>> (cc'ing Robert Hancock)
>>>>
>>>> Hello,
>>>>
>>>> On Sat, Sep 13, 2014 at 11:50:08PM +0200, Jacobo Pantoja wrote:
>>>> > (Sorry if you receive twice, I have noticed that the first email had
>>>> > blank subject)
>>>> > Dear Tejun Heo and linux-ide team,
>>>> >
>>>> > I'm Jacobo Pantoja, a technology enthusiast and electronics engineer.
>>>> > I have my ("beloved") computer with an nForce4 chipset, and I have
>>>> > almost always had the ADMA interface enabled. The board is an ASUS
>>>> > A8N-E, reportedly with the CK804 chipset, in case that is relevant.
>>>> >
>>>> > As suggested by Tejun, I'm sending my problem to the list.
>>>> >
>>>> > I noticed that from time to time the machine froze, but I was not
>>>> > able to identify the trigger until yesterday.
>>>> >
>>>> > I noticed that one of my 2 TB drives had a few sectors marked as
>>>> > "pending reallocation" but not reallocated. When this has happened
>>>> > to me before (on different computers, though), I solved it by dd'ing
>>>> > the whole disk, locating the bad sector(s) and filling them with
>>>> > zeroes. So I tried that here... and discovered that when a read hits
>>>> > a bad sector, the system locks up.
>>>> >
>>>> > You may find attached:
>>>> > * dmesg with ADMA enabled (not including the moment of the error,
>>>> >        because the computer freezes)
>>>> > * photo taken at the moment of the error with ADMA enabled
>>>> > * dmesg with ADMA disabled, including the moment of the error
>>>> >
>>>> > This is totally reproducible**, and I am willing to do any additional
>>>> > testing that may help in solving this issue, if there is any interest.
>>>> >
>>>> > **While trying to capture clean dmesg output, I noticed that if I do
>>>> > the read with ADMA disabled, the sector may be marked (as expected)
>>>> > as a definitively bad block and then reallocated. Since the drive
>>>> > still has a few bad blocks, we have some chances left to reproduce
>>>> > the problem, but I don't know for sure how many tries we have.
>>>>
>>>> You can create bad blocks using hdparm --make-bad-sector on most
>>>> drives.
>>>>
>>
>> If I understand correctly, the lockups occur when trying to read bad
>> sectors, prior to reallocating them. I have read hdparm's man page,
>> but it is not clear to me whether it will have the same effect
>> (e.g. is it going to time out in the same way?). I can check that, but
>> first I need to make a full backup.
>
> I think normally the drive reacts in the same way as any other kind of
> bad sector, but it likely depends on the specific drive.
>
>>
>>>> So, the controller locks up the whole machine while trying to handle a
>>>> UNC error.  Heh, it even times out on READ_LOG_EXT during EH.
>>>> Unfortunately, I'm not sure there's much we can do at this point.
>>>> IIRC, NV ADMA support never really matured which is why it never got
>>>> turned on by default.  I wouldn't be too surprised if the issue is
>>>> with the controller itself.  Quite a few of these first-gen NCQ
>>>> controllers were quite flaky after all.  Robert should know a lot
>>>> better than me tho.
>>
>> OK, the question is whether there is anything left to test before
>> giving up on ADMA mode for this controller for good. For me it is not
>> that important to have it working, but since the hardware is in place,
>> my technologist heart tells me to use it. In any case, I can
>> definitely live without it.
>>
>>>
>>>
>>> I don't have much great insight, but it seems like these controllers
>>> definitely have some issues with error handling. From what I saw, some types
>>> of errors would basically cause the controller to seize up and not respond
>>> properly to CPU requests on the HT bus (there were some reports of MCE
>>> errors referring to HT timeouts). I've seen the CK804 lock up Windows, I
>>> think with either the NVIDIA or the default Microsoft IDE drivers installed,
>>> when doing things like reading a damaged DVD on an optical drive connected
>>> to the CK804 SATA controller, which leads me to suspect it's some kind of
>>> hardware issue that we may not be able to get around (even not using ADMA
>>> doesn't appear to be a complete solution). I've asked NVIDIA for help about
>>> some of the issues that were reported but it seems like they mostly clammed
>>> up on this particular subject.
>>>
>>> It seems like these controllers were tested with, and work fine with, hard
>>> drives that don't have any bad sectors or other issues, but as soon as
>>> errors start happening things start to fall apart. They came out a bit
>>> before optical drives on SATA started becoming commonplace where they would
>>> have had to deal with more error handling.
>>
>> My main concern is that the whole computer freezes. Is there any
>> additional kernel debug switch or similar that may help in
>> understanding the problem?
>
> Turning on some or all of the libata debug options (at the cost of a
> likely huge amount of output) may be useful. You can try changing the
> "#undef ATA_DEBUG" and "#undef ATA_VERBOSE_DEBUG" lines in
> include/linux/libata.h to #define and rebuilding the kernel. If you
> can reproduce the lockup after that, it may indicate where it's
> occurring, though you may still need to add more output to narrow it
> down.
>
> It's quite possible that there's no reasonable workaround, but I don't
> know that anyone has taken the effort to debug this very thoroughly. I
> don't have a CK804 machine anymore so I can't provide too much
> first-hand assistance myself.
>
>>
>> I have seen some obscurity regarding ADMA for CK804 in the kernel
>> commits, but if we can isolate and reproduce the problems, perhaps we
>> can find a workaround.
>>
>>
>> I have another (unrelated) annoyance: why do my emails not appear
>> in the mailing list archives [1]? I'm not sure whether the people
>> subscribed to the list are receiving my emails, or only your responses
>> to them.
>
> Looks like at least this email made it to the list.