Re: [PATCH] BIOS SATA legacy mode failure

Robert Hancock <hancockrwd@xxxxxxxxx> · Tue, 15 Oct 2013 18:16:41 -0600

On Sun, Oct 13, 2013 at 6:02 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
> On 2013-10-13 07:57, Robert Hancock wrote:
>> On Sat, Oct 12, 2013 at 3:29 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>> On 2013-10-12 04:06, Robert Hancock wrote:
>>>> On Fri, Oct 11, 2013 at 10:07 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>>>> On 2013-10-01 06:25, Robert Hancock wrote:
>>>>>> On Sat, Sep 28, 2013 at 7:21 PM, Robert Hancock <hancockrwd@xxxxxxxxx> wrote:
>>>>>>> On Sat, Sep 28, 2013 at 11:46 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>>>>>>> On 2013-09-28 06:55, Robert Hancock wrote:
>>>>>>>>
>>>>>>>>> On Fri, Sep 27, 2013 at 7:24 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>> On 2013-09-25 08:31, Robert Hancock wrote:
>>>>>>>>>>
>>>>>>>>>>> On Sun, Sep 22, 2013 at 1:13 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 2013-09-21 19:04, Robert Hancock wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Sep 21, 2013 at 1:35 AM, Levente Kurusa <levex@xxxxxxxxx>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The following dmesg is stuck in an infinite loop.
>>>>>>>>>>>>>>>>>>>> dmesg:
>>>>>>>>>>>>>>>>>>>> ata3: lost interrupt (Status 0x50)
>>>>>>>>>>>>>>>>>>>> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
>>>>>>>>>>>>>>>>>>>> frozen
>>>>>>>>>>>>>>>>>>>> ata3.00: failed command: READ DMA
>>>>>>>>>>>>>>>>>>>> ata3.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096
>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>                     res 40/00:00:00:00:00/00:00:00:00:00/00
>>>>>>>>>>>>>>>>>>>> Emask
>>>>>>>>>>>>>>>>>>>> 0x4
>>>>>>>>>>>>>>>>>>>> (timeout)
>>>>>>>>>>>>>>>>>>>> ata3.00: status: { DRDY }
>>>>>>>>>>>>>>>>>>>> ata3: soft resetting link
>>>>>>>>>>>>>>>>>>>> ata3.00: configured for UDMA/33 (no error)
>>>>>>>>>>>>>>>>>>>> ata3.00: device reported invalid CHS sector 0
>>>>>>>>>>>>>>>>>>>> ata3: EH complete
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Patch that fixes the infinite loop:
>>>>>>>>>>>>>>>>>>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>>>>> index f9476fb..eeedf80 100644
>>>>>>>>>>>>>>>>>>>> --- a/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>>>>> +++ b/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>>>>> @@ -2437,6 +2437,14 @@ static void ata_eh_link_report(struct
>>>>>>>>>>>>>>>>>>>> ata_link
>>>>>>>>>>>>>>>>>>>> *link)
>>>>>>>>>>>>>>>>>>>>                                   ehc->i.action, frozen,
>>>>>>>>>>>>>>>>>>>> tries_buf);
>>>>>>>>>>>>>>>>>>>>                       if (desc)
>>>>>>>>>>>>>>>>>>>>                               ata_dev_err(ehc->i.dev, "%s\n",
>>>>>>>>>>>>>>>>>>>> desc);
>>>>>>>>>>>>>>>>>>>> +               ehc->i.dev->exce_cnt ++;
>>>>>>>>>>>>>>>>>>>> +               ata_dev_warn(ehc->i.dev, "Number of exceptions:
>>>>>>>>>>>>>>>>>>>> %d\n",
>>>>>>>>>>>>>>>>>>>> ehc->i.dev->exce_cnt);
>>>>>>>>>>>>>>>>>>>> +               /**
>>>>>>>>>>>>>>>>>>>> +                  * The device is failing terribly,
>>>>>>>>>>>>>>>>>>>> +                 * disable it to prevent damage.
>>>>>>>>>>>>>>>>>>>> +                 */
>>>>>>>>>>>>>>>>>>>> +               if(ehc->i.dev->exce_cnt > 2)
>>>>>>>>>>>>>>>>>>>> +                       ata_dev_disable(ehc->i.dev);
>>>>>>>>>>>>>>>>>>>>               } else {
>>>>>>>>>>>>>>>>>>>>                       ata_link_err(link, "exception Emask 0x%x
>>>>>>>>>>>>>>>>>>>> "
>>>>>>>>>>>>>>>>>>>>                                    "SAct 0x%x SErr 0x%x action
>>>>>>>>>>>>>>>>>>>> 0x%x%s%s\n",
>>>>>>>>>>>>>>>>>>>> diff --git a/include/linux/libata.h b/include/linux/libata.h
>>>>>>>>>>>>>>>>>>>> index eae7a05..fa52ee6 100644
>>>>>>>>>>>>>>>>>>>> --- a/include/linux/libata.h
>>>>>>>>>>>>>>>>>>>> +++ b/include/linux/libata.h
>>>>>>>>>>>>>>>>>>>> @@ -660,7 +660,8 @@ struct ata_device {
>>>>>>>>>>>>>>>>>>>>               u8
>>>>>>>>>>>>>>>>>>>> devslp_timing[ATA_LOG_DEVSLP_SIZE];
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>               /* error history */
>>>>>>>>>>>>>>>>>>>> -       int                     spdn_cnt;
>>>>>>>>>>>>>>>>>>>> +       int                     spdn_cnt; /* Number of
>>>>>>>>>>>>>>>>>>>> speed_downs
>>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>> +       int                     exce_cnt; /* Number of
>>>>>>>>>>>>>>>>>>>> exceptions
>>>>>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>>>>>> happenned */
>>>>>>>>>>>>>>>>>>>>               /* ering is CLEAR_END, read comment above
>>>>>>>>>>>>>>>>>>>> CLEAR_END
>>>>>>>>>>>>>>>>>>>> */
>>>>>>>>>>>>>>>>>>>>               struct ata_ering        ering;
>>>>>>>>>>>>>>>>>>>>        };
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This doesn't seem like a very good fix. It may prevent the
>>>>>>>>>>>>>>>>>>> apparent
>>>>>>>>>>>>>>>>>>> infinite loop but will just prevent that device from functioning
>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>> all.
>>>>>>>>>>>>>>>>>>> It would be better if we could figure out what was actually
>>>>>>>>>>>>>>>>>>> going
>>>>>>>>>>>>>>>>>>> wrong.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have tested the problem with three different computers, all
>>>>>>>>>>>>>>>>>> switched
>>>>>>>>>>>>>>>>>> to legacy/IDE/compatibility mode, and they didn't have this
>>>>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>>>>>> Of
>>>>>>>>>>>>>>>>>> course, they could have been set to AHCI mode, and there the
>>>>>>>>>>>>>>>>>> kernel
>>>>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>>>> boot normally. Feels strange, but so far I was only able to
>>>>>>>>>>>>>>>>>> reproduce
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> problem with a Toshiba MK8052GSX. On the topic of my patch, I
>>>>>>>>>>>>>>>>>> still
>>>>>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>>>>>> see why a device which fails so terribly that it reports 3
>>>>>>>>>>>>>>>>>> exceptions
>>>>>>>>>>>>>>>>>> shouldn't be disabled. Like in this case, it could cause infinite
>>>>>>>>>>>>>>>>>> loops.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The problem is that this could happen in some cases when you
>>>>>>>>>>>>>>>>> wouldn't
>>>>>>>>>>>>>>>>> want to disable the device, like an error that just happens
>>>>>>>>>>>>>>>>> sporadically and works on retry, or a device you're trying to
>>>>>>>>>>>>>>>>> recover
>>>>>>>>>>>>>>>>> data from.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What do you think if I edit the patch in a way, that when an
>>>>>>>>>>>>>>>> operation
>>>>>>>>>>>>>>>> successfully completes, it resets exce_cnt to zero. Might as well
>>>>>>>>>>>>>>>> add
>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>> module_param, which can set the maximum value of exce_cnt, while
>>>>>>>>>>>>>>>> having
>>>>>>>>>>>>>>>> zero
>>>>>>>>>>>>>>>> as an option to never disable the device. Please don't get me
>>>>>>>>>>>>>>>> wrong,
>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>> don't want to force this patch, I just want to learn how all this
>>>>>>>>>>>>>>>> works,
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> in the process try to make it better. :-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That would be better, but I think you're still going to have an
>>>>>>>>>>>>>>> issue
>>>>>>>>>>>>>>> with what magic number to pick to avoid disabling devices
>>>>>>>>>>>>>>> inappropriately.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Conceptually, disabling the device doesn't really make sense anyway.
>>>>>>>>>>>>>>> If someone in userspace wants to keep trying to read from that
>>>>>>>>>>>>>>> device,
>>>>>>>>>>>>>>> why would you stop them because of some arbitrary judgement? The
>>>>>>>>>>>>>>> kernel itself isn't "locked up" during this process, anything not
>>>>>>>>>>>>>>> blocked on I/O to that device should be able to continue running, so
>>>>>>>>>>>>>>> that process is only hurting itself. If the system fails to boot
>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>> another device due to this, this would likely point out some kind of
>>>>>>>>>>>>>>> problem in userspace or the distro boot process being overly
>>>>>>>>>>>>>>> serialized.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have been booting up with the initramfs from ubuntu 13.04,
>>>>>>>>>>>>>> and I have also tried to boot with the ubuntu install cd. They
>>>>>>>>>>>>>> couldn't
>>>>>>>>>>>>>> continue the boot process. I'm gonna spend the weekend trying to
>>>>>>>>>>>>>> figure
>>>>>>>>>>>>>> out where and why the interrupts don't happen. Whether it be a
>>>>>>>>>>>>>> routing
>>>>>>>>>>>>>> or a hardware issue, which I highly doubt due to the fact that
>>>>>>>>>>>>>> Windows
>>>>>>>>>>>>>> XP SP2 was able to boot up without errors.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are you able to get out full dmesg output from a boot attempt and the
>>>>>>>>>>>>> contents of /proc/interrupts?
>>>>>>>>>>>>>
>>>>>>>>>>>> As I said before, I am not able to get to the shell, without my
>>>>>>>>>>>> 'symptom
>>>>>>>>>>>> cure'. With my patch I get the following dmesg output, with
>>>>>>>>>>>> some of my debug messages turned off:
>>>>>>>>>>>> http://pastebin.com/5eb5G3Dx
>>>>>>>>>>>> /proc/interrupts is here:
>>>>>>>>>>>> http://pastebin.com/84CJey2D
>>>>>>>>>>>> After yesterday's research, I have come to ata_piix.c . That file looks
>>>>>>>>>>>> like
>>>>>>>>>>>> the real culprit, as my netbook's controller is an Intel ICH7M one.
>>>>>>>>>>>> The values I am getting from the device are very different than those
>>>>>>>>>>>> that are expected.
>>>>>>>>>>>>
>>>>>>>>>>>> Things I have noticed, but ignored in dmesg:
>>>>>>>>>>>> There is a stack dump, because nobody cared about IRQ#20. I have
>>>>>>>>>>>> ignored
>>>>>>>>>>>> this because it is the EHCI IRQ, and I suppose it has nothing to do
>>>>>>>>>>>> with
>>>>>>>>>>>> ata. The problem is with ata3 or /dev/sdc, while the IRQ happens
>>>>>>>>>>>> with /dev/sda, which works fine.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I think it is likely related to the problem. The kernel thinks this
>>>>>>>>>>> controller is on IRQ 16, but apparently something is raising
>>>>>>>>>>> un-acknowledged interrupts on IRQ 20 and nothing is coming in on IRQ
>>>>>>>>>>> 16. It seems quite likely that this is actually the ATA controller.
>>>>>>>>>>>
>>>>>>>>>>> You mentioned that Windows XP was able to work in this mode. I wonder
>>>>>>>>>>> if it was using the IOAPIC, as if not then the IRQ routing is
>>>>>>>>>>> different which might mask the problem. Do you know what IRQ Device
>>>>>>>>>>> Manager reported for this controller in Windows? And was it using any
>>>>>>>>>>> IRQs over 15 (which would indicate the IOAPIC was in use)?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hmm, according to WinXP's Device manager for this controller,
>>>>>>>>>> it listens to IRQ# 20, and therefore it is using the I/O APIC.
>>>>>>>>>> Now, one question remains: where is the error that mismaps the
>>>>>>>>>> controller?
>>>>>>>>>> I have created a simple patch which seems to fix this:
>>>>>>>>>> ---
>>>>>>>>>> @@ -1704,6 +1767,8 @@ static int piix_init_one(struct pci_dev *pdev,
>>>>>>>>>> const
>>>>>>>>>> struct pci_device_id *ent)
>>>>>>>>>>                  hpriv->map = piix_init_sata_map(pdev, port_info,
>>>>>>>>>>
>>>>>>>>>> piix_map_db_table[ent->driver_data]);
>>>>>>>>>>
>>>>>>>>>> +       if(pdev->vendor == 0x8086 && pdev->device == 0x27C4)
>>>>>>>>>> +               pdev->irq = 20;
>>>>>>>>>>          rc = ata_pci_bmdma_prepare_host(pdev, ppi, &host);
>>>>>>>>>>          if (rc)
>>>>>>>>>>                  return rc;
>>>>>>>>>>
>>>>>>>>>> However, I am more than sure that this is not the way
>>>>>>>>>> to solve this problem. Do you have any idea on where
>>>>>>>>>> the ideal place would be to implement a fix?
>>>>>>>>>> According to specs of ICH7M, which is essentially the
>>>>>>>>>> same as ICH6M, we need to check on what interrupt pin
>>>>>>>>>> is the SATA controller, and after that check which IRQ line
>>>>>>>>>> is connected to the I/O APIC and decide the IRQ's number
>>>>>>>>>> on those findings.
>>>>>>>>>>
>>>>>>>>>> Specs of ICH7:
>>>>>>>>>>
>>>>>>>>>> http://www.intel.com/content/dam/doc/datasheet/i-o-controller-hub-7-datasheet.pdf
>>>>>>>>>> Device 31 Interrupt Route Register: Chapter 7.1.46
>>>>>>>>>> Device 31 Interrupt Pin Register: Chapter 7.1.41
>>>>>>>>>>
>>>>>>>>>> The SATA controller is always Device 31.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It would appear that something is messing up with the ACPI IRQ routing
>>>>>>>>> on this machine that's causing us to think the controller is on the
>>>>>>>>> wrong IRQ. CCing the linux-acpi list to see if anyone has some
>>>>>>>>> additional debugging suggestions. I suspect that dumping the DSDT is
>>>>>>>>> likely the first step though. If you can get IASL installed, you can
>>>>>>>>> do something like:
>>>>>>>>>
>>>>>>>>> cat /sys/firmware/acpi/tables/DSDT > dsdt.aml
>>>>>>>>> iasl -d dsdt.aml
>>>>>>>>>
>>>>>>>>> That should spit out a dsdt.dsl file which would hopefully have the
>>>>>>>>> info needed to figure out what's going on.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Here is the disassembled DSDT table:
>>>>>>>> http://pastebin.com/LWNVht9H
>>>>>>>> The SATA controller is at line 5206.
>>>>>>>> I also disassembled the SSDT, but nothing interesting was there:
>>>>>>>> http://pastebin.com/fus5sxU8
>>>>>>>>
>>>>>>>> I disabled the usage of ACPI for IRQs with acpi=noirq,
>>>>>>>> and it successfully booted up setting itself to IRQ#3.
>>>>>>>> This makes me think that this is the BIOS's fault.
>>>>>>>> I think it would be possible to create a DMI check
>>>>>>>> and forcibly set the irq to 20 if the DMI matches.
>>>>>>>> Any comments on this?
>>>>>>>
>>>>>>> The BIOS may be doing something funky, but since Windows apparently
>>>>>>> can figure out it's on IRQ 20, Linux presumably should be able to as
>>>>>>> well. DMI checks should be the last resort - Windows almost certainly
>>>>>>> doesn't have any machine-specific logic here, and it's hard to tell
>>>>>>> what other machine models could be affected. With ACPI stuff, we
>>>>>>> generally just need to do the same thing Windows does for things to
>>>>>>> work reliably, and DMI checks are more of a hack workaround than a
>>>>>>> real fix.
>>>>>>>
>>>>>>> I'll try and have a look at the DSDT within the next few days and see
>>>>>>> if I can figure anything out, unless someone beats me to it.
>>>>>>
>>>>>> I haven't gone into too much detail, but one thing I noticed with the
>>>>>> DSDT is that there appear to be some _OSI checks for Windows 2006
>>>>>> (i.e. Vista) that seem to affect various things, including potentially
>>>>>> the PCI IRQ routing table. It's possible that their IRQ routing table
>>>>>> is broken for legacy mode with an ACPI OS supporting Vista (as current
>>>>>> Linux versions do). Could be this slipped through testing if they only
>>>>>> tested AHCI mode with Vista installed.
>>>>>>
>>>>>> You can try booting with the kernel parameters
>>>>>>
>>>>>> acpi_osi=! acpi_osi="Windows 2001 SP3"
>>>>>>
>>>>>> That should make the BIOS think we are Windows XP and bypass the Vista
>>>>>> code path. If that works, then you might want to check for a BIOS
>>>>>> update on this machine.
>>>>>>
>>>>>
>>>>> First of all, sorry for the late reply. I was kinda busy.
>>>>>
>>>>> I tried what you suggested but unfortunately the problem persists.
>>>>> This makes me believe that Windows XP does have some kind of DMI check here.
>>>>> Of course, while a BIOS update may solve this, I would prefer that Linux
>>>>> should also be able to boot up with this broken BIOS as well.
>>>>>
>>>>> If you are certain that WinXP doesn't use DMI checks,
>>>>> it could be that WinXP's driver of ICH7M's SATA controller applies
>>>>> a quirk and sets that irq line to #20.
>>>>
>>>> Can you post the dmesg output from a bootup attempt with those options?
>>>>
>>>> You may also want to try adding just: acpi_osi=!
>>>>
>>>
>>> None of the three combinations booted successfully.
>>>
>>> Here are a couple of dmesgs:
>>>
>>> Params: acpi_osi="Windows 2001 SP3"
>>> http://pastebin.com/vF3BSuhc
>>>
>>> Params: acpi_osi=! acpi_osi="Windows 2001 SP3"
>>> http://pastebin.com/BuUzc3es
>>>
>>> Params: acpi_osi=!
>>> http://pastebin.com/u7uRx8Ru
>>
>> I'm not sure the option is actually taking effect properly. There
>> should be a message "Disabled all _OSI OS vendors" that shows up in
>> dmesg with the ! option. Can you try:
>>
>> acpi_osi="!" acpi_osi="Windows 2001 SP3"
>>
>> (with the quotes around the ! character).
>>
>
> The following command line worked:
> acpi_osi= acpi_osi="Windows 2001 SP3"
>
> So, it seems that the BIOS is broken. Is there any way to fix this,
> without resorting to the hackish DMI checks?

Probably not. Have you checked for a newer BIOS version on this machine?

If there isn't one, this machine likely needs the same treatment as a
number of other systems listed in acpi_osi_dmi_table in
drivers/acpi/blacklist.c, which disable reporting Vista support.
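
For reference, each entry in that table amounts to a vendor/product
string match that, when it hits, stops the kernel from answering yes to
the Vista _OSI string. Below is a rough user-space sketch of that idea
only; it is not the kernel code. The struct, the function name and the
vendor/product strings are hypothetical placeholders, since the DMI
strings for this netbook haven't been posted in this thread. The
substring matching is meant to mirror how I understand DMI_MATCH to
behave.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model of a DMI quirk table; NOT kernel code. */
struct osi_quirk {
	const char *sys_vendor;   /* substring expected in the DMI system vendor */
	const char *product_name; /* substring expected in the DMI product name */
};

/*
 * Placeholder entry: the real vendor/product strings for the affected
 * machine are not known from this thread, so these are made up.
 */
static const struct osi_quirk osi_quirks[] = {
	{ "EXAMPLE VENDOR", "Example Netbook 1000" },
};

/*
 * Returns 1 if the machine matches a table entry, meaning the Vista
 * ("Windows 2006") _OSI string should not be reported; 0 otherwise.
 */
static int should_disable_osi_vista(const char *vendor, const char *product)
{
	size_t i;

	if (!vendor || !product)
		return 0;

	for (i = 0; i < sizeof(osi_quirks) / sizeof(osi_quirks[0]); i++) {
		/* Both fields must match, each as a substring. */
		if (strstr(vendor, osi_quirks[i].sys_vendor) &&
		    strstr(product, osi_quirks[i].product_name))
			return 1;
	}
	return 0;
}
```

A real entry would need the actual strings from dmidecode on the
affected machine, and would only be worth adding if a BIOS update
doesn't fix the routing table.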