On Sat, Oct 12, 2013 at 3:29 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
> On 2013-10-12 04:06, Robert Hancock wrote:
>> On Fri, Oct 11, 2013 at 10:07 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>> On 2013-10-01 06:25, Robert Hancock wrote:
>>>> On Sat, Sep 28, 2013 at 7:21 PM, Robert Hancock <hancockrwd@xxxxxxxxx> wrote:
>>>>> On Sat, Sep 28, 2013 at 11:46 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>>>>> On 2013-09-28 06:55, Robert Hancock wrote:
>>>>>>
>>>>>>> On Fri, Sep 27, 2013 at 7:24 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> On 2013-09-25 08:31, Robert Hancock wrote:
>>>>>>>>
>>>>>>>>> On Sun, Sep 22, 2013 at 1:13 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>> On 2013-09-21 19:04, Robert Hancock wrote:
>>>>>>>>>>
>>>>>>>>>>> On Sat, Sep 21, 2013 at 1:35 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The following dmesg is stuck in an infinite loop.
>>>>>>>>>>>>>>>>>> dmesg:
>>>>>>>>>>>>>>>>>> ata3: lost interrupt (Status 0x50)
>>>>>>>>>>>>>>>>>> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>>>>>>>>>>>>>>>>>> ata3.00: failed command: READ DMA
>>>>>>>>>>>>>>>>>> ata3.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in
>>>>>>>>>>>>>>>>>>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>>>>>>>>>>>>>>>>> ata3.00: status: { DRDY }
>>>>>>>>>>>>>>>>>> ata3: soft resetting link
>>>>>>>>>>>>>>>>>> ata3.00: configured for UDMA/33 (no error)
>>>>>>>>>>>>>>>>>> ata3.00: device reported invalid CHS sector 0
>>>>>>>>>>>>>>>>>> ata3: EH complete
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Patch that fixes the infinite loop:
>>>>>>>>>>>>>>>>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>>> index f9476fb..eeedf80 100644
>>>>>>>>>>>>>>>>>> --- a/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>>> +++ b/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>>> @@ -2437,6 +2437,14 @@ static void ata_eh_link_report(struct ata_link *link)
>>>>>>>>>>>>>>>>>>                              ehc->i.action, frozen, tries_buf);
>>>>>>>>>>>>>>>>>>                  if (desc)
>>>>>>>>>>>>>>>>>>                          ata_dev_err(ehc->i.dev, "%s\n", desc);
>>>>>>>>>>>>>>>>>> +                ehc->i.dev->exce_cnt ++;
>>>>>>>>>>>>>>>>>> +                ata_dev_warn(ehc->i.dev, "Number of exceptions: %d\n", ehc->i.dev->exce_cnt);
>>>>>>>>>>>>>>>>>> +                /**
>>>>>>>>>>>>>>>>>> +                 * The device is failing terribly,
>>>>>>>>>>>>>>>>>> +                 * disable it to prevent damage.
>>>>>>>>>>>>>>>>>> +                 */
>>>>>>>>>>>>>>>>>> +                if(ehc->i.dev->exce_cnt > 2)
>>>>>>>>>>>>>>>>>> +                        ata_dev_disable(ehc->i.dev);
>>>>>>>>>>>>>>>>>>          } else {
>>>>>>>>>>>>>>>>>>                  ata_link_err(link, "exception Emask 0x%x "
>>>>>>>>>>>>>>>>>>                               "SAct 0x%x SErr 0x%x action 0x%x%s%s\n",
>>>>>>>>>>>>>>>>>> diff --git a/include/linux/libata.h b/include/linux/libata.h
>>>>>>>>>>>>>>>>>> index eae7a05..fa52ee6 100644
>>>>>>>>>>>>>>>>>> --- a/include/linux/libata.h
>>>>>>>>>>>>>>>>>> +++ b/include/linux/libata.h
>>>>>>>>>>>>>>>>>> @@ -660,7 +660,8 @@ struct ata_device {
>>>>>>>>>>>>>>>>>>          u8 devslp_timing[ATA_LOG_DEVSLP_SIZE];
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>          /* error history */
>>>>>>>>>>>>>>>>>> -        int spdn_cnt;
>>>>>>>>>>>>>>>>>> +        int spdn_cnt; /* Number of speed_downs */
>>>>>>>>>>>>>>>>>> +        int exce_cnt; /* Number of exceptions that happenned */
>>>>>>>>>>>>>>>>>>          /* ering is CLEAR_END, read comment above CLEAR_END */
>>>>>>>>>>>>>>>>>>          struct ata_ering ering;
>>>>>>>>>>>>>>>>>>  };
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This doesn't seem like a very good fix. It may prevent the apparent
>>>>>>>>>>>>>>>>> infinite loop but will just prevent that device from functioning at all.
>>>>>>>>>>>>>>>>> It would be better if we could figure out what was actually going wrong.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have tested the problem with three different computers, all switched
>>>>>>>>>>>>>>>> to legacy/IDE/compatibility mode, and they didn't have this problem. Of
>>>>>>>>>>>>>>>> course, they could have been set to AHCI mode, and there the kernel would
>>>>>>>>>>>>>>>> boot normally.
>>>>>>>>>>>>>>>> Feels strange, but so far I was only able to reproduce the
>>>>>>>>>>>>>>>> problem with a Toshiba MK8052GSX. On the topic of my patch, I still don't
>>>>>>>>>>>>>>>> see why a device which fails so terribly that it reports 3 exceptions
>>>>>>>>>>>>>>>> shouldn't be disabled. Like in this case, it could cause infinite loops.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The problem is that this could happen in some cases when you wouldn't
>>>>>>>>>>>>>>> want to disable the device, like an error that just happens
>>>>>>>>>>>>>>> sporadically and works on retry, or a device you're trying to recover
>>>>>>>>>>>>>>> data from.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What do you think if I edit the patch in a way, that when an operation
>>>>>>>>>>>>>> successfully completes, it resets exce_cnt to zero. Might as well add a
>>>>>>>>>>>>>> module_param, which can set the maximum value of exce_cnt, while having
>>>>>>>>>>>>>> zero as an option to never disable the device. Please don't think me
>>>>>>>>>>>>>> wrong, I don't want to force this patch, I just want to learn how all
>>>>>>>>>>>>>> this works, and in the process try to make it better. :-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> That would be better, but I think you're still going to have an issue
>>>>>>>>>>>>> with what magic number to pick to avoid disabling devices
>>>>>>>>>>>>> inappropriately.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Conceptually, disabling the device doesn't really make sense anyway.
>>>>>>>>>>>>> If someone in userspace wants to keep trying to read from that device,
>>>>>>>>>>>>> why would you stop them because of some arbitrary judgement? The
>>>>>>>>>>>>> kernel itself isn't "locked up" during this process, anything not
>>>>>>>>>>>>> blocked on I/O to that device should be able to continue running, so
>>>>>>>>>>>>> that process is only hurting itself. If the system fails to boot from
>>>>>>>>>>>>> another device due to this, this would likely point out some kind of
>>>>>>>>>>>>> problem in userspace or the distro boot process being overly
>>>>>>>>>>>>> serialized.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I have been booting up with the initramfs from ubuntu 13.04,
>>>>>>>>>>>> and I have also tried to boot with the ubuntu install cd. They couldn't
>>>>>>>>>>>> continue the boot process. I'm gonna spend the weekend trying to figure
>>>>>>>>>>>> out where and why the interrupts don't happen. Whether it be a routing
>>>>>>>>>>>> or a hardware issue, which I highly doubt due to the fact that Windows
>>>>>>>>>>>> XP SP2 was able to boot up without errors.
>>>>>>>>>>>
>>>>>>>>>>> Are you able to get out full dmesg output from a boot attempt and the
>>>>>>>>>>> contents of /proc/interrupts?
>>>>>>>>>>>
>>>>>>>>>> As I said before, I am not able to get to the shell, without my 'symptom
>>>>>>>>>> cure'. With my patch I get the following dmesg output, with
>>>>>>>>>> some of my debug messages turned off:
>>>>>>>>>> http://pastebin.com/5eb5G3Dx
>>>>>>>>>> /proc/interrupts is here:
>>>>>>>>>> http://pastebin.com/84CJey2D
>>>>>>>>>> After yesterday's research, I have come to ata_piix.c. That file looks
>>>>>>>>>> like the real culprit, as my netbook's controller is an Intel ICH7M one.
>>>>>>>>>> The values I am getting from the device are very different than those
>>>>>>>>>> that are expected.
>>>>>>>>>>
>>>>>>>>>> Things I have noticed, but ignored in dmesg:
>>>>>>>>>> There is a stack dump, because nobody cared about IRQ#20. I have ignored
>>>>>>>>>> this because it is the EHCI IRQ, and I suppose it has nothing to do with
>>>>>>>>>> ata. The problem is with ata3 or /dev/sdc, while the IRQ happens
>>>>>>>>>> with /dev/sda, which works fine.
>>>>>>>>>
>>>>>>>>> I think it is likely related to the problem. The kernel thinks this
>>>>>>>>> controller is on IRQ 16, but apparently something is raising
>>>>>>>>> un-acknowledged interrupts on IRQ 20 and nothing is coming in on IRQ
>>>>>>>>> 16. It seems quite likely that this is actually the ATA controller.
>>>>>>>>>
>>>>>>>>> You mentioned that Windows XP was able to work in this mode. I wonder
>>>>>>>>> if it was using the IOAPIC, as if not then the IRQ routing is
>>>>>>>>> different which might mask the problem. Do you know what IRQ Device
>>>>>>>>> Manager reported for this controller in Windows? And was it using any
>>>>>>>>> IRQs over 15 (which would indicate the IOAPIC was in use)?
>>>>>>>>
>>>>>>>> Hmm, according to WinXP's Device Manager for this controller,
>>>>>>>> it listens to IRQ# 20, and therefore it is using the I/O APIC.
>>>>>>>> Now, one question remains: where is the error that mismaps the
>>>>>>>> controller?
>>>>>>>> I have created a simple patch which seems to fix this:
>>>>>>>> ---
>>>>>>>> @@ -1704,6 +1767,8 @@ static int piix_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>>>>>>>>          hpriv->map = piix_init_sata_map(pdev, port_info,
>>>>>>>>                                          piix_map_db_table[ent->driver_data]);
>>>>>>>>
>>>>>>>> +        if(pdev->vendor == 0x8086 && pdev->device == 0x27C4)
>>>>>>>> +                pdev->irq = 20;
>>>>>>>>          rc = ata_pci_bmdma_prepare_host(pdev, ppi, &host);
>>>>>>>>          if (rc)
>>>>>>>>                  return rc;
>>>>>>>>
>>>>>>>> However, I am more than sure that this is not the way
>>>>>>>> to solve this problem.
>>>>>>>> Do you have any idea on where
>>>>>>>> the ideal place would be to implement a fix?
>>>>>>>> According to specs of ICH7M, which is essentially the
>>>>>>>> same as ICH6M, we need to check on what interrupt pin
>>>>>>>> is the SATA controller, and after that check which IRQ line
>>>>>>>> is connected to the I/O APIC and decide the IRQ's number
>>>>>>>> on those findings.
>>>>>>>>
>>>>>>>> Specs of ICH7:
>>>>>>>>
>>>>>>>> http://www.intel.com/content/dam/doc/datasheet/i-o-controller-hub-7-datasheet.pdf
>>>>>>>> Device 31 Interrupt Route Register: Chapter 7.1.46
>>>>>>>> Device 31 Interrupt Pin Register: Chapter 7.1.41
>>>>>>>>
>>>>>>>> The SATA controller is always Device 31.
>>>>>>>
>>>>>>> It would appear that something is messing up with the ACPI IRQ routing
>>>>>>> on this machine that's causing us to think the controller is on the
>>>>>>> wrong IRQ. CCing the linux-acpi list to see if anyone has some
>>>>>>> additional debugging suggestions. I suspect that dumping the DSDT is
>>>>>>> likely the first step though. If you can get IASL installed, you can
>>>>>>> do something like:
>>>>>>>
>>>>>>> cat /sys/firmware/acpi/tables/DSDT > dsdt.aml
>>>>>>> iasl -d dsdt.aml
>>>>>>>
>>>>>>> That should spit out a dsdt.dsl file which would hopefully have the
>>>>>>> info needed to figure out what's going on.
>>>>>>>
>>>>>>
>>>>>> Here is the disassembled DSDT table:
>>>>>> http://pastebin.com/LWNVht9H
>>>>>> The SATA controller is at line 5206.
>>>>>> I also disassembled the SSDT, but nothing interesting was there:
>>>>>> http://pastebin.com/fus5sxU8
>>>>>>
>>>>>> I disabled the usage of ACPI for IRQs with acpi=noirq,
>>>>>> and it successfully booted up setting itself to IRQ#3.
>>>>>> This makes me think that this is the BIOS's fault.
>>>>>> I think it would be possible to create a DMI check
>>>>>> and forcibly set the irq to 20 if the DMI matches.
>>>>>> Any comments on this?
>>>>>
>>>>> The BIOS may be doing something funky, but since Windows apparently
>>>>> can figure out it's on IRQ 20, Linux presumably should be able to as
>>>>> well. DMI checks should be the last resort - Windows almost certainly
>>>>> doesn't have any machine-specific logic here, and it's hard to tell
>>>>> what other machine models could be affected. With ACPI stuff, we
>>>>> generally just need to do the same thing Windows does for things to
>>>>> work reliably, and DMI checks are more of a hack workaround than a
>>>>> real fix.
>>>>>
>>>>> I'll try and have a look at the DSDT within the next few days and see
>>>>> if I can figure anything out, unless someone beats me to it.
>>>>
>>>> I haven't gone into too much detail, but one thing I noticed with the
>>>> DSDT is that there appear to be some _OSI checks for Windows 2006
>>>> (i.e. Vista) that seem to affect various things, including potentially
>>>> the PCI IRQ routing table. It's possible that their IRQ routing table
>>>> is broken for legacy mode with an ACPI OS supporting Vista (as current
>>>> Linux versions do). Could be this slipped through testing if they only
>>>> tested AHCI mode with Vista installed.
>>>>
>>>> You can try booting with the kernel parameters
>>>>
>>>> acpi_osi=! acpi_osi="Windows 2001 SP3"
>>>>
>>>> That should make the BIOS think we are Windows XP and bypass the Vista
>>>> code path. If that works, then you might want to check for a BIOS
>>>> update on this machine.
>>>>
>>>
>>> First of all, sorry for the late reply. I was kinda busy.
>>>
>>> I tried what you suggested but unfortunately the problem persists.
>>> This makes me believe that Windows XP does have some kind of DMI check here.
>>> Of course, while a BIOS update may solve this, I would prefer that Linux
>>> should also be able to boot up with this broken BIOS as well.
>>>
>>> If you are certain that WinXP doesn't use DMI checks,
>>> it could be that WinXP's driver of ICH7M's SATA controller applies
>>> a quirk and sets that irq line to #20.
>>
>> Can you post the dmesg output from a bootup attempt with those options?
>>
>> You may also want to try adding just: acpi_osi=!
>>
>
> None of the 3 possible combinations managed to boot.
>
> Here are a couple of dmesgs:
>
> Params: acpi_osi="Windows 2001 SP3"
> http://pastebin.com/vF3BSuhc
>
> Params: acpi_osi=! acpi_osi="Windows 2001 SP3"
> http://pastebin.com/BuUzc3es
>
> Params: acpi_osi=!
> http://pastebin.com/u7uRx8Ru

I'm not sure the option is actually taking effect properly. There
should be a message "Disabled all _OSI OS vendors" that shows up in
dmesg with the ! option. Can you try:

acpi_osi="!" acpi_osi="Windows 2001 SP3"

(with the quotes around the ! character).
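
To check whether the "!" was honoured on a given boot, something along
these lines may help. This is only a rough sketch: osi_vendors_disabled
is a made-up helper name, and it assumes the boot log can be piped in on
stdin (from dmesg or a saved copy). It looks for nothing more than the
message text mentioned above.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this sketch): reads a kernel log
# on stdin and reports whether the "Disabled all _OSI OS vendors" message
# is present, i.e. whether the "!" form of acpi_osi= appears to have been
# honoured.
osi_vendors_disabled() {
    if grep -q 'Disabled all _OSI OS vendors'; then
        echo "acpi_osi=! took effect"
    else
        echo "acpi_osi=! did not take effect"
        return 1
    fi
}
```

For example: dmesg | osi_vendors_disabled. If the message is missing,
comparing against the output of cat /proc/cmdline should show whether
the boot loader passed the "!" through unchanged.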