On Sun, Oct 13, 2013 at 6:02 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
> On 2013-10-13 07:57, Robert Hancock wrote:
>> On Sat, Oct 12, 2013 at 3:29 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>> On 2013-10-12 04:06, Robert Hancock wrote:
>>>> On Fri, Oct 11, 2013 at 10:07 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>>>> On 2013-10-01 06:25, Robert Hancock wrote:
>>>>>> On Sat, Sep 28, 2013 at 7:21 PM, Robert Hancock <hancockrwd@xxxxxxxxx> wrote:
>>>>>>> On Sat, Sep 28, 2013 at 11:46 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>>>>>>> On 2013-09-28 06:55, Robert Hancock wrote:
>>>>>>>>> On Fri, Sep 27, 2013 at 7:24 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>>>>>>>>> On 2013-09-25 08:31, Robert Hancock wrote:
>>>>>>>>>>> On Sun, Sep 22, 2013 at 1:13 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>>>>>>>>>>> On 2013-09-21 19:04, Robert Hancock wrote:
>>>>>>>>>>>>> On Sat, Sep 21, 2013 at 1:35 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The following dmesg is stuck in an infinite loop.
>>>>>>>>>>>>>>>>>>>> dmesg:
>>>>>>>>>>>>>>>>>>>> ata3: lost interrupt (Status 0x50)
>>>>>>>>>>>>>>>>>>>> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>>>>>>>>>>>>>>>>>>>> ata3.00: failed command: READ DMA
>>>>>>>>>>>>>>>>>>>> ata3.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in
>>>>>>>>>>>>>>>>>>>>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>>>>>>>>>>>>>>>>>>> ata3.00: status: { DRDY }
>>>>>>>>>>>>>>>>>>>> ata3: soft resetting link
>>>>>>>>>>>>>>>>>>>> ata3.00: configured for UDMA/33 (no error)
>>>>>>>>>>>>>>>>>>>> ata3.00: device reported invalid CHS sector 0
>>>>>>>>>>>>>>>>>>>> ata3: EH complete
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Patch that fixes the infinite loop:
>>>>>>>>>>>>>>>>>>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>>>>> index f9476fb..eeedf80 100644
>>>>>>>>>>>>>>>>>>>> --- a/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>>>>> +++ b/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>>>>> @@ -2437,6 +2437,14 @@ static void ata_eh_link_report(struct ata_link *link)
>>>>>>>>>>>>>>>>>>>>                             ehc->i.action, frozen, tries_buf);
>>>>>>>>>>>>>>>>>>>>                 if (desc)
>>>>>>>>>>>>>>>>>>>>                         ata_dev_err(ehc->i.dev, "%s\n", desc);
>>>>>>>>>>>>>>>>>>>> +               ehc->i.dev->exce_cnt ++;
>>>>>>>>>>>>>>>>>>>> +               ata_dev_warn(ehc->i.dev, "Number of exceptions: %d\n", ehc->i.dev->exce_cnt);
>>>>>>>>>>>>>>>>>>>> +               /**
>>>>>>>>>>>>>>>>>>>> +                * The device is failing terribly,
>>>>>>>>>>>>>>>>>>>> +                * disable it to prevent damage.
>>>>>>>>>>>>>>>>>>>> +                */
>>>>>>>>>>>>>>>>>>>> +               if(ehc->i.dev->exce_cnt > 2)
>>>>>>>>>>>>>>>>>>>> +                       ata_dev_disable(ehc->i.dev);
>>>>>>>>>>>>>>>>>>>>         } else {
>>>>>>>>>>>>>>>>>>>>                 ata_link_err(link, "exception Emask 0x%x "
>>>>>>>>>>>>>>>>>>>>                              "SAct 0x%x SErr 0x%x action 0x%x%s%s\n",
>>>>>>>>>>>>>>>>>>>> diff --git a/include/linux/libata.h b/include/linux/libata.h
>>>>>>>>>>>>>>>>>>>> index eae7a05..fa52ee6 100644
>>>>>>>>>>>>>>>>>>>> --- a/include/linux/libata.h
>>>>>>>>>>>>>>>>>>>> +++ b/include/linux/libata.h
>>>>>>>>>>>>>>>>>>>> @@ -660,7 +660,8 @@ struct ata_device {
>>>>>>>>>>>>>>>>>>>>         u8                      devslp_timing[ATA_LOG_DEVSLP_SIZE];
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>         /* error history */
>>>>>>>>>>>>>>>>>>>> -       int                     spdn_cnt;
>>>>>>>>>>>>>>>>>>>> +       int                     spdn_cnt;       /* Number of speed_downs */
>>>>>>>>>>>>>>>>>>>> +       int                     exce_cnt;       /* Number of exceptions that happened */
>>>>>>>>>>>>>>>>>>>>         /* ering is CLEAR_END, read comment above CLEAR_END */
>>>>>>>>>>>>>>>>>>>>         struct ata_ering        ering;
>>>>>>>>>>>>>>>>>>>> };
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This doesn't seem like a very good fix. It may prevent the apparent
>>>>>>>>>>>>>>>>>>> infinite loop but will just prevent that device from functioning at all.
>>>>>>>>>>>>>>>>>>> It would be better if we could figure out what was actually going wrong.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have tested the problem with three different computers, all switched
>>>>>>>>>>>>>>>>>> to legacy/IDE/compatibility mode, and they didn't have this problem.
>>>>>>>>>>>>>>>>>> Of course, they could have been set to AHCI mode, and there the kernel
>>>>>>>>>>>>>>>>>> would boot normally. Feels strange, but so far I was only able to
>>>>>>>>>>>>>>>>>> reproduce the problem with a Toshiba MK8052GSX. On the topic of my
>>>>>>>>>>>>>>>>>> patch, I still don't see why a device which fails so terribly that it
>>>>>>>>>>>>>>>>>> reports 3 exceptions shouldn't be disabled. Like in this case, it could
>>>>>>>>>>>>>>>>>> cause infinite loops.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The problem is that this could happen in some cases when you wouldn't
>>>>>>>>>>>>>>>>> want to disable the device, like an error that just happens
>>>>>>>>>>>>>>>>> sporadically and works on retry, or a device you're trying to recover
>>>>>>>>>>>>>>>>> data from.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What do you think if I edit the patch so that when an operation
>>>>>>>>>>>>>>>> successfully completes, it resets exce_cnt to zero? I might as well add
>>>>>>>>>>>>>>>> a module_param which sets the maximum value of exce_cnt, with zero
>>>>>>>>>>>>>>>> as an option to never disable the device. Please don't get me wrong,
>>>>>>>>>>>>>>>> I don't want to force this patch, I just want to learn how all this
>>>>>>>>>>>>>>>> works, and in the process try to make it better.
>>>>>>>>>>>>>>>> :-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That would be better, but I think you're still going to have an issue
>>>>>>>>>>>>>>> with what magic number to pick to avoid disabling devices
>>>>>>>>>>>>>>> inappropriately.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Conceptually, disabling the device doesn't really make sense anyway.
>>>>>>>>>>>>>>> If someone in userspace wants to keep trying to read from that device,
>>>>>>>>>>>>>>> why would you stop them because of some arbitrary judgement? The
>>>>>>>>>>>>>>> kernel itself isn't "locked up" during this process, anything not
>>>>>>>>>>>>>>> blocked on I/O to that device should be able to continue running, so
>>>>>>>>>>>>>>> that process is only hurting itself. If the system fails to boot from
>>>>>>>>>>>>>>> another device due to this, this would likely point out some kind of
>>>>>>>>>>>>>>> problem in userspace or the distro boot process being overly
>>>>>>>>>>>>>>> serialized.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have been booting up with the initramfs from ubuntu 13.04,
>>>>>>>>>>>>>> and I have also tried to boot with the ubuntu install cd. They couldn't
>>>>>>>>>>>>>> continue the boot process. I'm gonna spend the weekend trying to figure
>>>>>>>>>>>>>> out where and why the interrupts don't happen. Whether it be a routing
>>>>>>>>>>>>>> or a hardware issue, which I highly doubt due to the fact that Windows
>>>>>>>>>>>>>> XP SP2 was able to boot up without errors.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are you able to get out full dmesg output from a boot attempt and the
>>>>>>>>>>>>> contents of /proc/interrupts?
>>>>>>>>>>>>>
>>>>>>>>>>>> As I said before, I am not able to get to the shell, without my
>>>>>>>>>>>> 'symptom cure'.
>>>>>>>>>>>> With my patch I get the following dmesg output, with
>>>>>>>>>>>> some of my debug messages turned off:
>>>>>>>>>>>> http://pastebin.com/5eb5G3Dx
>>>>>>>>>>>> /proc/interrupts is here:
>>>>>>>>>>>> http://pastebin.com/84CJey2D
>>>>>>>>>>>> After yesterday's research, I have come to ata_piix.c. That file looks
>>>>>>>>>>>> like the real culprit, as my netbook's controller is an Intel ICH7M one.
>>>>>>>>>>>> The values I am getting from the device are very different than those
>>>>>>>>>>>> that are expected.
>>>>>>>>>>>>
>>>>>>>>>>>> Things I have noticed, but ignored in dmesg:
>>>>>>>>>>>> There is a stack dump, because nobody cared about IRQ#20. I have
>>>>>>>>>>>> ignored this because it is the EHCI IRQ, and I suppose it has nothing
>>>>>>>>>>>> to do with ata. The problem is with ata3 or /dev/sdc, while the IRQ
>>>>>>>>>>>> happens with /dev/sda, which works fine.
>>>>>>>>>>>
>>>>>>>>>>> I think it is likely related to the problem. The kernel thinks this
>>>>>>>>>>> controller is on IRQ 16, but apparently something is raising
>>>>>>>>>>> un-acknowledged interrupts on IRQ 20 and nothing is coming in on IRQ
>>>>>>>>>>> 16. It seems quite likely that this is actually the ATA controller.
>>>>>>>>>>>
>>>>>>>>>>> You mentioned that Windows XP was able to work in this mode. I wonder
>>>>>>>>>>> if it was using the IOAPIC, as if not then the IRQ routing is
>>>>>>>>>>> different which might mask the problem. Do you know what IRQ Device
>>>>>>>>>>> Manager reported for this controller in Windows? And was it using any
>>>>>>>>>>> IRQs over 15 (which would indicate the IOAPIC was in use)?
>>>>>>>>>>
>>>>>>>>>> Hmm, according to WinXP's Device Manager, this controller
>>>>>>>>>> listens on IRQ#20, and therefore it is using the I/O APIC.
>>>>>>>>>> Now, one question remains: where is the error that mismaps
>>>>>>>>>> the controller?
>>>>>>>>>> I have created a simple patch which seems to fix this:
>>>>>>>>>> ---
>>>>>>>>>> @@ -1704,6 +1767,8 @@ static int piix_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>>>>>>>>>>         hpriv->map = piix_init_sata_map(pdev, port_info,
>>>>>>>>>>                                         piix_map_db_table[ent->driver_data]);
>>>>>>>>>>
>>>>>>>>>> +       if(pdev->vendor == 0x8086 && pdev->device == 0x27C4)
>>>>>>>>>> +               pdev->irq = 20;
>>>>>>>>>>         rc = ata_pci_bmdma_prepare_host(pdev, ppi, &host);
>>>>>>>>>>         if (rc)
>>>>>>>>>>                 return rc;
>>>>>>>>>>
>>>>>>>>>> However, I am more than sure that this is not the way
>>>>>>>>>> to solve this problem. Do you have any idea on where
>>>>>>>>>> the ideal place would be to implement a fix?
>>>>>>>>>> According to specs of ICH7M, which is essentially the
>>>>>>>>>> same as ICH6M, we need to check on what interrupt pin
>>>>>>>>>> is the SATA controller, and after that check which IRQ line
>>>>>>>>>> is connected to the I/O APIC and decide the IRQ's number
>>>>>>>>>> on those findings.
>>>>>>>>>>
>>>>>>>>>> Specs of ICH7:
>>>>>>>>>> http://www.intel.com/content/dam/doc/datasheet/i-o-controller-hub-7-datasheet.pdf
>>>>>>>>>> Device 31 Interrupt Route Register: Chapter 7.1.46
>>>>>>>>>> Device 31 Interrupt Pin Register: Chapter 7.1.41
>>>>>>>>>>
>>>>>>>>>> The SATA controller is always Device 31.
>>>>>>>>>
>>>>>>>>> It would appear that something is messing up with the ACPI IRQ routing
>>>>>>>>> on this machine that's causing us to think the controller is on the
>>>>>>>>> wrong IRQ. CCing the linux-acpi list to see if anyone has some
>>>>>>>>> additional debugging suggestions. I suspect that dumping the DSDT is
>>>>>>>>> likely the first step though.
>>>>>>>>> If you can get IASL installed, you can
>>>>>>>>> do something like:
>>>>>>>>>
>>>>>>>>> cat /sys/firmware/acpi/tables/DSDT > dsdt.aml
>>>>>>>>> iasl -d dsdt.aml
>>>>>>>>>
>>>>>>>>> That should spit out a dsdt.dsl file which would hopefully have the
>>>>>>>>> info needed to figure out what's going on.
>>>>>>>>>
>>>>>>>> Here is the disassembled DSDT table:
>>>>>>>> http://pastebin.com/LWNVht9H
>>>>>>>> The SATA controller is at line 5206.
>>>>>>>> I also disassembled the SSDT, but nothing interesting was there:
>>>>>>>> http://pastebin.com/fus5sxU8
>>>>>>>>
>>>>>>>> I disabled the usage of ACPI for IRQs with acpi=noirq,
>>>>>>>> and it successfully booted up setting itself to IRQ#3.
>>>>>>>> This makes me think that this is the BIOS's fault.
>>>>>>>> I think it would be possible to create a DMI check
>>>>>>>> and forcibly set the irq to 20 if the DMI matches.
>>>>>>>> Any comments on this?
>>>>>>>
>>>>>>> The BIOS may be doing something funky, but since Windows apparently
>>>>>>> can figure out it's on IRQ 20, Linux presumably should be able to as
>>>>>>> well. DMI checks should be the last resort - Windows almost certainly
>>>>>>> doesn't have any machine-specific logic here, and it's hard to tell
>>>>>>> what other machine models could be affected. With ACPI stuff, we
>>>>>>> generally just need to do the same thing Windows does for things to
>>>>>>> work reliably, and DMI checks are more of a hack workaround than a
>>>>>>> real fix.
>>>>>>>
>>>>>>> I'll try and have a look at the DSDT within the next few days and see
>>>>>>> if I can figure anything out, unless someone beats me to it.
>>>>>>
>>>>>> I haven't gone into too much detail, but one thing I noticed with the
>>>>>> DSDT is that there appear to be some _OSI checks for Windows 2006
>>>>>> (i.e. Vista) that seem to affect various things, including potentially
>>>>>> the PCI IRQ routing table.
>>>>>> It's possible that their IRQ routing table
>>>>>> is broken for legacy mode with an ACPI OS supporting Vista (as current
>>>>>> Linux versions do). Could be this slipped through testing if they only
>>>>>> tested AHCI mode with Vista installed.
>>>>>>
>>>>>> You can try booting with the kernel parameters
>>>>>>
>>>>>> acpi_osi=! acpi_osi="Windows 2001 SP3"
>>>>>>
>>>>>> That should make the BIOS think we are Windows XP and bypass the Vista
>>>>>> code path. If that works, then you might want to check for a BIOS
>>>>>> update on this machine.
>>>>>>
>>>>> First of all, sorry for the late reply. I was kinda busy.
>>>>>
>>>>> I tried what you suggested but unfortunately the problem persists.
>>>>> This makes me believe that Windows XP does have some kind of DMI check here.
>>>>> Of course, while a BIOS update may solve this, I would prefer that Linux
>>>>> be able to boot up with this broken BIOS as well.
>>>>>
>>>>> If you are certain that WinXP doesn't use DMI checks,
>>>>> it could be that WinXP's driver for ICH7M's SATA controller applies
>>>>> a quirk and sets that irq line to #20.
>>>>
>>>> Can you post the dmesg output from a bootup attempt with those options?
>>>>
>>>> You may also want to try adding just: acpi_osi=!
>>>>
>>> None of the 3 possible combinations managed to boot.
>>>
>>> Here are a couple of dmesgs:
>>>
>>> Params: acpi_osi="Windows 2001 SP3"
>>> http://pastebin.com/vF3BSuhc
>>>
>>> Params: acpi_osi=! acpi_osi="Windows 2001 SP3"
>>> http://pastebin.com/BuUzc3es
>>>
>>> Params: acpi_osi=!
>>> http://pastebin.com/u7uRx8Ru
>>
>> I'm not sure the option is actually taking effect properly. There
>> should be a message "Disabled all _OSI OS vendors" that shows up in
>> dmesg with the ! option. Can you try:
>>
>> acpi_osi="!" acpi_osi="Windows 2001 SP3"
>>
>> (with the quotes around the ! character).
>>
> The following command line worked:
> acpi_osi= acpi_osi="Windows 2001 SP3"
>
> So, it seems that the BIOS is broken.
> Is there any way to fix this,
> without resorting to the hackish DMI checks?

Probably not really. Have you checked for a newer BIOS version on this
machine? If not, this is likely similar to a number of other systems
listed in acpi_osi_dmi_table in drivers/acpi/blacklist.c which need to
disable reporting Vista support.
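
To illustrate the kind of DMI-keyed quirk mentioned above, here is a minimal
userspace sketch in C. It only models the idea (compare the machine's DMI
vendor and product strings against a table and, on a match, stop reporting
Vista support through _OSI); it is not the code from
drivers/acpi/blacklist.c. The struct, the function names and the
"Example Vendor" / "Example Netbook" strings are hypothetical placeholders,
since the affected machine's DMI strings are not given in this thread, and
the substring matching is an assumption about how such tables are usually
matched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the DMI identity of a machine. */
struct machine_id {
	const char *sys_vendor;
	const char *product_name;
};

/*
 * Hypothetical quirk table: machines whose firmware misbehaves when the
 * OS claims Vista ("Windows 2006") support. Placeholder strings only.
 */
static const struct machine_id osi_vista_quirks[] = {
	{ "Example Vendor", "Example Netbook" },
};

/* Substring match on both fields (assumed matching rule for this sketch). */
static bool machine_matches(const struct machine_id *entry,
			    const struct machine_id *sys)
{
	return strstr(sys->sys_vendor, entry->sys_vendor) != NULL &&
	       strstr(sys->product_name, entry->product_name) != NULL;
}

/* Returns true if Vista _OSI reporting should be disabled for this machine. */
static bool should_disable_osi_vista(const struct machine_id *sys)
{
	size_t i;
	size_t n = sizeof(osi_vista_quirks) / sizeof(osi_vista_quirks[0]);

	assert(sys != NULL);
	assert(sys->sys_vendor != NULL && sys->product_name != NULL);

	for (i = 0; i < n; i++) {
		if (machine_matches(&osi_vista_quirks[i], sys))
			return true;
	}
	return false;
}
```

A caller would fill in a struct machine_id from the DMI data and, if
should_disable_osi_vista() returns true, apply the same effect as the
acpi_osi= command line that worked above. The drawback is the one already
discussed: every affected model has to be listed by hand.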