On 2013-10-12 04:06, Robert Hancock wrote:
> On Fri, Oct 11, 2013 at 10:07 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>> On 2013-10-01 06:25, Robert Hancock wrote:
>>> On Sat, Sep 28, 2013 at 7:21 PM, Robert Hancock <hancockrwd@xxxxxxxxx> wrote:
>>>> On Sat, Sep 28, 2013 at 11:46 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>>>> On 2013-09-28 06:55, Robert Hancock wrote:
>>>>>> On Fri, Sep 27, 2013 at 7:24 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>>>>>> On 2013-09-25 08:31, Robert Hancock wrote:
>>>>>>>> On Sun, Sep 22, 2013 at 1:13 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>>>>>>>> On 2013-09-21 19:04, Robert Hancock wrote:
>>>>>>>>>> On Sat, Sep 21, 2013 at 1:35 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>>>>>>>>>>>>>>>> The following dmesg output repeats in an infinite loop.
>>>>>>>>>>>>>>>>> dmesg:
>>>>>>>>>>>>>>>>> ata3: lost interrupt (Status 0x50)
>>>>>>>>>>>>>>>>> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>>>>>>>>>>>>>>>>> ata3.00: failed command: READ DMA
>>>>>>>>>>>>>>>>> ata3.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in
>>>>>>>>>>>>>>>>>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>>>>>>>>>>>>>>>> ata3.00: status: { DRDY }
>>>>>>>>>>>>>>>>> ata3: soft resetting link
>>>>>>>>>>>>>>>>> ata3.00: configured for UDMA/33 (no error)
>>>>>>>>>>>>>>>>> ata3.00: device reported invalid CHS sector 0
>>>>>>>>>>>>>>>>> ata3: EH complete
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Patch that fixes the infinite loop:
>>>>>>>>>>>>>>>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>> index f9476fb..eeedf80 100644
>>>>>>>>>>>>>>>>> --- a/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>> +++ b/drivers/ata/libata-eh.c
>>>>>>>>>>>>>>>>> @@ -2437,6 +2437,14 @@ static void ata_eh_link_report(struct ata_link *link)
>>>>>>>>>>>>>>>>>                             ehc->i.action, frozen, tries_buf);
>>>>>>>>>>>>>>>>>                 if (desc)
>>>>>>>>>>>>>>>>>                         ata_dev_err(ehc->i.dev, "%s\n", desc);
>>>>>>>>>>>>>>>>> +               ehc->i.dev->exce_cnt ++;
>>>>>>>>>>>>>>>>> +               ata_dev_warn(ehc->i.dev, "Number of exceptions: %d\n", ehc->i.dev->exce_cnt);
>>>>>>>>>>>>>>>>> +               /**
>>>>>>>>>>>>>>>>> +                * The device is failing terribly,
>>>>>>>>>>>>>>>>> +                * disable it to prevent damage.
>>>>>>>>>>>>>>>>> +                */
>>>>>>>>>>>>>>>>> +               if(ehc->i.dev->exce_cnt > 2)
>>>>>>>>>>>>>>>>> +                       ata_dev_disable(ehc->i.dev);
>>>>>>>>>>>>>>>>>         } else {
>>>>>>>>>>>>>>>>>                 ata_link_err(link, "exception Emask 0x%x "
>>>>>>>>>>>>>>>>>                              "SAct 0x%x SErr 0x%x action 0x%x%s%s\n",
>>>>>>>>>>>>>>>>> diff --git a/include/linux/libata.h b/include/linux/libata.h
>>>>>>>>>>>>>>>>> index eae7a05..fa52ee6 100644
>>>>>>>>>>>>>>>>> --- a/include/linux/libata.h
>>>>>>>>>>>>>>>>> +++ b/include/linux/libata.h
>>>>>>>>>>>>>>>>> @@ -660,7 +660,8 @@ struct ata_device {
>>>>>>>>>>>>>>>>>         u8                      devslp_timing[ATA_LOG_DEVSLP_SIZE];
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>         /* error history */
>>>>>>>>>>>>>>>>> -       int                     spdn_cnt;
>>>>>>>>>>>>>>>>> +       int                     spdn_cnt; /* Number of speed_downs */
>>>>>>>>>>>>>>>>> +       int                     exce_cnt; /* Number of exceptions that happened */
>>>>>>>>>>>>>>>>>         /* ering is CLEAR_END, read comment above CLEAR_END */
>>>>>>>>>>>>>>>>>         struct ata_ering        ering;
>>>>>>>>>>>>>>>>>  };
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This doesn't seem like a very good fix. It may prevent the
>>>>>>>>>>>>>>>> apparent infinite loop but will just prevent that device from
>>>>>>>>>>>>>>>> functioning at all.
>>>>>>>>>>>>>>>> It would be better if we could figure out what was actually
>>>>>>>>>>>>>>>> going wrong.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have tested the problem with three different computers, all
>>>>>>>>>>>>>>> switched to legacy/IDE/compatibility mode, and they didn't have
>>>>>>>>>>>>>>> this problem. Of course, they could have been set to AHCI mode,
>>>>>>>>>>>>>>> and there the kernel would boot normally. It feels strange, but
>>>>>>>>>>>>>>> so far I have only been able to reproduce the problem with a
>>>>>>>>>>>>>>> Toshiba MK8052GSX. On the topic of my patch, I still don't see
>>>>>>>>>>>>>>> why a device which fails so terribly that it reports 3
>>>>>>>>>>>>>>> exceptions shouldn't be disabled. As in this case, it could
>>>>>>>>>>>>>>> cause infinite loops.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The problem is that this could happen in some cases when you
>>>>>>>>>>>>>> wouldn't want to disable the device, like an error that just
>>>>>>>>>>>>>> happens sporadically and works on retry, or a device you're
>>>>>>>>>>>>>> trying to recover data from.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> What do you think if I edit the patch so that exce_cnt is reset
>>>>>>>>>>>>> to zero whenever an operation completes successfully? I might as
>>>>>>>>>>>>> well add a module_param which sets the maximum value of exce_cnt,
>>>>>>>>>>>>> with zero as an option to never disable the device. Please don't
>>>>>>>>>>>>> get me wrong, I don't want to force this patch, I just want to
>>>>>>>>>>>>> learn how all this works, and in the process try to make it
>>>>>>>>>>>>> better. :-)
>>>>>>>>>>>>
>>>>>>>>>>>> That would be better, but I think you're still going to have an
>>>>>>>>>>>> issue with what magic number to pick to avoid disabling devices
>>>>>>>>>>>> inappropriately.
>>>>>>>>>>>>
>>>>>>>>>>>> Conceptually, disabling the device doesn't really make sense anyway.
>>>>>>>>>>>> If someone in userspace wants to keep trying to read from that
>>>>>>>>>>>> device, why would you stop them because of some arbitrary judgement?
>>>>>>>>>>>> The kernel itself isn't "locked up" during this process, anything
>>>>>>>>>>>> not blocked on I/O to that device should be able to continue
>>>>>>>>>>>> running, so that process is only hurting itself. If the system fails
>>>>>>>>>>>> to boot from another device due to this, this would likely point out
>>>>>>>>>>>> some kind of problem in userspace or the distro boot process being
>>>>>>>>>>>> overly serialized.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I have been booting with the initramfs from Ubuntu 13.04, and I
>>>>>>>>>>> have also tried to boot with the Ubuntu install CD. Neither could
>>>>>>>>>>> continue the boot process. I'm gonna spend the weekend trying to
>>>>>>>>>>> figure out where and why the interrupts don't happen, whether it is
>>>>>>>>>>> a routing issue or a hardware issue (which I highly doubt, given
>>>>>>>>>>> that Windows XP SP2 was able to boot without errors).
>>>>>>>>>>
>>>>>>>>>> Are you able to get full dmesg output from a boot attempt and the
>>>>>>>>>> contents of /proc/interrupts?
>>>>>>>>>>
>>>>>>>>> As I said before, I am not able to get to the shell without my
>>>>>>>>> 'symptom cure'.
>>>>>>>>> With my patch I get the following dmesg output, with
>>>>>>>>> some of my debug messages turned off:
>>>>>>>>> http://pastebin.com/5eb5G3Dx
>>>>>>>>> /proc/interrupts is here:
>>>>>>>>> http://pastebin.com/84CJey2D
>>>>>>>>> After yesterday's research, I have ended up at ata_piix.c. That file
>>>>>>>>> looks like the real culprit, as my netbook's controller is an Intel
>>>>>>>>> ICH7M one. The values I am getting from the device are very
>>>>>>>>> different from those that are expected.
>>>>>>>>>
>>>>>>>>> Things I have noticed, but ignored in dmesg:
>>>>>>>>> There is a stack dump, because nobody cared about IRQ#20. I have
>>>>>>>>> ignored this because it is the EHCI IRQ, and I suppose it has nothing
>>>>>>>>> to do with ata. The problem is with ata3 or /dev/sdc, while the IRQ
>>>>>>>>> happens with /dev/sda, which works fine.
>>>>>>>>
>>>>>>>> I think it is likely related to the problem. The kernel thinks this
>>>>>>>> controller is on IRQ 16, but apparently something is raising
>>>>>>>> un-acknowledged interrupts on IRQ 20 and nothing is coming in on IRQ
>>>>>>>> 16. It seems quite likely that this is actually the ATA controller.
>>>>>>>>
>>>>>>>> You mentioned that Windows XP was able to work in this mode. I wonder
>>>>>>>> if it was using the IOAPIC, as if not then the IRQ routing is
>>>>>>>> different which might mask the problem. Do you know what IRQ Device
>>>>>>>> Manager reported for this controller in Windows? And was it using any
>>>>>>>> IRQs over 15 (which would indicate the IOAPIC was in use)?
>>>>>>>
>>>>>>> Hmm, according to WinXP's Device Manager, this controller
>>>>>>> listens on IRQ# 20, and therefore it is using the I/O APIC.
>>>>>>> Now one question remains: where is the error that mismaps
>>>>>>> the controller?
>>>>>>> I have created a simple patch which seems to fix this:
>>>>>>> ---
>>>>>>> @@ -1704,6 +1767,8 @@ static int piix_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>>>>>>>         hpriv->map = piix_init_sata_map(pdev, port_info,
>>>>>>>                                         piix_map_db_table[ent->driver_data]);
>>>>>>>
>>>>>>> +       if(pdev->vendor == 0x8086 && pdev->device == 0x27C4)
>>>>>>> +               pdev->irq = 20;
>>>>>>>         rc = ata_pci_bmdma_prepare_host(pdev, ppi, &host);
>>>>>>>         if (rc)
>>>>>>>                 return rc;
>>>>>>>
>>>>>>> However, I am more than sure that this is not the way
>>>>>>> to solve this problem. Do you have any idea where
>>>>>>> the ideal place would be to implement a fix?
>>>>>>> According to the specs of the ICH7M, which is essentially the
>>>>>>> same as the ICH6M, we need to check which interrupt pin
>>>>>>> the SATA controller uses, then check which IRQ line
>>>>>>> is connected to the I/O APIC, and decide the IRQ number
>>>>>>> based on those findings.
>>>>>>>
>>>>>>> Specs of ICH7:
>>>>>>>
>>>>>>> http://www.intel.com/content/dam/doc/datasheet/i-o-controller-hub-7-datasheet.pdf
>>>>>>> Device 31 Interrupt Route Register: Chapter 7.1.46
>>>>>>> Device 31 Interrupt Pin Register: Chapter 7.1.41
>>>>>>>
>>>>>>> The SATA controller is always Device 31.
>>>>>>
>>>>>> It would appear that something is messing up with the ACPI IRQ routing
>>>>>> on this machine that's causing us to think the controller is on the
>>>>>> wrong IRQ. CCing the linux-acpi list to see if anyone has some
>>>>>> additional debugging suggestions. I suspect that dumping the DSDT is
>>>>>> likely the first step though. If you can get IASL installed, you can
>>>>>> do something like:
>>>>>>
>>>>>> cat /sys/firmware/acpi/tables/DSDT > dsdt.aml
>>>>>> iasl -d dsdt.aml
>>>>>>
>>>>>> That should spit out a dsdt.dsl file which would hopefully have the
>>>>>> info needed to figure out what's going on.
>>>>>>
>>>>>
>>>>> Here is the disassembled DSDT table:
>>>>> http://pastebin.com/LWNVht9H
>>>>> The SATA controller is at line 5206.
>>>>> I also disassembled the SSDT, but nothing interesting was there:
>>>>> http://pastebin.com/fus5sxU8
>>>>>
>>>>> I disabled the usage of ACPI for IRQs with acpi=noirq,
>>>>> and it successfully booted up, setting itself to IRQ#3.
>>>>> This makes me think that this is the BIOS's fault.
>>>>> I think it would be possible to create a DMI check
>>>>> and forcibly set the irq to 20 if the DMI matches.
>>>>> Any comments on this?
>>>>
>>>> The BIOS may be doing something funky, but since Windows apparently
>>>> can figure out it's on IRQ 20, Linux presumably should be able to as
>>>> well. DMI checks should be the last resort - Windows almost certainly
>>>> doesn't have any machine-specific logic here, and it's hard to tell
>>>> what other machine models could be affected. With ACPI stuff, we
>>>> generally just need to do the same thing Windows does for things to
>>>> work reliably, and DMI checks are more of a hack workaround than a
>>>> real fix.
>>>>
>>>> I'll try and have a look at the DSDT within the next few days and see
>>>> if I can figure anything out, unless someone beats me to it.
>>>
>>> I haven't gone into too much detail, but one thing I noticed with the
>>> DSDT is that there appear to be some _OSI checks for Windows 2006
>>> (i.e. Vista) that seem to affect various things, including potentially
>>> the PCI IRQ routing table. It's possible that their IRQ routing table
>>> is broken for legacy mode with an ACPI OS supporting Vista (as current
>>> Linux versions do). Could be this slipped through testing if they only
>>> tested AHCI mode with Vista installed.
>>>
>>> You can try booting with the kernel parameters
>>>
>>> acpi_osi=! acpi_osi="Windows 2001 SP3"
>>>
>>> That should make the BIOS think we are Windows XP and bypass the Vista
>>> code path.
>>> If that works, then you might want to check for a BIOS
>>> update on this machine.
>>>
>>
>> First of all, sorry for the late reply. I was kinda busy.
>>
>> I tried what you suggested but unfortunately the problem persists.
>> This makes me believe that Windows XP does have some kind of DMI check here.
>> Of course, while a BIOS update may solve this, I would prefer that Linux
>> also be able to boot with this broken BIOS.
>>
>> If you are certain that WinXP doesn't use DMI checks,
>> it could be that WinXP's driver for the ICH7M's SATA controller applies
>> a quirk and sets that IRQ line to #20.
>
> Can you post the dmesg output from a bootup attempt with those options?
>
> You may also want to try adding just: acpi_osi=!
>

None of the three possible combinations managed to boot.
Here are the dmesg outputs:

Params: acpi_osi="Windows 2001 SP3"
http://pastebin.com/vF3BSuhc

Params: acpi_osi=! acpi_osi="Windows 2001 SP3"
http://pastebin.com/BuUzc3es

Params: acpi_osi=!
http://pastebin.com/u7uRx8Ru

--
Regards,
Levente Kurusa
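
For reference, the counter idea discussed earlier in the thread (reset exce_cnt after a successful operation, with a tunable limit where zero means "never disable") could be sketched in plain userspace C roughly as below. This is only an illustration under assumptions: struct dev_state, max_exceptions, dev_note_exception() and dev_note_success() are hypothetical names, not libata code, and an in-kernel version would presumably hook into libata's error handling and expose the limit via module_param instead.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-device state; NOT the real struct ata_device. */
struct dev_state {
    int exce_cnt;   /* consecutive exceptions since the last success */
    bool disabled;  /* set once the limit has been reached */
};

/*
 * Hypothetical tunable playing the role of the proposed module_param.
 * 0 means "never disable the device". The default of 3 mirrors the
 * original patch, which disabled the device once exce_cnt exceeded 2.
 */
int max_exceptions = 3;

/* Record one exception; returns true if the device is (now) disabled. */
bool dev_note_exception(struct dev_state *dev)
{
    assert(dev);
    dev->exce_cnt++;
    if (max_exceptions > 0 && dev->exce_cnt >= max_exceptions)
        dev->disabled = true;
    return dev->disabled;
}

/* A command completed successfully: forget the earlier exceptions. */
void dev_note_success(struct dev_state *dev)
{
    assert(dev);
    dev->exce_cnt = 0;
}
```

With this shape, a sporadic error that succeeds on retry never accumulates towards the limit, which addresses one of the objections raised above. It does not settle the question of which limit is appropriate, or whether disabling the device is the right policy at all.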