On Sat, Sep 21, 2013 at 1:35 AM, Levente Kurusa <levex@xxxxxxxxx> wrote:
>>>>>>> The following dmesg is stuck in an infinite loop.
>>>>>>> dmesg:
>>>>>>> ata3: lost interrupt (Status 0x50)
>>>>>>> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>>>>>>> ata3.00: failed command: READ DMA
>>>>>>> ata3.00: cmd c8/00:08:00:00:00/00:00:00:00:00/e0 tag 0 dma 4096 in
>>>>>>>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>>>>>> ata3.00: status: { DRDY }
>>>>>>> ata3: soft resetting link
>>>>>>> ata3.00: configured for UDMA/33 (no error)
>>>>>>> ata3.00: device reported invalid CHS sector 0
>>>>>>> ata3: EH complete
>>>>>>>
>>>>>>> Patch that fixes the infinite loop:
>>>>>>> diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
>>>>>>> index f9476fb..eeedf80 100644
>>>>>>> --- a/drivers/ata/libata-eh.c
>>>>>>> +++ b/drivers/ata/libata-eh.c
>>>>>>> @@ -2437,6 +2437,14 @@ static void ata_eh_link_report(struct ata_link *link)
>>>>>>> ehc->i.action, frozen, tries_buf);
>>>>>>> if (desc)
>>>>>>> ata_dev_err(ehc->i.dev, "%s\n", desc);
>>>>>>> + ehc->i.dev->exce_cnt ++;
>>>>>>> + ata_dev_warn(ehc->i.dev, "Number of exceptions: %d\n", ehc->i.dev->exce_cnt);
>>>>>>> + /**
>>>>>>> + * The device is failing terribly,
>>>>>>> + * disable it to prevent damage.
>>>>>>> + */
>>>>>>> + if(ehc->i.dev->exce_cnt > 2)
>>>>>>> + ata_dev_disable(ehc->i.dev);
>>>>>>> } else {
>>>>>>> ata_link_err(link, "exception Emask 0x%x "
>>>>>>> "SAct 0x%x SErr 0x%x action 0x%x%s%s\n",
>>>>>>> diff --git a/include/linux/libata.h b/include/linux/libata.h
>>>>>>> index eae7a05..fa52ee6 100644
>>>>>>> --- a/include/linux/libata.h
>>>>>>> +++ b/include/linux/libata.h
>>>>>>> @@ -660,7 +660,8 @@ struct ata_device {
>>>>>>> u8 devslp_timing[ATA_LOG_DEVSLP_SIZE];
>>>>>>>
>>>>>>> /* error history */
>>>>>>> - int spdn_cnt;
>>>>>>> + int spdn_cnt; /* Number of speed_downs */
>>>>>>> + int exce_cnt; /* Number of exceptions that happenned */
>>>>>>> /* ering is CLEAR_END, read comment above CLEAR_END */
>>>>>>> struct ata_ering ering;
>>>>>>> };
>>>>>>>
>>>>>>
>>>>>> This doesn't seem like a very good fix. It may prevent the apparent
>>>>>> infinite loop but will just prevent that device from functioning at
>>>>>> all. It would be better if we could figure out what was actually
>>>>>> going wrong.
>>>>>>
>>>>>>
>>>>> I have tested the problem with three different computers, all switched
>>>>> to legacy/IDE/compatibility mode, and they didn't have this problem. Of
>>>>> course, they could have been set to AHCI mode, and there the kernel
>>>>> would boot normally. Feels strange, but so far I was only able to
>>>>> reproduce the problem with a Toshiba MK8052GSX. On the topic of my
>>>>> patch, I still don't see why a device which fails so terribly that it
>>>>> reports 3 exceptions shouldn't be disabled. Like in this case, it could
>>>>> cause infinite loops.
>>>>
>>>>
>>>>
>>>> The problem is that this could happen in some cases when you wouldn't
>>>> want to disable the device, like an error that just happens
>>>> sporadically and works on retry, or a device you're trying to recover
>>>> data from.
>>>>
>>> What do you think if I edit the patch in a way, that when an operation
>>> successfully completes, it resets exce_cnt to zero. Might as well add a
>>> module_param, which can set the maximum value of exce_cnt, while having
>>> zero as an option to never disable the device. Please don't think me
>>> wrong, I don't want to force this patch, I just want to learn how all
>>> this works, and in the process try to make it better. :-)
>>
>>
>> That would be better, but I think you're still going to have an issue
>> with what magic number to pick to avoid disabling devices
>> inappropriately.
>>
>> Conceptually, disabling the device doesn't really make sense anyway.
>> If someone in userspace wants to keep trying to read from that device,
>> why would you stop them because of some arbitrary judgement? The
>> kernel itself isn't "locked up" during this process, anything not
>> blocked on I/O to that device should be able to continue running, so
>> that process is only hurting itself. If the system fails to boot from
>> another device due to this, this would likely point out some kind of
>> problem in userspace or the distro boot process being overly
>> serialized.
>>
>
> I have been booting up with the initramfs from ubuntu 13.04,
> and I have also tried to boot with the ubuntu install cd. They couldn't
> continue the boot process. I'm gonna spend the weekend trying to figure
> out where and why the interrupts don't happen. Whether it be a routing
> or a hardware issue, which I highly doubt due to the fact that Windows
> XP SP2 was able to boot up without errors.

Are you able to get out full dmesg output from a boot attempt and the
contents of /proc/interrupts?
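
For reference, the revised policy floated earlier in the thread (reset the
counter whenever an operation completes, and take the limit from a
parameter where zero means "never disable") could be modelled in userspace
roughly as below. This is only a sketch under stated assumptions:
struct dev_model, report_exception(), report_success() and max_exceptions
are made-up names for illustration, not libata API, and the default of 3
simply mirrors the "exce_cnt > 2" test in the quoted patch. It does not
address the objection that any fixed limit is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of one device; not a libata structure. */
struct dev_model {
	int exce_cnt;	/* consecutive exceptions since the last success */
	bool disabled;	/* set once the limit has been reached */
};

/*
 * Stand-in for the proposed module parameter (assumption, not an
 * existing libata option). 0 means "never disable the device".
 */
int max_exceptions = 3;

/* Called once per reported exception. */
void report_exception(struct dev_model *dev)
{
	assert(dev != NULL);
	if (dev->disabled)
		return;
	dev->exce_cnt++;
	if (max_exceptions > 0 && dev->exce_cnt >= max_exceptions)
		dev->disabled = true;
}

/* Called when an operation completes successfully. */
void report_success(struct dev_model *dev)
{
	assert(dev != NULL);
	dev->exce_cnt = 0;
}
```

With the default limit, two failures followed by a success leave the
device enabled with the counter back at zero; three failures in a row
disable it. With max_exceptions set to 0 the device is never disabled,
however many exceptions are reported.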