Re: [2.6.18,19] SATA boot problems (ICH6/ICH6W)

Gary Hade <garyhade@xxxxxxxxxx> · Thu, 21 Dec 2006 09:10:35 -0800

On Wed, Dec 20, 2006 at 12:53:57PM +0900, Tejun Heo wrote:
> Howdy,
> 
> Gary Hade wrote:
> > I noticed that Tejun recently provided a "libata: handle 0xff status
> > properly" patch that is now in mainline that improves this code
> >   re: http://marc.theaimsgroup.com/?l=linux-ide&m=116038642105802&w=2
> > but I found that the check still failed but more silently and with no
> > retries.
> >
> > I decided to try increasing the delay that precedes the above
> > check [ msleep(150); ] and found that a change from 150ms to
> > 1000ms caused the problem to disappear.
> 
> Aieeee, 150ms not enough for the device to send the first FIS after SRST?

Yea, it appears so. :)   

GoVault access via 'ahci' also fails in some cable placement
configurations with:
  kernel: scsi1 : ahci
  kernel: ata2: softreset failed (1st FIS failed)
  kernel: ata2: softreset failed, retrying in 5 secs
  kernel: ata2: port is slow to respond, please be patient
  kernel: ata2: port failed to respond (30 secs)
  kernel: ata2: COMRESET failed (device not ready)
  kernel: ata2: hardreset failed, retrying in 5 secs
  kernel: ata2: port is slow to respond, please be patient
  kernel: ata2: port failed to respond (30 secs)
  kdump: kexec: failed to load kdump kernel
  kernel: ata2: COMRESET failed (device not ready)
  kernel: ata2: reset failed, giving up

This problem also disappears after swapping the ports to which
the hard drive and GoVault cables are connected.

The following timeout increase appears to correct the 'ahci' problem:

--- ./linux-2.6.18.i386/drivers/scsi/ahci.c.orig	2006-12-19 09:07:58.000000000 -0800
+++ ./linux-2.6.18.i386/drivers/scsi/ahci.c	2006-12-19 13:30:29.000000000 -0800
@@ -788,7 +788,7 @@ static int ahci_softreset(struct ata_por
 
 	writel(1, port_mmio + PORT_CMD_ISSUE);
 
-	tmp = ata_wait_register(port_mmio + PORT_CMD_ISSUE, 0x1, 0x1, 1, 500);
+	tmp = ata_wait_register(port_mmio + PORT_CMD_ISSUE, 0x1, 0x1, 1, 2500);
 	if (tmp & 0x1) {
 		rc = -EIO;
 		reason = "1st FIS failed";

Timeouts of 1000ms, 1500ms, 1750ms, and 1900ms didn't work.  2000ms
worked, so 2500ms includes some extra margin to be safe.  This
experience seems to be more representative of the 1 to 2 second time
(with RDC present) mentioned by Quantum (see below) than the
600-700ms seen with 'ata_piix'.
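
To illustrate why the timeout value matters here, the following is a
minimal userspace sketch of the polling pattern, not the kernel code
itself.  It assumes the wait loop re-reads the register every
interval until the masked value changes or the timeout expires, and
returns the last value read.  The names fake_port, fake_read,
wait_register_sim and softreset_first_fis_ok are hypothetical, and
the device is modeled as simply clearing bit 0 after a fixed time.

```c
#include <stdint.h>

/* Hypothetical device model: bit 0 of the "register" stays set until
 * ready_ms of simulated time has elapsed, then reads as clear. */
struct fake_port {
	unsigned long now_ms;    /* simulated clock */
	unsigned long ready_ms;  /* time at which the device clears bit 0 */
};

static uint32_t fake_read(const struct fake_port *p)
{
	return p->now_ms >= p->ready_ms ? 0x0 : 0x1;
}

/* Poll until (reg & mask) != val or timeout_ms has elapsed.
 * Returns the last value read, so the caller can test the bit again. */
static uint32_t wait_register_sim(struct fake_port *p, uint32_t mask,
				  uint32_t val, unsigned long interval_ms,
				  unsigned long timeout_ms)
{
	unsigned long deadline = p->now_ms + timeout_ms;
	uint32_t tmp = fake_read(p);

	while ((tmp & mask) == val && p->now_ms < deadline) {
		p->now_ms += interval_ms;  /* stands in for msleep() */
		tmp = fake_read(p);
	}
	return tmp;
}

/* Returns 1 if bit 0 cleared within the timeout, 0 if the
 * "1st FIS failed" path would be taken. */
static int softreset_first_fis_ok(unsigned long ready_ms,
				  unsigned long timeout_ms)
{
	struct fake_port p = { 0, ready_ms };

	return (wait_register_sim(&p, 0x1, 0x1, 1, timeout_ms) & 0x1) == 0;
}
```

With a device that needs an assumed 1950ms, softreset_first_fis_ok()
fails for a 500ms or 1900ms timeout and succeeds for 2000ms or 2500ms,
which is consistent with the results above.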

> 
> > I then replaced the msleep(150); with:
> >     {
> >         int i, ms = 5;
> >         msleep(ms);
> >         ata_port_printk(ap, KERN_INFO, "status @ %d ms: 0x%x\n",
> >                                         ms, ata_check_status(ap));
> >         for (i = 1; i <= 20; i++) {
> >             ms += 50;
> >             msleep(50);
> >             ata_port_printk(ap, KERN_INFO, "status @ %d ms: 0x%x\n",
> >                                             ms, ata_check_status(ap));
> >         }
> >     }
> >
> > Output for two cable placement configurations (0xFF check failure
> > and 0xFF check success) are included below.  Note that there are
> > cable placement configurations for both the hard drive and
> > GoVault where the initial status is 0xff. i.e. both transition
> > from 0xff to 0x7f when BSY bit is cleared but it is taking MUCH
> > longer for the GoVault (600-700ms for GoVault and <5ms for
> > hard drive).  It does not appear that the 0xff starting status
> > is device specific.
> >
> > So, it appears that we have a situation with this SATA controller
> > where a 0xFF status is not an accurate indication that there is
> > no device.
> >
> > Although the 150ms to 1000ms delay increase works for the GoVault
> > device I am not sure if it is the best long term fix for the problem.
> 
> I would be surprised if Kovid's sda not detected case is caused by this.
>  For GoVault (that's SATAPI right?), yeah, maybe.  

Yes, the GoVault is an ATAPI device.

> For an ATA disk, no way (hopefully).

Yes, it is probably true that Kovid got the same errors but for a
different reason.

> 
> Can you consult with quantum about it?  

I checked with Quantum about this and they said:
---
"We confirmed that if there's an RDC present when the soft reset is
 received, then it can take between one and two seconds to complete the
 reset.  Issuing a SET FEATURES command to the RDC is the longest part of
 it.

 Even without an RDC, we've measured time on the order of 170
 milliseconds. "
---

The RDC has been present for almost all of my testing.  Here
are comparison traces with and without the RDC, which definitely
confirm the RDC factor.  They also confirm the time on the order
of 170ms without the RDC that Quantum mentioned.

========
With RDC
========
kernel: ata1: status @ 5 ms: 0xff
kernel: ata1: status @ 55 ms: 0xff
kernel: ata1: status @ 105 ms: 0xff
kernel: ata1: status @ 155 ms: 0xff
kernel: ata1: status @ 205 ms: 0xff
kernel: ata1: status @ 255 ms: 0xff
kernel: ata1: status @ 305 ms: 0xff
kernel: ata1: status @ 355 ms: 0xff
kernel: ata1: status @ 405 ms: 0xff
kernel: ata1: status @ 455 ms: 0xff
kernel: ata1: status @ 505 ms: 0xff
kernel: ata1: status @ 555 ms: 0xff
kernel: ata1: status @ 605 ms: 0xff
kernel: ata1: status @ 655 ms: 0x7f
kernel: ata1: status @ 705 ms: 0x7f
kernel: ata1: status @ 755 ms: 0x7f
kernel: ata1: status @ 805 ms: 0x7f
kernel: ata1: status @ 855 ms: 0x7f
kernel: ata1: status @ 905 ms: 0x7f
kernel: ata1: status @ 955 ms: 0x7f
kernel: ata1: status @ 1005 ms: 0x7f

===========
Without RDC
===========
kernel: ata1: status @ 5 ms: 0xff
kernel: ata1: status @ 55 ms: 0xff
kernel: ata1: status @ 105 ms: 0xff
kernel: ata1: status @ 155 ms: 0xff
kernel: ata1: status @ 205 ms: 0x7f
kernel: ata1: status @ 255 ms: 0x7f
kernel: ata1: status @ 305 ms: 0x7f
kernel: ata1: status @ 355 ms: 0x7f
kernel: ata1: status @ 405 ms: 0x7f
kernel: ata1: status @ 455 ms: 0x7f
kernel: ata1: status @ 505 ms: 0x7f
kernel: ata1: status @ 555 ms: 0x7f
kernel: ata1: status @ 605 ms: 0x7f
kernel: ata1: status @ 655 ms: 0x7f
kernel: ata1: status @ 705 ms: 0x7f
kernel: ata1: status @ 755 ms: 0x7f
kernel: ata1: status @ 805 ms: 0x7f
kernel: ata1: status @ 855 ms: 0x7f
kernel: ata1: status @ 905 ms: 0x7f
kernel: ata1: status @ 955 ms: 0x7f
kernel: ata1: status @ 1005 ms: 0x7f

> If they verify your fix (ie,
> GoVault sometimes takes more than 150ms to transmit the first D2H Reg FIS
> after SRST), I'll push similar patch upstream.

Thanks.  If you think that changes to increase the delays are 
the way to go (at least until we can find a better solution) 
I can provide patches.

> 
> Hmm.. or do we have to wait !BSY here as old IDE did?

Not sure.  I'm fairly new to this stuff.
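
For what it's worth, here is a rough userspace sketch of what waiting
for !BSY with a deadline might look like, as opposed to a single fixed
sleep followed by one status check.  This is only an illustration
under assumptions, not a proposed patch: fake_status and
wait_not_busy_sim are hypothetical names, the device is modeled as
reporting 0xff until it is ready and 0x7f afterwards (as in the traces
above), and the 650ms ready time is an assumed value that falls
between the 605ms and 655ms samples.  BSY is bit 7 (0x80) of the
status register.

```c
#include <stdint.h>

#define ATA_BUSY 0x80  /* BSY is bit 7 of the ATA status register */

/* Hypothetical device model: status reads 0xff until ready_ms,
 * then 0x7f, matching the pattern seen in the traces. */
static uint8_t fake_status(unsigned long now_ms, unsigned long ready_ms)
{
	return now_ms >= ready_ms ? 0x7f : 0xff;
}

/* Poll every poll_ms (must be nonzero) until BSY is clear or
 * timeout_ms is exceeded.  Returns the elapsed time in ms at which
 * BSY was first seen clear, or -1 on timeout. */
static long wait_not_busy_sim(unsigned long ready_ms, unsigned long poll_ms,
			      unsigned long timeout_ms)
{
	unsigned long now;

	for (now = 0; now <= timeout_ms; now += poll_ms) {
		if (!(fake_status(now, ready_ms) & ATA_BUSY))
			return (long)now;
	}
	return -1;
}
```

In this model a 150ms limit gives up on a device that needs 650ms,
while a 1000ms limit returns as soon as BSY clears, so a fast device
does not pay for the longer limit.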

Thanks!

Gary

-- 
Gary Hade
IBM Linux Technology Center
503-578-4503  IBM T/L: 775-4503
garyhade@xxxxxxxxxx
http://www.ibm.com/linux/ltc
