Re: [RFT] hpt366: reset DMA state machine on timeouts

Sergei Shtylyov <sshtylyov@xxxxxxxxxxxxx> · Fri, 22 Jun 2007 19:32:44 +0400

Hello.

Linas Vepstas wrote:

Reset HPT36x's DMA state machine on a DMA timeout the way it's done for HPT370.

Signed-off-by: Sergei Shtylyov <sshtylyov@xxxxxxxxxxxxx>

---
Linas, here's what I've come up with -- this should apply against 2.6.21.y.
Compile-tested only, not for merging.

drivers/ide/pci/hpt366.c |   24 +++++++++++++++++++++++-

This worked great!  The patch is good. But it raises another interesting
issue, one of those akpm ZFS "violates boundaries" issues.

However, when RAID goes to reconstruct the partition, I get one
of the Drive Ready Seek Complete etc. messages.  Your handler recovers 

   I hope you meant those messages were preceded by DMA timeouts (otherwise 
this code wouldn't come into action).

from it (I put in a printk to verify this).

   You mean into my ide_dma_timeout() method?

And so these printk's
try to get logged into /var/log/messages ... which triggers more
errors. At a very high rate ... sometimes hundreds a second, sometimes
less.  The system remains usable, but at one point, it hit 60% CPU usage
spewing these messages to the screen.

   Hm...

I'd like to see several things.

1) This patch should go in.  It converts a system that hangs into
   one that doesn't hang.

   What's strange is that it never seemed to be necessary before your great 
new drive... ;-)
   So, providing its data certainly wouldn't hurt -- perhaps we just should 
blacklist it instead -- maybe there's a UDMA speed at which this wouldn't 
happen, and we could just limit the drive to it.

2) There needs to be a way of failing the disk when there's a high
   number of errors. e.g. if there are more than 100 errors per minute
   then the disk needs to be marked "failed" in the raid array.

   Note it should be stopped only if the rate is high: if there is 
   only 1 error per minute, this might be very annoying, but acceptable,
   esp. if one is just trying to copy data off the disk.

   I'm not sure what to do if this had been the only disk in the system.
   Maybe if the error rate exceeds 100/minute, then DMA is turned off 
   permanently?

   In fact, it should be turned off after 3 DMA errors (causing PIO retries).
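
   To make the two policies above concrete, here is a rough userspace sketch
of the bookkeeping only: give up on DMA after 3 DMA errors, and mark the
device failed once more than 100 errors land within one minute. This is not
the IDE core or md code; struct err_tracker, tracker_init() and
tracker_record_error() are hypothetical names invented for illustration, and
timestamps are plain seconds supplied by the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ERR_WINDOW_SECS 60                    /* "per minute" from the thread */
#define ERR_RATE_LIMIT  100                   /* more than this per window => failed */
#define DMA_ERR_LIMIT   3                     /* DMA errors before giving up on DMA */
#define RING_SIZE       (ERR_RATE_LIMIT + 1)

/* Hypothetical per-device error bookkeeping (not a kernel structure). */
struct err_tracker {
	long stamps[RING_SIZE];   /* times (in seconds) of the most recent errors */
	size_t head;              /* next slot to overwrite */
	size_t count;             /* valid entries, saturates at RING_SIZE */
	unsigned int dma_errors;  /* DMA errors seen so far */
	bool dma_enabled;         /* false once DMA has been given up on */
	bool failed;              /* true once the error rate was exceeded */
};

void tracker_init(struct err_tracker *t)
{
	memset(t, 0, sizeof(*t));
	t->dma_enabled = true;
}

/* Record one error seen at time now_secs; was_dma marks a DMA error. */
void tracker_record_error(struct err_tracker *t, long now_secs, bool was_dma)
{
	/* Policy 1: stop using DMA after DMA_ERR_LIMIT DMA errors. */
	if (was_dma && t->dma_enabled && ++t->dma_errors >= DMA_ERR_LIMIT)
		t->dma_enabled = false;

	/* Remember this error, overwriting the oldest entry when full. */
	t->stamps[t->head] = now_secs;
	t->head = (t->head + 1) % RING_SIZE;
	if (t->count < RING_SIZE)
		t->count++;

	/*
	 * Policy 2: once the ring is full, head points at the oldest of the
	 * last RING_SIZE errors.  If that one is still inside the window,
	 * more than ERR_RATE_LIMIT errors happened within the window.
	 */
	if (t->count == RING_SIZE &&
	    now_secs - t->stamps[t->head] < ERR_WINDOW_SECS)
		t->failed = true;
}
```

   A caller would run tracker_init() once per device and call
tracker_record_error() from wherever errors get reported, then check
dma_enabled and failed afterwards. With these thresholds, one error per
minute never trips the failed flag no matter how long it goes on, which is
the "annoying but acceptable" case described above.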

--linas

MBR, Sergei