Re: fixed a bug of adma in rhel4u5 with HDS7250SASUN500G.

Robert Hancock <hancockr@xxxxxxx> · Sun, 13 Jan 2008 23:20:13 -0600

Kuan Luo wrote:
Robert Hancock wrote:
What problem does this resolve? I tested it against the cache flush/NCQ
write switching problem we've been trying to solve, and it doesn't look
like it fixes that one - if I apply this patch and then remove the
udelay(20) in sata_nv.c that I added which prevented me from seeing this
problem before, it shows up.

First, thanks to Davide for helping to send the attachment.

Robert,
The patch fixes the error message "ata1: CPB flags CMD err,
flags=0x11" seen when testing the HDS7250SASUN500G in rhel4u5.
I also tested this drive in 2.6.24-rc7, where the mask in the
blacklist had to be removed to run NCQ, and the same error showed up.

I traced the bug and found that the interrupt handler finished a command
(for example, tag 0) when the driver saw that the ADMA status was
NV_ADMA_STAT_DONE and cpb->resp_flags was NV_CPB_RESP_DONE.
However, with this drive, the drive may not have cleared bit 0 yet at that
moment, which meant the hardware had not completely finished the command.
If the driver freed the command (tag 0) at that point and sent
another command with tag 0, the error occurred.

The notifier register is a 32-bit register containing the notifier value.
The value is a bit vector with one bit per tag number (0-31) in the
corresponding bit position (bit 0 is for tag 0, etc.). When a bit is set,
ADMA indicates that the command with the corresponding tag number has
completed execution.
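
To make the bit layout concrete, here is a minimal user-space sketch of reading such a bit vector. The helper names (notifier_tag_done, notifier_count_done) are hypothetical and do not come from sata_nv.c; the sketch only assumes the one-bit-per-tag layout described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if the notifier value reports `tag` (0-31)
 * as having completed execution. */
static bool notifier_tag_done(uint32_t notifier, unsigned int tag)
{
    return tag < 32 && (notifier & (UINT32_C(1) << tag)) != 0;
}

/* Hypothetical helper: number of tags the notifier reports as completed. */
static unsigned int notifier_count_done(uint32_t notifier)
{
    unsigned int n = 0;

    while (notifier) {
        notifier &= notifier - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

For example, a notifier value of 0x5 would report tags 0 and 2 as completed and tag 1 as still outstanding.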

So I added the notifier check code. Sometimes I saw that the notifier
register had some bits set, but the ADMA status was
NV_ADMA_STAT_CMD_COMPLETE, not NV_ADMA_STAT_DONE. So I also added the
NV_ADMA_STAT_CMD_COMPLETE check code.
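
One possible reading of the combined check, as a stand-alone sketch rather than the actual patch: a tag is only treated as finished when the status reports either completion flag, the CPB response is done, and the notifier bit for that tag is set. The EX_* constants and ex_can_complete are hypothetical names with placeholder bit values; they are not the definitions from sata_nv.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for illustration only; NOT taken from sata_nv.c. */
#define EX_STAT_DONE          (UINT32_C(1) << 0)
#define EX_STAT_CMD_COMPLETE  (UINT32_C(1) << 1)
#define EX_CPB_RESP_DONE      (UINT32_C(1) << 0)

/* Hypothetical check: may the driver complete (and later reuse) `tag`? */
static bool ex_can_complete(uint32_t adma_status, uint32_t cpb_resp_flags,
                            uint32_t notifier, unsigned int tag)
{
    if (tag >= 32)
        return false;
    /* Accept either completion indication in the ADMA status. */
    if (!(adma_status & (EX_STAT_DONE | EX_STAT_CMD_COMPLETE)))
        return false;
    /* The CPB must also report that the response is done. */
    if (!(cpb_resp_flags & EX_CPB_RESP_DONE))
        return false;
    /* Finally, the notifier must confirm this particular tag. */
    return (notifier & (UINT32_C(1) << tag)) != 0;
}
```

Under these assumptions, a command whose status and CPB flags look done but whose notifier bit is not yet set stays outstanding, so its tag is not handed to a new command too early.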

That looks like a good fix then. (Though a possible optimization would
be to AND the check_commands value with the notifier clear value rather
than testing against the notifier on each loop. That's fairly minor though.)
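
A rough sketch of that optimization, outside the driver: mask the candidate bitmap once up front, then walk only the surviving bits. The function name ex_collect_tags and the parameter names are hypothetical, and the sketch assumes both values use the same one-bit-per-tag layout as the notifier register.

```c
#include <stdint.h>

/* Hypothetical sketch: AND the candidate tags with the notifier value once,
 * then collect the remaining tags in ascending order. Returns how many
 * tags were written to tags_out. */
static unsigned int ex_collect_tags(uint32_t check_commands,
                                    uint32_t notifier_clear,
                                    unsigned int tags_out[32])
{
    /* Single mask operation instead of a notifier test per iteration. */
    uint32_t pending = check_commands & notifier_clear;
    unsigned int n = 0;

    for (unsigned int tag = 0; tag < 32; tag++) {
        if (pending & (UINT32_C(1) << tag))
            tags_out[n++] = tag;
    }
    return n;
}
```

With check_commands = 0xF and a notifier value of 0x5, only tags 0 and 2 would be collected.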

As I mentioned, this doesn't seem to resolve the problem we're seeing 
with rapidly intermixed NCQ commands and cache flushes (at least, if I 
take out the arbitrary 20usec delay from the driver and add this patch, 
the problem still shows up). It could be a similar problem, though, of 
commands being issued before the controller is really ready for them. If 
you or others at NVIDIA could assist in tracking down that problem it 
would be appreciated.