Robert Hancock wrote:
>> This is kind of a longstanding problem which has been partially worked
>> around, but it seems not entirely. This is what I had diagnosed some
>> time ago:
>>
>> "recently, some issues cropped up with command timeouts when a cache
>> flush command was immediately followed by an NCQ write. In this case,
>> sometimes when the NCQ write was issued, the status register changed
>> from 0x500 (Stopped and Idle) to 0x400 (Stopped) as it normally
>> appears to, however it seems like the controller would get hung in
>> that state, and we would time out with no notifiers set, the gen_ctl
>> register not indicating interrupt status, and the CPB response flags
>> still 0 as we left them, seemingly indicating the controller hasn't
>> done anything with it. Then, when the error handler kicks in we clear
>> the GO bit to put it back into register mode, but the Legacy flag in
>> the status register doesn't get set (or at least it takes longer than
>> 1 microsecond). Finally when we do an ADMA channel reset that seems to
>> get it responding again, until this happens the next time.
>>
>> From some experimentation, I found that when we are issuing a NCQ
>> command when the last command was non-NCQ, or vice versa, if I added in
>> a delay of 20 microseconds between setting up the CPB and writing to the
>> append register, the problem appeared to go away. Problem is I don't
>> know if that's because it actually needs this delay, or because it
>> changes the timing so that it happens to work even though we're doing
>> something wrong, there's some event we're not waiting for, etc.
>>
>> I've now verified that no switches between ADMA and register mode
>> occur near the time of these timeouts. Neither are we reading or
>> writing any of the ATA shadow registers while we're in ADMA mode."
>>
>> It seems likely that this is what is happening here (a switch from an
>> NCQ command to a non-NCQ command, then the non-NCQ times out).
>> It could be in some cases the 20 microsecond delay is not enough. But it
>> seems bogus that we should need such an arbitrary delay in the first
>> place.
>>
>> The question I had for NVIDIA regarding this that I never got answered
>> was, is there any reason why we would need a delay when switching
>> between NCQ and non-NCQ commands on ADMA, and if not, is there any
>> known cause that could cause the controller to get into this seemingly
>> locked-up state?
>
> Well, I guess I did sort of get an answer, but the only theory was that
> the flush and the NCQ commands were being overlapped, which shouldn't be
> possible (the libata core guarantees that, and if it didn't work it
> would affect all controllers).
>
> I'm kind of wondering if there's something funny going on with the
> notifier register stuff, which is supposed to tell us what commands have
> completed. We don't really use it at all (we had some problems with
> missed completions, etc. when I tried using it, also it doesn't work if
> ATAPI is enabled on the other port on the controller, apparently). I
> know these controllers will do strange things like not signalling
> interrupts for later events if you don't clear the notifiers in just the
> right way (that being mostly determined by trial and error).
>
> Or, maybe somehow the flush is getting issued before the controller is
> really "ready" for it somehow (it's not finished cleaning up after
> preceding NCQ command).
>
> It's pretty hard for me to figure out which of the above might be the
> case, especially without access to the detailed controller documentation.

Thanks a lot for the detailed explanation. Nvidia ppl, any ideas? FLUSH
is used regularly. We really need to fix this.

--
tejun
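
For anyone following along, here is a rough userspace sketch of the workaround Robert describes: remember whether the previous command was NCQ, and if the type changes, wait 20 microseconds between setting up the CPB and writing the append register. This is not the sata_nv code. struct port_state, delay_us(), write_append() and issue_command() are made-up names for illustration, the delay is only counted rather than actually waited (a driver would presumably busy-wait with something like udelay()), and skipping the delay for the very first command is my assumption, not something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Delay value taken from the experiment described in the quoted mail. */
#define MODE_SWITCH_DELAY_US 20

/* Hypothetical per-port bookkeeping; not the real driver structures. */
struct port_state {
    bool have_last;     /* false until the first command has been issued */
    bool last_was_ncq;  /* type of the previously issued command */
    unsigned delays;    /* number of times the workaround delay was applied */
    unsigned appended;  /* number of simulated append register writes */
};

/* Stand-in for a busy-wait: only counts the call, does not really wait. */
static void delay_us(struct port_state *p, unsigned us)
{
    (void)us;
    p->delays++;
}

/* Stand-in for writing the command tag to the append register. */
static void write_append(struct port_state *p, unsigned tag)
{
    (void)tag;
    p->appended++;
}

/*
 * Issue one command. Returns true if the workaround delay was applied,
 * i.e. the command type differs from the previous command's type.
 */
bool issue_command(struct port_state *p, unsigned tag, bool is_ncq)
{
    assert(p != NULL);

    /* Assumption: the first command on a port is never delayed. */
    bool switched = p->have_last && (p->last_was_ncq != is_ncq);

    /* ... CPB setup for this tag would happen here ... */

    if (switched)
        delay_us(p, MODE_SWITCH_DELAY_US);

    write_append(p, tag);

    p->have_last = true;
    p->last_was_ncq = is_ncq;
    return switched;
}
```

Issuing NCQ, NCQ, non-NCQ (e.g. a flush), NCQ on a zero-initialized port_state applies the delay for the last two commands only. Note this only shows where the delay sits; it says nothing about whether 20 microseconds is actually sufficient, which is the open question in this thread.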