On 5/27/2013 8:32 PM, Baruch Even wrote:
> necessary but the command itself if it is already actively handled
> continues in its path. The abort only cancels those commands that are in
> the queue and if there really was a problem and the disk is engaging in
> error recovery of its own you'll just have no response from it and it will
> seem dead (abort may timeout).

Yes, the abort seems to be handled more like a "hint" in many cases.
Having coded a couple of targets, I can say that abort handling is often
_REALLY_ hard to get 100% right, especially when it's an actual error
that is causing the delay rather than a correctly functioning
long-running command.

That said, I've seen devices respond to aborts on tape ERASE and similar
commands by actually aborting the command, as one would expect. So it
does sometimes work.

Besides abort timeouts (which are major bad karma), the abort may be
accepted, and the next non-INQUIRY/TUR command that gets queued simply
blocks waiting for the abort to complete internally.

From the target device's perspective, if you don't send out a response
to the ABTS within 2*RA_TOV, your problems start to multiply. That
encourages target devices to treat aborts asynchronously. As you said,
the device simply finds the indicated command on a queue, marks it as
being aborted, and hopes whatever is processing the command notices and
terminates its operation. On subsequent commands, the nicer devices will
notice the abort hasn't completed and return "becoming ready" or similar
in response to TUR etc. for some number of minutes.

>
> This view of aborts also means that reducing timeouts for commands and TMFs
> is mostly useless and sometimes even a really bad idea. I prefer to just
> let the device go on with its error recovery and just forget about the
> command. I want to forget about the DMA so I issue an abort but anything
> higher than that means a link is dead to me.
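
The asynchronous abort handling described in the reply above (find the
command, mark it as being aborted, hope the worker notices, answer TUR
with "becoming ready" in the meantime) might look roughly like the
following. This is only a minimal sketch, not any real target's
firmware: the Target and Command classes, the state names, and the
step-based worker are all hypothetical, and mapping "becoming ready" to
sense key 02h with ASC/ASCQ 04h/01h is an assumption about which sense
data is meant.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    QUEUED = auto()
    ACTIVE = auto()
    ABORTING = auto()   # abort accepted, worker has not noticed yet
    DONE = auto()
    ABORTED = auto()


@dataclass
class Command:
    tag: int
    opcode: str
    steps_left: int
    state: State = State.QUEUED


class Target:
    """Hypothetical target that treats aborts asynchronously."""

    # Assumed sense data: NOT READY, LOGICAL UNIT IS IN PROCESS OF
    # BECOMING READY (sense key 02h, ASC 04h, ASCQ 01h).
    BECOMING_READY = (0x02, 0x04, 0x01)

    def __init__(self):
        self.commands = {}

    def submit(self, cmd):
        self.commands[cmd.tag] = cmd

    def abort(self, tag):
        """Answer the abort immediately; the real work is deferred."""
        cmd = self.commands.get(tag)
        if cmd is None or cmd.state in (State.DONE, State.ABORTED):
            return "no such command"
        if cmd.state is State.QUEUED:
            cmd.state = State.ABORTED    # never started, trivially cancelled
        else:
            cmd.state = State.ABORTING   # only a flag; the worker must see it
        return "accepted"

    def abort_pending(self):
        return any(c.state is State.ABORTING for c in self.commands.values())

    def test_unit_ready(self):
        """The 'nicer device' behaviour: report not ready while aborting."""
        if self.abort_pending():
            return ("CHECK CONDITION", self.BECOMING_READY)
        return ("GOOD", None)

    def worker_step(self, tag):
        """One unit of work; the abort flag is polled only between steps."""
        cmd = self.commands[tag]
        if cmd.state is State.ABORTING:
            cmd.state = State.ABORTED
            return
        if cmd.state is State.QUEUED:
            cmd.state = State.ACTIVE
        if cmd.state is State.ACTIVE:
            cmd.steps_left -= 1
            if cmd.steps_left <= 0:
                cmd.state = State.DONE
```

The point of the sketch is the gap between abort() returning "accepted"
and the worker reaching its next poll: if a step never finishes because
the device is stuck in its own error recovery, the command stays in
ABORTING and every TUR keeps reporting not ready.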
Well, invariably the manufacturers have timeouts that are really long
and based on internal error recovery logic. See
http://www-01.ibm.com/support/docview.wss?uid=ssg1S7003556&aid=1
page 468. Notice the timeouts are specified in minutes, not seconds.
Furthermore, commands that normally complete in fractions of a second
have actual timeouts that can be tens of minutes (READ/WRITE, for
example). So doing anything before that timeout has expired is a good
way to knock the device offline.

Some of the newer disks have mode page options to shorten their
read/write error recovery, but "short" error recovery can still be many
tens of seconds rather than a couple of minutes. Plus, it doesn't help
compound commands like SYNCHRONIZE CACHE, which may hit multiple errors
during a single operation.

This is another part of what formed my opinions about error isolation.
If one of your devices goes out to lunch and isn't recovering via
abort/LUN reset, it's done! Wrecking the rest of the SAN by doing "bus
resets" and HBA resets is a good way to take a serious problem and turn
it into a full-blown catastrophe.
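
The error-isolation policy argued for above can be written down as a
small decision function: do nothing until the device's own documented
timeout has expired, then try abort, then LUN reset, and if both fail
give up on that one device instead of escalating to bus or HBA resets.
This is a hedged sketch of that opinion only, not how any existing
initiator stack or the Linux SCSI error handler behaves. The names
next_action, Action, and DEVICE_TIMEOUT_S are hypothetical, and the
timeout values are placeholders standing in for whatever the device's
documentation specifies.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"
    ABORT = "abort task"
    LUN_RESET = "LUN reset"
    OFFLINE = "take device offline"


# Placeholder per-opcode timeouts in seconds. Real values come from the
# device documentation and are typically minutes, not seconds.
DEVICE_TIMEOUT_S = {"READ": 20 * 60, "WRITE": 20 * 60}


def next_action(opcode, elapsed_s, failed):
    """Pick the next recovery step for one stuck command.

    `failed` is the set of Actions already tried without success.
    The ladder deliberately ends at OFFLINE: there is no bus reset or
    HBA reset step, so one bad device cannot disturb the rest of the SAN.
    """
    if elapsed_s < DEVICE_TIMEOUT_S[opcode]:
        return Action.WAIT          # device may still be recovering on its own
    for step in (Action.ABORT, Action.LUN_RESET):
        if step not in failed:
            return step
    return Action.OFFLINE


# Example: a WRITE stuck past its timeout, abort already tried and failed.
print(next_action("WRITE", 25 * 60, {Action.ABORT}).value)
```

Keeping the ladder this short is the design choice being illustrated:
anything wider than a LUN reset affects devices that were working fine.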