On 06/11/2018 02:40 PM, James Bottomley wrote:
> On Mon, 2018-06-11 at 12:20 -0400, Douglas Gilbert wrote:
> > I have also seen Aborted Command sense when doing heavy testing on
> > one or more SAS disks behind a SAS expander. I put it down to a
> > temporary lack of paths available (on the link between the host's HBA
> > and the expander) when one of those SAS disks tries to get a
> > connection back to the host with the data (data-in transfer) from an
> > earlier READ command.
> >
> > In my code (ddpt and sg_dd) I treat it as a "retry" type error and in
> > my experience that works. IOW a follow-up READ with the same
> > parameters is successful.
>
> We do treat ABORTED_COMMAND as a retry. However, it will tick down the
> retry count (usually 3) and then fail if it still occurs. How long
> does this condition persist for? Because if it's long lived we could
> treat it as ADD_TO_MLQUEUE which would mean we'd retry until the
> timeout condition was reached.
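
To make the difference between the two policies concrete, here is a
rough user-space sketch in Python. It is not kernel code and not what
ddpt or sg_dd do internally: every name in it (make_flaky_read,
retry_counted, retry_until_deadline, FakeClock) is made up for
illustration, the device is simulated, and treating "retry count 3" as
one initial attempt plus three retries is an assumption.

```python
# Hypothetical sketch: none of these names come from the kernel, ddpt, or sg_dd.
ABORTED_COMMAND = 0x0B  # SCSI sense key ABORTED COMMAND
NO_SENSE = 0x00


def make_flaky_read(failures):
    """Fake READ that reports ABORTED_COMMAND `failures` times, then succeeds."""
    state = {"left": failures}

    def read():
        if state["left"] > 0:
            state["left"] -= 1
            return ABORTED_COMMAND
        return NO_SENSE

    return read


class FakeClock:
    """Simulated clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


def retry_counted(read, retries=3):
    """Counted policy: fail once the retry budget is used up."""
    for _ in range(1 + retries):  # assumed: first attempt plus `retries` retries
        if read() != ABORTED_COMMAND:
            return True
    return False


def retry_until_deadline(read, timeout_s, clock, delay_s=1.0):
    """Deadline policy: keep reissuing the READ until the timeout expires."""
    deadline = clock.time() + timeout_s
    while True:
        if read() != ABORTED_COMMAND:
            return True
        if clock.time() >= deadline:
            return False
        clock.sleep(delay_s)


# A condition that outlasts the retry count but clears before the timeout:
counted_ok = retry_counted(make_flaky_read(5), retries=3)
deadline_ok = retry_until_deadline(make_flaky_read(5), 30.0, FakeClock())
print(counted_ok, deadline_ok)
```

In this simulation a condition lasting five attempts defeats the
counted policy but not the deadline policy, which is why the duration
of the condition matters for picking between them.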
On my system, it's a bit hard to tell, because as soon as ZFS sees the
read error, it starts resilvering to repair the sector that reported the
I/O error. Without the scrub, it happened once over a 5-day window.
During the scrub, it was usually tens of minutes between occurrences
that failed all the retries, but I had some occasions where it happened
about 5-10 minutes apart. It definitely seems to be load-related, so how
long and hard the load stays elevated is a factor.
--Ted