On 06/11/2018 04:08 PM, James Bottomley wrote:
On Mon, 2018-06-11 at 14:59 -0700, Ted Cabeen wrote:
On 06/11/2018 02:40 PM, James Bottomley wrote:
On Mon, 2018-06-11 at 12:20 -0400, Douglas Gilbert wrote:
I have also seen Aborted Command sense when doing heavy testing
on one or more SAS disks behind a SAS expander. I put it down to
a temporary lack of paths available (on the link between the
host's HBA and the expander) when one of those SAS disks tries to
get a connection back to the host with the data (data-in
transfer) from an earlier READ command.
In my code (ddpt and sg_dd) I treat it as a "retry" type error
and in my experience that works. IOW a follow-up READ with the
same parameters is successful.
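
For illustration only, the "treat Aborted Command as retryable" idea can be sketched in user space roughly as below. This is not the ddpt or sg_dd code: read_with_retry(), issue_read_fn and the flaky_read() stub are hypothetical names made up for this sketch, and the only fixed value is the ABORTED COMMAND sense key (0x0b).

```c
#include <assert.h>
#include <stddef.h>

/* Sense key values as defined by SPC. */
#define SENSE_KEY_NO_SENSE        0x00
#define SENSE_KEY_ABORTED_COMMAND 0x0b

/* Hypothetical callback: issues one READ and returns the resulting sense key. */
typedef int (*issue_read_fn)(void *ctx);

/*
 * Issue the READ once, then re-issue it with the same parameters for as
 * long as it comes back with ABORTED COMMAND, up to max_retries extra
 * attempts. Returns the final sense key; *attempts (if non-NULL) receives
 * the total number of READs issued.
 */
static int read_with_retry(issue_read_fn issue, void *ctx,
                           int max_retries, int *attempts)
{
    int key;
    int tries = 0;

    assert(issue != NULL);
    do {
        key = issue(ctx);
        tries++;
    } while (key == SENSE_KEY_ABORTED_COMMAND && tries <= max_retries);

    if (attempts != NULL)
        *attempts = tries;
    return key;
}

/* Stub device (assumption): aborts the first *ctx calls, then succeeds. */
static int flaky_read(void *ctx)
{
    int *aborts_left = ctx;

    if (*aborts_left > 0) {
        (*aborts_left)--;
        return SENSE_KEY_ABORTED_COMMAND;
    }
    return SENSE_KEY_NO_SENSE;
}
```

With a stub that aborts once, read_with_retry(flaky_read, &aborts, 3, &n) issues two READs and the second one succeeds, which matches the behaviour described above.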
We do treat ABORTED_COMMAND as a retry. However, it will tick down
the retry count (usually 3) and then fail if it still occurs. How
long does this condition persist? If it's long-lived, we could treat
it as ADD_TO_MLQUEUE, which would mean we'd retry until the timeout
condition was reached.
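
To make the difference between the two dispositions concrete, here is a small user-space model. It is only a sketch and not the SCSI midlayer code: the policy names, struct cmd_state, should_reissue() and simulate() are all hypothetical, and the model assumes every attempt takes a fixed amount of time and is aborted for as long as the fault lasts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical policies modelling the two behaviours discussed above. */
enum policy {
    POLICY_COUNTED_RETRY,         /* each abort ticks down a retry budget */
    POLICY_REQUEUE_UNTIL_TIMEOUT  /* requeue freely, bounded only by time */
};

struct cmd_state {
    int  retries;     /* counted retries used so far */
    int  allowed;     /* retry budget, e.g. 3 */
    long elapsed_ms;  /* time since first submission */
    long timeout_ms;  /* overall deadline for the command */
};

/* Decide whether an aborted command is issued again under policy p. */
static bool should_reissue(enum policy p, struct cmd_state *c)
{
    if (c->elapsed_ms >= c->timeout_ms)
        return false;              /* deadline reached under either policy */
    if (p == POLICY_COUNTED_RETRY) {
        if (c->retries >= c->allowed)
            return false;          /* retry budget exhausted */
        c->retries++;
    }
    return true;
}

/*
 * Model a fault that aborts every attempt started before fault_ms, with
 * each attempt costing step_ms. Returns true if the command eventually
 * succeeds, false if it is failed up to the caller.
 */
static bool simulate(enum policy p, long fault_ms, long step_ms,
                     int allowed, long timeout_ms)
{
    struct cmd_state c = { 0, allowed, 0, timeout_ms };

    assert(step_ms > 0);
    for (;;) {
        if (c.elapsed_ms >= fault_ms)
            return true;           /* fault has cleared, attempt succeeds */
        c.elapsed_ms += step_ms;   /* attempt was aborted, time passes */
        if (!should_reissue(p, &c))
            return false;
    }
}
```

In this model a 500 ms fault with 100 ms attempts and a budget of 3 fails under the counted policy but succeeds under the requeue policy, while a fault that outlasts the timeout fails under both.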
On my system, it's a bit hard to tell, because as soon as ZFS sees the
read error, it starts resilvering to repair the sector that reported
the I/O error. Without the scrub, it happened once over a 5-day
window. During the scrub, it was usually tens of minutes between
occurrences that failed all the retries, but I had some occasions
where it happened about 5-10 minutes apart. It definitely seems to
be load-related, so how long and hard the load stays elevated is a
factor.
OK, try this: it will print a rate-limited warning if it triggers
(showing it is this problem) and return ADD_TO_MLQUEUE for all the SAS
errors (we'll likely narrow this if it works, but for now let's do the
lot).
I replaced the HBA in this system with a new one, and the problem
resolved, so this was an intermittent hardware issue, and not
software-related. Thanks for digging in with me; it helped a lot to
fully understand the software side.
--Ted