Re: Need to understand error messages

Hannes Reinecke <hare@xxxxxxx> · Fri, 7 Mar 2025 08:16:20 +0100

On 3/7/25 05:20, Eyal Lebedinsky wrote:
On 7/3/25 12:27, Roger Heflin wrote:
That is the reported uncorrectable error coming back to the OS, i.e.
sense key: medium error.

It looks like you had a few commands lined up (the tags) and one IO
hung (2888) and eventually failed (bad sector), but it took long enough
that it timed out on all of the other IO behind it (the SOFT_ERROR).

The scsi layer should have retried the SOFT ones I would think.

You might want to check what smartctl -l scterc says the disk's
timeout is and what the OS-level SCSI timeout is.  I set the disk
timeouts as low as the disk will allow and leave my OS timeouts at the
default (30 sec typically).

SCT Error Recovery Control:
            Read:     70 (7.0 seconds)
           Write:     70 (7.0 seconds)
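
The comparison suggested above can be sketched in Python. This is only an illustration under assumptions: the helper names (parse_scterc, os_scsi_timeout, erc_fits) are hypothetical, the parser expects the numeric output format quoted above, and the OS timeout is assumed to be readable from /sys/block/<dev>/device/timeout, as is usual for SCSI disks on Linux.

```python
import re
from pathlib import Path

# Sample text copied from the SCT ERC values quoted in this thread.
SAMPLE = """\
SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)
"""

def parse_scterc(text):
    """Return {'Read': seconds, 'Write': seconds} from `smartctl -l scterc` output.

    The raw number is in tenths of a second; non-numeric entries are skipped.
    """
    limits = {}
    for kind, deciseconds in re.findall(r"^\s*(Read|Write):\s+(\d+)\s", text, re.M):
        limits[kind] = int(deciseconds) / 10.0
    return limits

def os_scsi_timeout(dev):
    """Read the per-device SCSI command timeout in seconds (dev like 'sda').

    Assumes the usual Linux sysfs location for SCSI disks.
    """
    return int(Path(f"/sys/block/{dev}/device/timeout").read_text())

def erc_fits(limits, os_timeout):
    """True if ERC limits were found and each is shorter than the OS timeout."""
    return bool(limits) and all(sec < os_timeout for sec in limits.values())

print(parse_scterc(SAMPLE))
```

With the values quoted above (7.0 seconds for both read and write) and a 30 second OS timeout, erc_fits() returns True, i.e. the disk should give up on a bad sector well before the kernel times the command out.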

I would have thought there would be a md rewrite.

I also thought so. The fact that I now see 48 Reallocated_Sector_Ct
suggests that there were writes to the failed sectors, since a failed
read adds a Pending sector and the subsequent write then leads to
Reallocation. Now Current_Pending_Sector is zero.

Also, 48 reallocated is more than the one failed sector the disk sensed,
and the timed-out tags that followed are something only the OS saw (so
the disk should not reallocate for them?).

The MPT hardware has a very poor queueing implementation. It exposes a
SCSI host with literally thousands of commands, but the component drives
only have a queue depth of 31. So there is a mismatch, and there are
issues when a long-running command (e.g. a command triggering error
handling) blocks the pending commands already queued within the
firmware.
In these cases the offending command will cause the _pending_ commands
to time out, even though they would probably be perfectly fine if they
hadn't been blocked. And returning 'QUEUE_FULL' status would be too
easy...
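
A toy model of that blocking effect, as a sketch only: it assumes all commands are handed to the HBA at the same moment, that the OS timer for each starts then, and that the drive completes them strictly one after another. Real firmware and NCQ behaviour are more complicated. The function name and the service times are made up for illustration; only the 30 second OS timeout comes from this thread.

```python
def simulate(service_times, os_timeout=30.0):
    """Return the tags whose completion time exceeds the OS timeout.

    service_times[i] is how long the drive spends on command i (seconds).
    Commands are served in order, so each one waits for all earlier ones.
    """
    clock = 0.0
    timed_out = []
    for tag, service in enumerate(service_times):
        clock += service          # this command completes at `clock`
        if clock > os_timeout:    # the OS has already given up on it
            timed_out.append(tag)
    return timed_out

# Three quick commands, one stuck in error recovery for 35 s, five more quick ones.
queue = [0.01] * 3 + [35.0] + [0.01] * 5
print(simulate(queue))
```

In this model the stuck command (tag 3) and every command queued behind it are reported as timed out, although each of the later ones needs only 10 ms of actual drive time.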

Anyway.
Fact is, your drive developed read errors. And experience shows that
a read error is the beginning of the end for a drive.
So I would recommend not investigating further but rather getting a new
drive.

Cheers,

Hannes
--
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@xxxxxxx                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich