Re: No I/O errors reported after SATA link hard reset

Bernd Schubert <bernd.schubert@xxxxxxxxxxx> · Thu, 17 Aug 2017 15:43:10 +0200

On 08/17/2017 03:25 PM, Tejun Heo wrote:
> Hello,
> 
> On Thu, Aug 17, 2017 at 03:18:06PM +0200, Bernd Schubert wrote:
>> So for Gionatan the root cause was an unstable power supply, but in my
>> case there wasn't any power loss, there were just failed sata commands.
>> I'm not sure if this was a port or cable issue - once I changed port and
>> sata cable the errors disappeared. I didn't change the power supply or
>> power cable. I'm now basically fighting with the data corruption that
>> caused - for btrfs it at least has a checksum, but I didn't have ext4
>> checksum enabled, so it is hard to figure out which files are corrupt -
>> silent data corruption is not well handled by backups either.
> 
> No idea there.  Retried and recovered errors shouldn't cause data
> corruptions.  Flaky power can behave in unexpected ways tho.  What
> happens if you hook up the drive on a different power supply but
> revert to the port / cable which showed the problem?  What do your
> SMART counters say across those failures?

Hmm, well, I think I threw the cable away already, and I also don't
have a spare power supply at home. The errors also weren't easy to
reproduce: they came up when my wife was working on her system, not
when I was controlling it ;)

> 
>> Is it possible that sata eh recovery sends resets to the device, which
>> makes it evict its cache?
> 
> That'd be a very broken device.  It sure is theoretically possible but
> I haven't seen any reports on such behaviors yet.

I wonder whether we could make the error handler report such issues for
people who are running RAID. Gionatan's power loss problem and my
unexplained corruption issue probably wouldn't have happened if the
upper md layer had been told to report errors instead of recovering
from them. Although I admit it is a difficult decision what to do with
link glitches.
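
As a stopgap, a userspace script could at least flag the resets so they
don't go unnoticed. What follows is only a minimal sketch, not anything
the kernel or md provides: it assumes libata logs lines of the form
"ataN: hard resetting link" (the exact wording may differ between kernel
versions), and the function names count_hard_resets and
ports_over_threshold are made up for illustration.

```python
import re
from collections import Counter

# Assumed message shape, e.g. "ata3: hard resetting link".
# The exact wording may differ between kernel versions.
RESET_RE = re.compile(r"\b(ata\d+): hard resetting link")


def count_hard_resets(log_text):
    """Map each port name (e.g. 'ata3') to the number of hard resets logged."""
    return Counter(m.group(1) for m in RESET_RE.finditer(log_text))


def ports_over_threshold(log_text, threshold=1):
    """Return the sorted port names with at least `threshold` hard resets."""
    counts = count_hard_resets(log_text)
    return sorted(port for port, n in counts.items() if n >= threshold)
```

Feeding it the kernel log text (for example the output of dmesg) and
mailing the result from cron would at least tell a RAID user which port
to look at, even though md itself never sees the recovered errors.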

Thanks,
Bernd