Re: HDD Unrecovered readerror issue

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Fri, 22 Jul 2016 07:33:59 -0700

On Fri, 2016-07-22 at 09:51 -0400, Jeff Moyer wrote:
> Dmitry Monakhov <dmonakhov@xxxxxxxxxx> writes:
> 
> > But once I rewrite this block, problem goes away.
> > #xfs_io -c "pwrite -S 0x0 $((80069000/2))k 4k" -d  /dev/sda
> > 
> > Now I can read it w/o any errors and smartctl is happy
> > #smartctl -t short /dev/sda
> > #smartctl -l selftest /dev/sda
> > Num  Test_Description    Status                  Remaining
> > LifeTime(hours)  LBA_of_first_error
> > # 1  Short offline       Completed without error       00%     
> >  4683 -
> > 
> > So my disk is not dead right?
> 
> Correct.
> 
> > Why the hell HDD fail read from very beginning
> > Is this because HDD firmware detect internal crcXX sum corruption?
> 
> Yes.
> 
> > How this can happen? Is this because of power failure?
> 
> Could be.  If power was cut in the middle of a write, this can 
> happen. There are other causes, though (bit rot, for example).
> 
> > AFAIK standard guarantees that sector will be updated atomically.
> 
> No, the SCSI and ATA standards most certainly do not guarantee that!
> NVMe is the only standard I know of that requires Atomic Write Unit
> Power Fail to be at lest one sector.

The mechanics of the drive mostly ensure atomic updates on the physical
block level.  You definitely get either the old data, the new data or
an unreadable sector.  The latter is a pretty rare event because
surviving power usually ensures the writes complete, but it's not
guaranteed. 

> > But it happens! Please guide me how to fix such problems in
> > general.
> 
> You fixed it.  Overwriting the sector will clear the error.

Actually only "may clear the error" depending on what happened.  If the
hamming codes on the sector itself just failed (because of a torn write
due to power fail) then a rewrite simply re-fixes the sector in situ. 
 Sometimes the magnetic substrate of the track is worn (so the sector
is permanently damaged) and the re-write forces a reallocation.  If
that's happening to your disk then eventually it will fail
irrecoverably when the reallocation table is full.

You can monitor this with the smart Reallocated_Event_Count.

James

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html