Re: RAID1 scrub ignoring read errors?

On 6/12/18 10:33 pm, Niklas Hambüchen wrote:
On 2018-12-04 01:27, Brad Campbell wrote:
Try running a read on the disk with :
dd if=/dev/sdX of=/dev/null bs=1M conv=noerror

Hey Brad, thanks for your reply!

I first tried reading only around the first problematic sector 1758544.
First the one directly before it:

   # dd bs=512 if=/dev/sdb of=/dev/null skip=1758543 count=1
   1+0 records in
   1+0 records out
   512 bytes copied, 0,00713634 s, 71,7 kB/s

Now the problematic sector:

   # dd bs=512 if=/dev/sdb of=/dev/null skip=1758544 count=1
   dd: error reading '/dev/sdb': Input/output error
   0+0 records in
   0+0 records out
   0 bytes copied, 7,00467 s, 0,0 kB/s

Error after 7 seconds, seems like timeouts are working as expected.
After I did so, I got in smartctl:

   ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   ...
   197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1

So that seems to work as expected.
Why did it not increase when the RAID1 scrub had the read failures though?
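To compare the counter before and after a scrub, the raw value can be pulled out of the attribute table. This is only a sketch: it assumes the ten-column layout shown above, and the helper name pending_sectors is made up for illustration.

```shell
# Print the raw value of SMART attribute 197 (Current_Pending_Sector)
# from `smartctl -A` output supplied on stdin.
# Assumes the 10-column attribute table shown above, where the raw value
# is the last (10th) field.
pending_sectors() {
    awk '$1 == 197 { print $10 }'
}

# Typical use (needs root and a real device):
#   smartctl -A /dev/sdb | pending_sectors
```

Running it once before the scrub and once after shows whether the drive registered the failed reads at all.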

That is puzzling, but if I've learned one thing about drives and SMART, it's that implementations are inconsistent across manufacturers, drive families and even firmware versions. You just can't rely on it.

Puzzling also as to why md didn't re-write that sector when it found a read error. I have it do that from time to time on RAID-6.
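For reference, md exposes scrub control through sysfs, so a check or repair pass can be started by hand and the mismatch counter read afterwards. A rough sketch, assuming the array is md0; the function names and the SYSFS_ROOT override are hypothetical and exist only so the helpers can be dry-run against a scratch directory.

```shell
# md_sync_action ARRAY ACTION
# Write ACTION (check, repair or idle) to the array's sync_action file.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
md_sync_action() {
    md=$1
    action=$2
    case $action in
        check|repair|idle) ;;
        *) echo "unsupported action: $action" >&2; return 1 ;;
    esac
    echo "$action" > "${SYSFS_ROOT:-/sys}/block/$md/md/sync_action"
}

# md_mismatch_cnt ARRAY
# Print the array's mismatch_cnt value.
md_mismatch_cnt() {
    cat "${SYSFS_ROOT:-/sys}/block/$1/md/mismatch_cnt"
}

# Typical use (needs root):
#   md_sync_action md0 check
#   cat /proc/mdstat          # watch progress
#   md_mismatch_cnt md0
```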


I am now running the dd you suggested on the whole disk, which will take a couple hours.

That'll just highlight any other duff sectors that might be after the one that triggers the SMART test failure.
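To narrow down a suspect region without waiting for a whole-disk pass, the single-sector dd reads quoted earlier can be wrapped in a loop. A minimal sketch, assuming 512-byte logical sectors; probe_sectors is a made-up name, and the loop only reads from the device.

```shell
# probe_sectors DEVICE FIRST LAST
# Read each 512-byte sector in [FIRST, LAST] on its own and report whether
# dd could read it. Read-only: nothing is written to DEVICE.
# Note: reads may be served from the page cache; with GNU dd, adding
# iflag=direct should force them to hit the drive.
probe_sectors() {
    dev=$1
    s=$2
    last=$3
    while [ "$s" -le "$last" ]; do
        if dd if="$dev" of=/dev/null bs=512 skip="$s" count=1 2>/dev/null; then
            echo "$s ok"
        else
            echo "$s FAILED"
        fi
        s=$((s + 1))
    done
}

# Example: check the sectors on either side of the known bad one.
#   probe_sectors /dev/sdb 1758540 1758548
```

Each unreadable sector can take several seconds to time out, so keep the window small.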

Recovery:

Also I'd like to ask what my recovery strategy should be.
My current understanding is that some sectors are unreadable on sda and some unreadable on sdb.
As per explanations so far, these can be fixed by re-writing from the corresponding other devices.
Now, sda seems to be truly broken, given that the RAID scrub reported that the write failed.

Yeah, a write error isn't good. I'd be replacing that drive yesterday.

This means that if I replace sda with a new disk first, I will not be able to recover the unreadable sectors on sdb (via copies from sda, because it would be gone).

Ideally I would be able to first fix all unreadable sectors on sdb by copying the relevant sectors from sda.
But I don't know if that's possible, because it seems the scrub stops at the first write error to sdb.

What should I do?

Personally (and granting that my methods are most likely less than optimal)?

If you are serious about replacing your drives (or sda at least), I'd get a third disk, create a new RAID-1 from the new disk with one drive missing, copy the data from the old RAID to the new RAID and then add the old sdb to it. Before adding it, I'd be inclined to write zeros to the entire old sdb to force a reallocation of any pending sectors, even though the RAID rebuild will write most of the disk anyway.
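Spelled out as commands, that sequence might look roughly like the following. It is a sketch, not a tested procedure: /dev/sdc, /dev/md1 and the use of whole disks rather than partitions are assumptions, and the function only prints the commands so they can be reviewed before anything destructive is run.

```shell
# print_migration_plan NEW_DISK OLD_DISK NEW_ARRAY
# Print (do not run) the commands for the degraded-array migration.
# All device names passed in are placeholders chosen by the caller.
print_migration_plan() {
    new=$1
    old=$2
    md=$3
    cat <<EOF
mdadm --create $md --level=1 --raid-devices=2 $new missing
# make a filesystem on $md, mount it, and copy the data from the old array
# stop the old array, then overwrite the old disk (destroys its contents;
# dd ends with "No space left on device" when it reaches the end of the disk):
dd if=/dev/zero of=$old bs=1M
mdadm $md --add $old
EOF
}

print_migration_plan /dev/sdc /dev/sdb /dev/md1
```

Until the final --add has finished resyncing, the new array has no redundancy, so the old array (or a backup) remains the only second copy during the copy step.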

If you are serious about keeping your redundancy, then two new drives into a new RAID-1 and copy the data.

Drives are cheap. Backups are cheap. Data recovery is expensive.

Regards,
Brad


