Re: RAID1 scrub ignoring read errors?

Adam Goryachev <mailinglists@xxxxxxxxxxxxxxxxxxxxxx> · Fri, 7 Dec 2018 11:13:48 +1100

On 7/12/18 3:06 am, Brad Campbell wrote:
On 6/12/18 10:33 pm, Niklas Hambüchen wrote:
On 2018-12-04 01:27, Brad Campbell wrote:
Try running a read on the disk with :
dd if=/dev/sdX of=/dev/null bs=1M conv=noerror

Hey Brad, thanks for your reply!

I first tried reading only around the first problematic sector 1758544.
First the one directly before it:

   # dd bs=512 if=/dev/sdb of=/dev/null skip=1758543 count=1
   1+0 records in
   1+0 records out
   512 bytes copied, 0,00713634 s, 71,7 kB/s

Now the problematic sector:

   # dd bs=512 if=/dev/sdb of=/dev/null skip=1758544 count=1
   dd: error reading '/dev/sdb': Input/output error
   0+0 records in
   0+0 records out
   0 bytes copied, 7,00467 s, 0,0 kB/s

Error after 7 seconds, seems like timeouts are working as expected.
After I did so, I got in smartctl:

   ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   ...
   197 Current_Pending_Sector  0x0032   200   200   000 Old_age   Always       -       1

So that seems to work as expected.
Why did it not increase when the RAID1 scrub had the read failures 
though?
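
For comparing the pending count before and after a scrub, a small helper that pulls the raw value of attribute 197 out of `smartctl -A` output may be handy. This is only a sketch: the function name pending_sectors is made up for illustration, and it assumes the raw value is the last column, as in the output quoted above.

```shell
# pending_sectors: read `smartctl -A` output on stdin and print the
# RAW_VALUE of attribute 197 (Current_Pending_Sector).
# Prints nothing if the drive does not report that attribute.
# (Hypothetical helper; assumes RAW_VALUE is the last field on the line.)
pending_sectors() {
    awk '$1 == 197 && $2 == "Current_Pending_Sector" { print $NF }'
}

# Example (run as root, device name is an assumption):
#   smartctl -A /dev/sdb | pending_sectors
```

Recording the number before the scrub and again afterwards shows whether the drive registered the read failures at all.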

That is puzzling, but if I've learned one thing about drives and 
SMART, it's that implementations are inconsistent across manufacturers, 
drive families and even firmware versions. You just can't rely on it.

Puzzling also as to why md didn't re-write that sector when it found a 
read error. I have it do that from time to time on RAID-6.
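
For reference, a scrub is started and inspected through the array's sysfs directory. The sketch below wraps the two steps in functions that take that directory as an argument; the function names are hypothetical, and /sys/block/md0/md is only an assumed path for the array in question.

```shell
# md_start_check MD_SYSFS_DIR
# Ask md to start a "check" pass (reads every copy and counts mismatches).
md_start_check() {
    echo check > "$1/sync_action"
}

# md_scrub_status MD_SYSFS_DIR
# Print the current sync action and the mismatch count of the last pass.
md_scrub_status() {
    dir=$1
    printf 'sync_action=%s mismatch_cnt=%s\n' \
        "$(cat "$dir/sync_action")" "$(cat "$dir/mismatch_cnt")"
}

# Example (run as root; array name is an assumption):
#   md_start_check  /sys/block/md0/md
#   md_scrub_status /sys/block/md0/md
```

Watching dmesg while the check runs should show whether md logs a read error and a corrective rewrite for the bad sector.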

I am now running the dd you suggested on the whole disk, which will 
take a couple hours.

That'll just highlight any other duff sectors that might be after the 
one that triggers the SMART test failure.
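
If a list of exactly which sectors fail is wanted, rather than dd's error messages scrolling by, a per-sector probe over a range of interest could look like the sketch below. The function name probe_sectors is made up here, and reading sector by sector is far too slow for a whole disk, so this is meant for a small window around a known bad LBA.

```shell
# probe_sectors DEVICE FIRST_LBA COUNT [SECTOR_SIZE]
# Read each sector on its own and print "<lba> ok" or "<lba> FAIL".
# Read-only: dd only ever writes to /dev/null.
# (Hypothetical helper. On a real disk, adding iflag=direct to the dd
# call bypasses the page cache so every read actually hits the drive.)
probe_sectors() {
    dev=$1
    lba=$2
    last=$(( $2 + $3 - 1 ))
    bs=${4:-512}
    while [ "$lba" -le "$last" ]; do
        if dd if="$dev" of=/dev/null bs="$bs" skip="$lba" count=1 2>/dev/null; then
            echo "$lba ok"
        else
            echo "$lba FAIL"
        fi
        lba=$(( lba + 1 ))
    done
}

# Example (run as root; device and range are assumptions):
#   probe_sectors /dev/sdb 1758540 8
```

Each failing sector costs a full drive timeout (about 7 seconds in the test above), so keep the range short.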

Recovery:

Also I'd like to ask what my recovery strategy should be.
My current understanding is that some sectors are unreadable on sda 
and some unreadable on sdb.
As per explanations so far, these can be fixed by re-writing from the 
corresponding other devices.
Now, sda seems to be truly broken, given that the RAID scrub reported 
that the write failed.

Yeah, a write error isn't good. I'd be replacing that drive yesterday.

This means that if I replace sda by a new disk first, I will not be 
able to recover unreadable sectors on sdb (via copies from sda, 
because it'd be gone).

Ideally I would be able to first fix all unreadable sectors on sdb by 
copying the relevant sectors from sda.
But I don't know if that's possible, because it seems the scrub stops 
at the first write error to sdb.

What should I do?

Personally (and granting that my methods are most likely less than 
optimal)?

If you are serious about replacing your drives (or sda at least), I'd 
get a third disk, create a new RAID-1 from the new disk with one drive 
missing, copy the data from the old RAID to the new RAID and then add 
the old sdb to it. I'd be inclined to write zeros to the entire drive 
first to force a reallocation on any pending sectors, even though the 
RAID rebuild will do most of the disk anyway.
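
As a rough illustration of that sequence, the sketch below only prints the commands instead of running them, since two of them are destructive. All device names are assumptions, print_migration_plan is a made-up helper, and the filesystem and copy steps are left as comments because they depend on the setup.

```shell
# print_migration_plan NEW_MEMBER OLD_MEMBER OLD_MD NEW_MD
# Print (do not execute) the steps: degraded RAID-1 on the new disk,
# copy the data, zero the old member, add it to the new array.
# (Hypothetical helper; review every line before running anything.)
print_migration_plan() {
    new=$1
    old=$2
    old_md=$3
    new_md=$4
    cat <<EOF
mdadm --create $new_md --level=1 --raid-devices=2 $new missing
# make a filesystem on $new_md, mount it, copy the data from $old_md, verify it
mdadm --stop $old_md
dd if=/dev/zero of=$old bs=1M
mdadm $new_md --add $old
EOF
}

# Example (all names are assumptions):
#   print_migration_plan /dev/sdc1 /dev/sdb1 /dev/md0 /dev/md1
```

The zeroing step wipes the old member, so it belongs strictly after the copy has been verified.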

Wouldn't an mdadm --replace solve this?

https://unix.stackexchange.com/questions/74924/how-to-safely-replace-a-not-yet-failed-disk-in-a-linux-raid5-array

The system will copy all readable blocks from sdd1 to sdc1. If it 
comes to an unreadable block, it will reconstruct it from parity. Once 
the operation is complete, the former spare (here: sdc1) will become 
active, and the failing drive will be marked as failed (F) so you can 
remove it.

Which sounds like exactly what you want to do...
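
A minimal sketch of the replace sequence, assuming the array is /dev/md0, the failing member is /dev/sda1 and the new disk is partitioned as /dev/sdc1 (all three names are guesses for this setup). The run and replace_member helpers are made up for illustration, and by default the script only prints what it would do.

```shell
DRY_RUN=${DRY_RUN:-1}   # 1 = only print commands, 0 = actually run them

# run CMD [ARGS...]: echo the command, execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = 0 ]; then
        "$@"
    fi
}

# replace_member ARRAY FAILING_MEMBER NEW_MEMBER
# Add the new device as a spare, then ask md to rebuild onto it while the
# failing member stays in the array as a source for readable blocks.
replace_member() {
    array=$1
    failing=$2
    new=$3
    run mdadm "$array" --add "$new"
    run mdadm "$array" --replace "$failing" --with "$new"
}

# Example (names are assumptions):
#   replace_member /dev/md0 /dev/sda1 /dev/sdc1
```

Progress shows up in /proc/mdstat; once the copy finishes, the old member should be marked faulty and can be taken out with mdadm --remove.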

If you are serious about keeping your redundancy, then two new drives 
into a new RAID-1 and copy the data.

Drives are cheap. Backups are cheap. Data recovery is expensive. 

Agreed!

Regards,
Adam

--
Adam Goryachev Website Managers www.websitemanagers.com.au