Read error in superblock not handled well by MD

Hello all,

Recently, on one of our servers, we had what appeared to be a sporadic corruption during a write. It happened while the superblock and the bitmap were being written to one disk of an array.

This corruption left two consecutive 4k sectors (superblock, bitmap) unreadable on a disk which was otherwise good. The array was a raid1 with 2 disks. The disks are of model WDC WD60EFRX-68MYMN1. We only noticed the error thanks to SMART long tests, because MD/mdadm did not tell us anything.
Reading those sectors with dd confirmed the on-disk problem (read error).
Also, mdadm --examine and --examine-bitmap obviously could not read any valid data from there.
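
For anyone who wants to repeat the dd check from a script, here is a minimal Python sketch. It reads a range in 4 KiB blocks and collects the byte offsets whose read fails. The helper name find_unreadable_blocks and its parameters are made up for illustration; the device path and the starting offset are not given here and must be worked out for the disk in question. Note that a plain read like this may be served from the page cache rather than from the disk, so treat it as an approximation of the dd test.

```python
import os

BLOCK_SIZE = 4096  # 4 KiB blocks, matching the sector size mentioned above


def find_unreadable_blocks(path, start, count, block_size=BLOCK_SIZE):
    """Return the byte offsets of blocks whose read raises an OS error.

    Hypothetical helper: `path` is a device or file, `start` is a byte
    offset, and `count` is the number of blocks to probe.
    """
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for i in range(count):
            offset = start + i * block_size
            try:
                os.pread(fd, block_size, offset)
            except OSError:
                # An unreadable sector typically surfaces as an I/O error (EIO).
                bad.append(offset)
    finally:
        os.close(fd)
    return bad
```

On a healthy range the function returns an empty list; any offsets it does return are the blocks worth cross-checking against the SMART log.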

After this episode, MD didn't behave well IMHO.

During array checks the error was not reported, and the superblock and the bitmap on that disk were never rewritten. When the event count changed, the superblock was updated only on the other disk of the array; likewise, during writes to the array, the bitmap was updated only on the other disk. The array otherwise stayed up, but had we restarted the server, it would have come back with 1 disk only.

We waited days to see if the problem would resolve on its own but it wouldn't.
Then we went in and used dd to overwrite those two 4k sectors with zeroes.
The disk was good, so this solved the read error problem instantly and on the first attempt.
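
The overwrite step can be sketched in Python as follows. This is only an illustration of what the dd command did, under the assumption that the caller already knows the exact byte offset of the bad blocks; that offset is not stated here and the helper name zero_blocks is hypothetical. It is destructive, so it should only ever be pointed at blocks that are already unreadable.

```python
import os

BLOCK_SIZE = 4096  # 4 KiB blocks, matching the sector size mentioned above


def zero_blocks(path, start, count, block_size=BLOCK_SIZE):
    """Overwrite `count` blocks, starting at byte offset `start`, with zeroes.

    Hypothetical helper: the caller is responsible for choosing the
    right device and offset.
    """
    zeroes = bytes(block_size)
    fd = os.open(path, os.O_WRONLY)
    try:
        for i in range(count):
            offset = start + i * block_size
            written = os.pwrite(fd, zeroes, offset)
            if written != block_size:
                raise OSError(f"short write at offset {offset}")
        # Flush the writes out of the page cache to the underlying device.
        os.fsync(fd)
    finally:
        os.close(fd)
```

In our case this would have been a call with count=2, covering the two consecutive bad 4k sectors.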

After a very short time, less than 2 minutes, MD resumed rewriting those sectors, so we again had a good superblock and a good bitmap on the previously-bad disk.

So I suppose what MD does is this: before updating the superblock and/or the bitmap, MD tries to read those sectors. If it encounters a read error, it refrains from rewriting them; reading zeroes (a clearly invalid value), however, is apparently fine.

I'm not sure why the algorithm works like this, but it prevents a disk surface problem / read error in the superblock and/or bitmap areas from ever being fixed, and such errors are not fixed even during check/repair actions for the array.

I propose that MD should write those sectors without attempting to read them first.
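
To make the difference concrete, here is a toy Python model of the two policies. It is not MD code and says nothing about how the kernel is actually implemented; it only encodes my guess about the current behaviour (read first, skip the write on a read error) next to the proposed one (write unconditionally). ToyDisk, update_read_first and update_blind are names invented for this sketch, and the model assumes that a successful write makes a bad block readable again, as happened with our disk.

```python
class ToyDisk:
    """In-memory stand-in for a disk; blocks in `unreadable` fail on read."""

    def __init__(self, nblocks):
        self.blocks = [b"old"] * nblocks
        self.unreadable = set()

    def read(self, n):
        if n in self.unreadable:
            raise IOError(f"read error on block {n}")
        return self.blocks[n]

    def write(self, n, data):
        # Assumption: a successful write replaces the bad data, so the
        # block can be read again afterwards.
        self.blocks[n] = data
        self.unreadable.discard(n)


def update_read_first(disk, n, data):
    """Guessed current behaviour: give up on the update if the read fails."""
    try:
        disk.read(n)
    except IOError:
        return False
    disk.write(n, data)
    return True


def update_blind(disk, n, data):
    """Proposed behaviour: write the block without reading it first."""
    disk.write(n, data)
    return True
```

With a block marked unreadable, update_read_first keeps returning False and the block stays bad until something else writes to it (our dd), whereas update_blind repairs it on the first metadata update.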

Thank you
N.Br. (prefer not to be acknowledged for this bug report or fix)



