On Sat, 14 May 2022 at 15:46, Wols Lists <antlists@xxxxxxxxxxxxxxx> wrote:
> Correct. If the underlying disk returns an error, raid recovery kicks
> in. The missing block is calculated, returned to the caller and written
> back to the disk.

But in this case I would expect md to log something somewhere, not total silence.

> The error message is "critical medium error" - we have a real problem
> with the disk I suspect.
>
> FIRST run SMART on the disk and see what that reports. If that's not
> happy, REPLACE THE DRIVE PRONTO.
>
> If SMART is happy, run a raid scrub.

When this happens, I replace drives ASAP; it doesn't matter whether it's a transient failure or something similar. A working disk, for me, is a disk that NEVER reports any kind of issue. I usually replace disks even when there is a single recovered sector.

> And whatever, if you haven't replaced the drive, start monitoring SMART.
> If disk errors start climbing, that's a cause for concern and replacing
> the drive.

All disks are under SMART monitoring with both short and long tests (weekly), plus a weekly (or monthly? I don't remember) md consistency check.

Anyway, as our new servers have some free slots (we keep slots free intentionally), our replacements don't mean removing the old drive (losing part of the redundancy) and then adding a new one; we always use a replace:

mdadm /dev/md0 --add /dev/NEW --replace /dev/OLD --with /dev/NEW

It's MUCH safer, but what happens if /dev/OLD fails during the replacement? Will the rebuild be done transparently, reading from the other drives? And normally, are reads done from /dev/OLD in this case, or from the full array?
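
For reference, the replace procedure above can be sketched as a small shell helper. This is only an illustrative sketch: the function name `md_replace` and the `RUN` switch are hypothetical, the device paths are the same placeholders as in the command above, and the single command is split into an explicit add followed by a replace. By default it only prints the commands, so nothing touches the array unless `RUN=1` is set.

```shell
#!/bin/sh
# Hypothetical helper (not part of mdadm): print, or with RUN=1 execute,
# the two steps of an md hot-replace. Arguments: array, old member, new member.
md_replace() {
    array=$1
    old=$2
    new=$3
    run=${RUN:-0}   # dry-run unless RUN=1 is set by the caller

    for cmd in \
        "mdadm $array --add $new" \
        "mdadm $array --replace $old --with $new"
    do
        if [ "$run" = 1 ]; then
            # Unquoted on purpose so the string splits into words;
            # assumes device paths contain no whitespace.
            $cmd || return 1
        else
            echo "$cmd"
        fi
    done
}
```

Calling `md_replace /dev/md0 /dev/OLD /dev/NEW` prints the two mdadm commands for review; once a real replacement has been started, its progress can be followed in `/proc/mdstat`.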