md failing mechanism

Greetings,

Recently I had my first drive failure in a software RAID1 on a file server, and I was really surprised by what happened. I had always thought that when md can't complete a read from one of the drives, it marks that drive as faulty and reads from the other one; but, for some reason, it kept trying to read from the failing drive no matter what, which apparently made Samba wait for it to finish, so the whole server (I mean, the whole Samba) became inaccessible.


What I expected:
- A user tries to read a file via Samba.
- Samba issues a read request to md.
- md tries to read the file from one of the drives... the drive struggles with a bad sector... md thinks: okay, this is taking too long, production is not waiting; I'll just read from the other drive instead. (The timeouts involved in that "taking too long" decision are sketched right after this list.)
- It reads from another drive successfully, and users continue their work.
- Finally, the "bad" drive gives up on trying to read the bad sector and returns an error. md marks the drive as faulty and sends an email telling me to replace the drive as soon as possible.
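In case it matters, here is roughly how I would inspect the two timeouts behind that "taking too long" decision: the kernel's SCSI command timer on one side, and the drive's own error recovery limit (SCT ERC) on the other. This is only a sketch; the device name is a placeholder, smartctl has to be installed, and not every drive supports SCT ERC at all.

#!/usr/bin/env python3
# Sketch: inspect the two timeouts behind the "this is taking too long" step.
# Assumptions: a SATA disk exposed as /dev/sda (placeholder name), smartctl
# installed, and a drive that actually reports SCT ERC (many desktop drives
# do not, in which case internal error recovery can run for a long time).
import subprocess
from pathlib import Path

DISK = "sda"  # placeholder device name

# Kernel-side SCSI command timer for this disk, in seconds (default is 30).
timeout = Path(f"/sys/block/{DISK}/device/timeout").read_text().strip()
print(f"SCSI command timeout for {DISK}: {timeout} s")

# Drive-side error recovery limit (SCT ERC), if the drive reports one.
result = subprocess.run(["smartctl", "-l", "scterc", f"/dev/{DISK}"],
                        capture_output=True, text=True)
print(result.stdout)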


What happened instead:
- A user tries to read a file via Samba.
- Samba issues a read request to md.
- md tries to read the file from one of the drives... the drive struggles with a bad sector... Samba waits for md, md waits for the drive, and the drive keeps retrying that blasted sector as if its life depended on it, while users see that the network share has stopped responding entirely.

This goes on forever, until the users call me; I come to investigate, find Samba down and a pile of errors in dmesg, and manually mark the drive as faulty.
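For the record, the manual intervention was essentially the following; a rough sketch with placeholder array and member names, not the exact commands I ran.

#!/usr/bin/env python3
# Sketch of the manual recovery step: mark the struggling member faulty and
# drop it from the array. /dev/md0 and /dev/sdb1 are placeholder names.
import subprocess

ARRAY = "/dev/md0"
MEMBER = "/dev/sdb1"

subprocess.run(["mdadm", ARRAY, "--fail", MEMBER], check=True)
subprocess.run(["mdadm", ARRAY, "--remove", MEMBER], check=True)

# Confirm the array is degraded but still serving I/O from the good disk.
subprocess.run(["mdadm", "--detail", ARRAY], check=True)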


Now, this happened a while ago, and that server was not running the most recent kernel (I think it was 3.2 from Debian Wheezy, or something a little newer from the backports). I also can't easily reproduce it on a new server, because I have no good way to build a working RAID1, write data to it, and then deliberately damage some sectors to see what happens. So I just want to ask: is this really how it works? Was this supposed to happen? I thought the main point of RAID1 was to avoid downtime, especially in cases like this. Or is it perhaps a known issue that has been fixed in more recent versions, so I should just update my kernels and expect different behaviour next time?
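(If I were to try reproducing it anyway, the closest I could get without scrap hardware would be a throwaway RAID1 built on loop devices, roughly like the sketch below. Every path and device name here is made up, and failing a member with mdadm only exercises the clean-failure path, not a drive hanging on a bad sector, which is the case I actually hit.)

#!/usr/bin/env python3
# Sketch: build a scratch RAID1 out of two loop devices to poke at md's
# failure handling. All paths and devices here are placeholders; run as
# root on a machine you do not care about. Note that "mdadm --fail" only
# simulates a clean failure, not a drive stuck retrying a bad sector.
import subprocess

def run(*cmd):
    print("+", " ".join(cmd))
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# Two 100 MB backing files attached to loop devices.
loops = []
for i in (0, 1):
    img = f"/tmp/raid1-test-{i}.img"
    run("truncate", "-s", "100M", img)
    loops.append(run("losetup", "--find", "--show", img).strip())

# Assemble them into a scratch RAID1 array.
run("mdadm", "--create", "/dev/md9", "--level=1", "--raid-devices=2",
    "--run", *loops)

# ... write some data to /dev/md9, then knock one member out and watch
# /proc/mdstat and dmesg while reads continue from the survivor.
run("mdadm", "/dev/md9", "--fail", loops[0])
print(open("/proc/mdstat").read())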


--
darkpenguin