On Tue, 25 Feb 2014 07:39:14 +1100 Eyal Lebedinsky <eyal@xxxxxxxxxxxxxx> wrote:

> My main interest is to understand why 'check' does not actually check.
> I already know how to fix the problem, by writing to the location I
> can force the pending reallocation to happen, but then I will not have
> the test case anymore.
>
> The OP asks for a specific solution, but I think that the 'check' action
> should already correctly rewrite failed (i/o error) sectors. It does not
> always know which sector to rewrite when it finds a raid6 mismatch
> without an i/o error (with raid5 it never knows).
>

I cannot reproduce the problem. In my testing a read error is fixed by
'check'. For you it clearly isn't. I wonder what is different.

During normal 'check' or 'repair' etc the read requests are allowed to be
combined by the io scheduler, so when we get a read error, it could be one
error for a megabyte or more of the address space. So the first thing
raid5.c does is arrange to read all the blocks again but to prohibit the
merging of requests. This time any read error will be for a single 4K block.

Once we have that reliable read error the data is constructed from the other
blocks and the new block is written out.

This suggests that when there is a read error you should see e.g.

  [  714.808494] end_request: I/O error, dev sds, sector 8141872

then shortly after that another similar error, possibly with a slightly
different sector number (at most a few thousand sectors later).

Then something like

  md/raid:md0: read error corrected (8 sectors at 8141872 on sds)

However in the log Mikael Abrahamsson posted on 16 Jan 2014
(Subject: Re: read errors not corrected when doing check on RAID6)
we only see that first 'end_request' message. No second one and no
"read error corrected".

This seems to suggest that the second read succeeded, which is odd (to say
the least).
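
To check a log for that pattern, something like the following Python sketch
might help. It only assumes the two message formats quoted above. The name
`uncorrected_errors` and the default window of 4096 sectors (standing in for
"a few thousand sectors") are illustrative choices, not anything taken from
md. It also assumes the array member is the whole disk, so that the device
name and sector number in both messages are directly comparable.

```python
import re

# Patterns assume the two kernel message formats quoted in this mail.
IO_ERR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
FIXED = re.compile(
    r"md/raid:(\w+): read error corrected \((\d+) sectors at (\d+) on (\w+)\)"
)

def uncorrected_errors(lines, window=4096):
    """Return (dev, sector) for each I/O error that has no later
    'read error corrected' message on the same device within `window`
    sectors. The window size is an assumed value."""
    errors = []  # (line index, device, sector)
    fixes = []   # (line index, device, sector)
    for i, line in enumerate(lines):
        m = IO_ERR.search(line)
        if m:
            errors.append((i, m.group(1), int(m.group(2))))
            continue
        m = FIXED.search(line)
        if m:
            fixes.append((i, m.group(4), int(m.group(3))))

    result = []
    for i, dev, sector in errors:
        # A correction counts only if it appears after the error in the log.
        corrected = any(
            j > i and fdev == dev and abs(fsec - sector) <= window
            for j, fdev, fsec in fixes
        )
        if not corrected:
            result.append((dev, sector))
    return result
```

Fed the lines of a saved kernel log, it returns the errors that were never
followed by a correction, which is the case seen in the log mentioned above.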

In your log posted 21 Feb 2014
(Subject: raid 'check' does not provoke expected i/o error)
there aren't even any read errors during 'check'.
The drive sometimes reports a read error and sometimes doesn't?
Does reading the drive with 'dd' always report an error, and with 'check'
never report an error?

So I'm a bit stumped. It looks like md is doing the right thing, but maybe
the drive is getting confused. Are all the people who report this using the
same sort of drive?

NeilBrown