On Tue, 15 Nov 2011 10:11:51 +1100 linbloke <linbloke@xxxxxxxxxxx> wrote:

> Hello,
>
> Sorry for bumping this thread but I couldn't find any resolution
> post-dated. I'm seeing the same thing with SLES11 SP1. No matter how
> long I wait or how often I sync(8), the number of dirty bitmap pages
> does not reduce to zero - 52 has become the new zero for this array
> (md101). I've tried writing more data to prod the sync - the result was
> an increase in the dirty page count (53/465) and then return to the base
> count (52/465) after 5 seconds. I haven't tried removing the bitmaps and
> am a little reluctant to unless this would help to diagnose the bug.
>
> This array is part of a nested array set as mentioned in another mail
> list thread with the Subject: Rotating RAID 1. Another thing happening
> with this array is that the top array (md106), the one with the
> filesystem on it, has the file system exported via NFS to a dozen or so
> other systems. There has been no activity on this array for at least a
> couple of minutes.
>
> I certainly don't feel comfortable that I have created a mirror of the
> component devices. Can I expect the devices to actually be in sync at
> this point?

Hi,
 thanks for the report.

I can understand your discomfort. Unfortunately I haven't been able to
discover with any confidence what the problem is, so I cannot completely
relieve that discomfort.

I have found another possible issue - a race that could cause md to forget
that it needs to clean out a page of the bitmap. I could imagine that
causing 1 or maybe 2 pages to be stuck, but I don't think it can explain 52.

You can check if you actually have a mirror by:

   echo check > /sys/block/md101/md/sync_action

then wait for that to finish and check ..../mismatch_cnt.
I'm quite confident that will report 0. I strongly suspect the problem is
that we forget to clear pages or bits, not that we forget to use them during
recovery.
So I don't think that keeping the bitmaps will help in diagnosing the problem.
What I need is a sequence of events that is likely to produce the problem,
and I realise that is hard to come by.

Sorry that I cannot be more helpful yet.

NeilBrown
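
For reference, the check described above (write "check" to sync_action, wait for it to finish, then read mismatch_cnt) could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names are made up for illustration, the directory argument is assumed to be the array's md sysfs directory, and mismatch_cnt is assumed to sit in that same directory next to sync_action. The poll interval is arbitrary.

```shell
#!/bin/sh
# Sketch only. "$1" is assumed to be the array's md sysfs directory
# (the one that contains sync_action); function names are hypothetical.

# Ask md to start a consistency check of the array (needs root on a real array).
md_start_check() {
    echo check > "$1/sync_action"
}

# Poll sync_action until it reads "idle", i.e. the check has finished.
# "$2" is an optional poll interval in seconds (default 10).
md_wait_idle() {
    while [ "$(cat "$1/sync_action")" != "idle" ]; do
        sleep "${2:-10}"
    done
}

# Print the mismatch counter left by the last check.
# Assumption: mismatch_cnt lives beside sync_action in the same directory.
md_mismatch_count() {
    cat "$1/mismatch_cnt"
}
```

On a real system one would call the three functions in order on the same directory, e.g. md_start_check, then md_wait_idle, then md_mismatch_count; a printed value of 0 would match the expectation stated above.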