Re: possible bug - bitmap dirty pages status

NeilBrown <neilb@xxxxxxx> · Wed, 16 Nov 2011 13:30:45 +1100

On Tue, 15 Nov 2011 10:11:51 +1100 linbloke <linbloke@xxxxxxxxxxx> wrote:
> Hello,
> 
> Sorry for bumping this thread but I couldn't find any resolution 
> post-dated. I'm seeing the same thing with SLES11 SP1. No matter how 
> long I wait or how often I sync(8), the number of dirty bitmap pages 
> does not reduce to zero - 52 has become the new zero for this array 
> (md101). I've tried writing more data to prod the sync - the result was 
> an increase in the dirty page count (53/465) and then return to the base 
> count (52/465) after 5 seconds. I haven't tried removing the bitmaps and 
> am a little reluctant to unless this would help to diagnose the bug.
> 
> This array is part of a nested array set as mentioned in another mail 
> list thread with the Subject: Rotating RAID 1. Another thing happening 
> with this array is that the top array (md106), the one with the 
> filesystem on it, has the file system exported via NFS to a dozen or so 
> other systems. There has been no activity on this array for at least a 
> couple of minutes.
> 
> I certainly don't feel comfortable that I have created a mirror of the 
> component devices. Can I expect the devices to actually be in sync at 
> this point?

Hi,
 thanks for the report.
 I can understand your discomfort.  Unfortunately I haven't been able to
 discover with any confidence what the problem is, so I cannot completely
 relieve that discomfort.  I have found another possible issue - a race that
 could cause md to forget that it needs to clean out a page of the bitmap.
 I could imagine that causing 1 or maybe 2 pages to be stuck, but I don't
 think it can explain 52.

 You can check if you actually have a mirror by:
    echo check > /sys/block/md101/md/sync_action
 then wait for that to finish and check ..../mismatch_cnt.
 I'm quite confident that will report 0.  I strongly suspect the problem is
 that we forget to clear pages or bits, not that we forget to use them during
 recovery.
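
 The steps above could be scripted roughly as follows. This is only a
 sketch: the function names and the MD_DIR and MD_POLL variables are made
 up for illustration, and it assumes sync_action reads back "idle" once the
 check has finished. The sync_action and mismatch_cnt files are the ones
 named above, under the md directory for md101.

```shell
#!/bin/sh
# Sketch only: function and variable names here are hypothetical.
# MD_DIR is the sysfs directory holding the md attributes; md101 is the
# array discussed in this thread.
MD_DIR="${MD_DIR:-/sys/block/md101/md}"

# Ask md to start a check pass over the array (needs root).
md_start_check() {
    echo check > "$MD_DIR/sync_action"
}

# Poll sync_action until no sync action is in progress.
# Assumes the file reads back "idle" when the check has finished.
md_wait_idle() {
    while [ "$(cat "$MD_DIR/sync_action")" != "idle" ]; do
        sleep "${MD_POLL:-10}"
    done
}

# Print the mismatch count left behind by the last check.
md_mismatch_cnt() {
    cat "$MD_DIR/mismatch_cnt"
}

# Usage (as root):
#   md_start_check && md_wait_idle && md_mismatch_cnt
```

 A result of 0 from the last step would mean the check found no
 differences between the component devices.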

 So I don't think that keeping the bitmaps will help in diagnosing the
 problem.  What I need is a sequence of events that is likely to produce the
 problem, and I realise that is hard to come by.

 Sorry that I cannot be more helpful yet.

NeilBrown
