> -----Original Message-----
> From: NeilBrown [mailto:neilb@xxxxxxx]
> Sent: Wednesday, December 07, 2011 2:53 AM
> To: Kwolek, Adam
> Cc: linux-raid@xxxxxxxxxxxxxxx; Ciechanowski, Ed; Labun, Marcin; Williams, Dan J
> Subject: Re: [PATCH] md: Add ability for disable bad block management
>
> On Tue, 6 Dec 2011 13:02:21 +0000 "Kwolek, Adam" <adam.kwolek@xxxxxxxxx> wrote:
>
> > > -----Original Message-----
> > > From: NeilBrown [mailto:neilb@xxxxxxx]
> > > Sent: Tuesday, December 06, 2011 7:05 AM
> > > To: Kwolek, Adam
> > > Cc: linux-raid@xxxxxxxxxxxxxxx; Ciechanowski, Ed; Labun, Marcin; Williams, Dan J
> > > Subject: Re: [PATCH] md: Add ability for disable bad block management
> > >
> > > On Wed, 30 Nov 2011 08:17:32 +0000 "Kwolek, Adam" <adam.kwolek@xxxxxxxxx> wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: NeilBrown [mailto:neilb@xxxxxxx]
> > > > > Sent: Wednesday, November 30, 2011 1:14 AM
> > > > > To: Kwolek, Adam
> > > > > Cc: linux-raid@xxxxxxxxxxxxxxx; Ciechanowski, Ed; Labun, Marcin; Williams, Dan J
> > > > > Subject: Re: [PATCH] md: Add ability for disable bad block management
> > > > >
> > > > > On Thu, 24 Nov 2011 13:19:53 +0100 Adam Kwolek <adam.kwolek@xxxxxxxxx> wrote:
> > > > >
> > > > > > When external metadata doesn't support BBM, mdadm cannot answer
> > > > > > BBM requests correctly. This causes the reshape process to stop.
> > > > > >
> > > > > > Add the ability for external metadata (mdadm) to disable BBM via
> > > > > > sysfs. md will then ignore bad blocks, as it does for metadata v0.90.
> > > > >
> > > > > This should not be necessary.
> > > > >
> > > > > The intention is that a device with a bad block looks exactly like a
> > > > > failed device, i.e. 'faulty' and 'blocked' appear in the 'state' file.
> > > > >
> > > > > If the metadata doesn't support a bad-block list, it will record that
> > > > > the device has failed and will unblock the device. At that point the
> > > > > failure is forced.
> > > > > If the metadata does support a bad-block list, it will just record
> > > > > the bad blocks and acknowledge them, and then unblock the device. At
> > > > > that point the device won't be failed, the 'faulty' state will
> > > > > disappear, and the device will continue to be used with the known
> > > > > bad blocks.
> > > > >
> > > > > What exactly is going wrong that makes you think you need this patch?
> > > >
> > > > When degradation occurs during migration, BBM is signaled to mdmon,
> > > > and mdmon (monitor.c) tries to mark the disk '-blocked'.
> > > > This operation fails. Mdmon goes into a loop, and nothing can be done
> > > > (I cannot do it using sysfs) to signal or remove the device.
> > > > In sysfs the device is present in /sys/block/mdXXX/md, but the entry
> > > > /sys/block/mdXXX/md/dev-sdX/block is missing (the disk was pulled out).
> > >
> > > I've found a couple of issues. I'm not sure if they completely explain
> > > what you are seeing. Could you please test with these fixes and tell me
> > > the results?
> > >
> > > Firstly, I find that writing "-blocked" succeeds (no error returned) but
> > > the "blocked" flag does not get cleared, which is certainly confusing.
> > > This is fixed by:
> > >
> > > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > > index 4adcbb4..7258dc1 100644
> > > --- a/drivers/md/md.c
> > > +++ b/drivers/md/md.c
> > > @@ -2562,7 +2562,8 @@ state_show(struct md_rdev *rdev, char *page)
> > >  		sep = ",";
> > >  	}
> > >  	if (test_bit(Blocked, &rdev->flags) ||
> > > -	    rdev->badblocks.unacked_exist) {
> > > +	    (rdev->badblocks.unacked_exist
> > > +	     && !test_bit(Faulty, &rdev->flags))) {
> > >  		len += sprintf(page+len, "%sblocked", sep);
> > >  		sep = ",";
> > >  	}
> > >
> > > Secondly, mdmon writes "-blocked" even when the "blocked" flag is not
> > > set. This succeeds, so state_store() calls
> > >     sysfs_notify_dirent_safe(rdev->sysfs_state);
> > > so mdmon/monitor.c is woken up to go around the loop again, writes
> > > "-blocked" again, and so it continues in a loop.
> > >
> > > This is fixed by:
> > >
> > > diff --git a/monitor.c b/monitor.c
> > > index b002e90..29bde18 100644
> > > --- a/monitor.c
> > > +++ b/monitor.c
> > > @@ -339,7 +339,8 @@ static int read_and_act(struct active_array *a)
> > >  			a->container->ss->set_disk(a, mdi->disk.raid_disk,
> > >  						   mdi->curr_state);
> > >  			check_degraded = 1;
> > > -			mdi->next_state |= DS_UNBLOCK;
> > > +			if (mdi->curr_state & DS_BLOCKED)
> > > +				mdi->next_state |= DS_UNBLOCK;
> > >  			if (a->curr_state == read_auto) {
> > >  				a->container->ss->set_array_state(a, 0);
> > >  				a->next_state = active;
> > >
> > > Finally, when a bad block is added to the list we don't currently notify
> > > rdev->sysfs_state, so mdmon doesn't notice straight away and is delayed
> > > in taking action. It will only notice when a write blocks.
> > >
> > > This is fixed by:
> > >
> > > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > > index 4adcbb4..9cc7983 100644
> > > --- a/drivers/md/md.c
> > > +++ b/drivers/md/md.c
> > > @@ -7940,6 +7941,7 @@ int rdev_set_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
> > >  			      s + rdev->data_offset, sectors, acknowledged);
> > >  	if (rv) {
> > >  		/* Make sure they get written out promptly */
> > > +		sysfs_notify_dirent_safe(rdev->sysfs_state);
> > >  		set_bit(MD_CHANGE_CLEAN, &rdev->mddev->flags);
> > >  		md_wakeup_thread(rdev->mddev->thread);
> > >  	}
> > >
> > > With these 3 changes in place I get substantially improved behaviour on
> > > my simple test (just doing resync, not reshape).
> > >
> > > Thanks,
> > > NeilBrown
> >
> > I've applied those changes and:
> > 1. Migration:
> >    a) With BBM additionally disabled, reshape continues after degradation
> >       and performance does not drop (without your patches, performance was
> >       poor and mdmon went into a "crazy" run).
> >    b) With BBM enabled (without my change), metadata is updated correctly
> >       and md stops. mdstat shows that reshape is in progress, but it is
> >       not moving forward.
> > 2. Rebuild:
> >    a) With BBM additionally disabled, rebuild is stopped correctly in md
> >       and metadata just after degradation (I have a few additional
> >       corrections for metadata rebuild finalization; I'll post them
> >       shortly).
> >    b) With BBM enabled (without my change), metadata is updated correctly
> >       and md stops. mdstat shows that rebuild is in progress, but it is
> >       not moving forward.
> >
> > It seems that those changes fix the reshape performance drop after
> > degradation and the "crazy" mdmon run.
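[An aside for readers following the thread: the "blocked"/"-blocked"
handshake discussed above is ordinary sysfs I/O on the member device's
'state' attribute. Below is a minimal userspace sketch of the pattern the
monitor.c fix enforces, i.e. only write "-blocked" when "blocked" is
actually reported. The md126/dev-sdd path is hypothetical and this is not
mdmon's actual code.]

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* Hypothetical member device; adjust to your array. */
	const char *path = "/sys/block/md126/md/dev-sdd/state";
	char state[256] = "";
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (!fgets(state, sizeof(state), f)) {
		fclose(f);
		return 1;
	}
	fclose(f);

	/* 'state' is a comma-separated flag list, e.g. "faulty,blocked".
	 * Attempt the unblock only when the flag is actually set,
	 * mirroring the DS_BLOCKED check added to read_and_act() above. */
	if (strstr(state, "blocked") != NULL) {
		f = fopen(path, "w");
		if (!f) {
			perror(path);
			return 1;
		}
		if (fputs("-blocked", f) == EOF)
			perror(path);
		fclose(f);
	}
	return 0;
}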
> > Without disabling BBM, md_do_sync() still doesn't finish on degradation
> > during reshape and rebuild. This causes the process to stop.
> > The last information from md is printed from md_error(), and it probably
> > waits on BBM confirmation.
> >
> > What may be different in my tests is that I physically pull disks out to
> > degrade the raid (I'm not using sysfs to do this). After this, the rdev
> > link in the md device is invalid.
> >
> > Please let me know if you want any additional tests made by me (any
> > specific logs?).
>
> I cannot reproduce this.
> I didn't physically remove devices, but I used
>     echo 1 > /sys/block/sdc/device/delete
> which should be nearly identical from the perspective of md and mdadm.

I've checked that when I delete the device using sysfs, everything works
perfectly. When the device is pulled out, reshape stops in md/mdstat.

> If you could give me the exact set of steps that you follow to produce
> the problem that would help - maybe a script? Just a description is OK.

#used disks sdb, sdc, sdd, sde
export IMSM_NO_PLATFORM=1
#create container
mdadm -C /dev/md/imsm0 -amd -e imsm -n 3 /dev/sdb /dev/sdc /dev/sde -R
#create volume
mdadm -C /dev/md/raid5vol_0 -amd -l 5 --chunk 32 --size 104850 -n 3 /dev/sdb /dev/sdc /dev/sde -R
#add spare
mdadm --add /dev/md/imsm0 /dev/sdd
#run OLCE
mdadm --grow /dev/md/imsm0 --raid-devices 4
#when reshape starts, I'm (physically) pulling the device out

> Also you say it is blocking in md_do_sync. Is that at the
>
>     wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active));
>
> call just after the "out:" label?

At neither of those two places. It enters the sync_request() function, and
md_error() is called. More is visible in the thread stack information below
(md_wait_for_blocked_rdev()).

> What is the raid thread doing at this point?
>     cat /proc/PID/stack
> might help.

[md126_raid5]
[<ffffffff8121d843>] md_wait_for_blocked_rdev+0xbc/0x10f
[<ffffffffa01d87ce>] handle_stripe+0x1c5c/0x2c99 [raid456]
[<ffffffffa01d9d0d>] raid5d+0x502/0x564 [raid456]
[<ffffffff8121eca5>] md_thread+0x101/0x11f
[<ffffffff81049e0e>] kthread+0x81/0x89
[<ffffffff812cc4f4>] kernel_thread_helper+0x4/0x10
[<ffffffffffffffff>] 0xffffffffffffffff

[md126_reshape]
[<ffffffffa02455a2>] sync_request+0x90a/0xbfb [raid456]
[<ffffffff8121e151>] md_do_sync+0x7aa/0xc40
[<ffffffff8121ecb3>] md_thread+0x101/0x11f
[<ffffffff81049e0e>] kthread+0x81/0x89
[<ffffffff812cc4f4>] kernel_thread_helper+0x4/0x10
[<ffffffffffffffff>] 0xffffffffffffffff

> What are the contents of all the sysfs files?
>     grep . /sys/block/mdXXX/md/*

array_state      -> active
degraded         -> 1
max_read_errors  -> 20
reshape_position -> 12288
resync_start     -> none
sync_completed   -> 4096 / 209664

>     grep . /sys/block/mdXXX/md/dev-*/*

When the removed disk is sdd, /sys/block/mdXXX/md/dev-sdd/* shows:

bad_blocks                -> 4096 512
                             4608 128
                             4736 384
block                     -> MISSING (link is not valid)
errors                    -> 0
offset                    -> 0
recovery_start            -> 4096
size                      -> 104832
slot                      -> 3
state                     -> faulty,write_error
unacknowledged_bad_blocks -> 4096 512
                             4608 128
                             4736 384

I hope this helps.

BR
Adam

> Thanks,
> NeilBrown
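[One more aside on the dump above: the bad_blocks and
unacknowledged_bad_blocks attributes list one range per line as a
"start length" pair, in units of 512-byte sectors, so "4096 512" means
sectors 4096 through 4607. A minimal sketch that parses them follows; the
path is hypothetical and this is not part of mdadm.]

#include <stdio.h>

int main(void)
{
	/* Hypothetical path; adjust to your array and member device. */
	const char *path = "/sys/block/md126/md/dev-sdd/bad_blocks";
	unsigned long long start;
	int len;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	/* Each line: "<first-bad-sector> <number-of-sectors>" */
	while (fscanf(f, "%llu %d", &start, &len) == 2)
		printf("bad range: sectors %llu-%llu (%d sectors)\n",
		       start, start + len - 1, len);
	fclose(f);
	return 0;
}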