RE: [PATCH 7/8] md: skip resync for raid array with journal

Neil Brown <neilb@xxxxxxx> · Fri, 02 Oct 2015 09:50:40 +1000

Song Liu <songliubraving@xxxxxx> writes:

>> -----Original Message-----
>> From: Neil Brown [mailto:neilb@xxxxxxx]
>> Sent: Tuesday, September 29, 2015 9:25 PM
>> To: Shaohua Li; linux-raid@xxxxxxxxxxxxxxx
>> Cc: Kernel Team; Song Liu; hch@xxxxxxxxxxxxx; dan.j.williams@xxxxxxxxx
>> Subject: Re: [PATCH 7/8] md: skip resync for raid array with journal
>> 
>> Shaohua Li <shli@xxxxxx> writes:
>> 
>> > If a raid array has journal, the journal can guarantee the
>> > consistency, we can skip resync after a unclean shutdown. The
>> > exception is raid creation or user initiated resync, which we still do a raid
>> resync.
>> >
>> > Signed-off-by: Shaohua Li <shli@xxxxxx>
>> > ---
>> >  drivers/md/md.c | 4 ++++
>> >  drivers/md/md.h | 1 +
>> >  2 files changed, 5 insertions(+)
>> >
>> > diff --git a/drivers/md/md.c b/drivers/md/md.c index b3f9eed..95824fb
>> > 100644
>> > --- a/drivers/md/md.c
>> > +++ b/drivers/md/md.c
>> > @@ -1669,6 +1669,8 @@ static int super_1_validate(struct mddev *mddev,
>> struct md_rdev *rdev)
>> >  			}
>> >  			set_bit(Journal, &rdev->flags);
>> >  			rdev->journal_tail = le64_to_cpu(sb->journal_tail);
>> > +			if (mddev->recovery_cp == MaxSector)
>> > +				set_bit(MD_JOURNAL_CLEAN, &mddev-
>> >flags);
>> >  			break;
>> >  		default:
>> >  			rdev->saved_raid_disk = role;
>> > @@ -1711,6 +1713,8 @@ static void super_1_sync(struct mddev *mddev,
>> struct md_rdev *rdev)
>> >  	sb->events = cpu_to_le64(mddev->events);
>> >  	if (mddev->in_sync)
>> >  		sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
>> > +	else if (test_bit(MD_JOURNAL_CLEAN, &mddev->flags))
>> > +		sb->resync_offset = cpu_to_le64(MaxSector);
>> >  	else
>> >  		sb->resync_offset = cpu_to_le64(0);
>> >
>> > diff --git a/drivers/md/md.h b/drivers/md/md.h index 226f4ba..0288a0b
>> > 100644
>> > --- a/drivers/md/md.h
>> > +++ b/drivers/md/md.h
>> > @@ -236,6 +236,7 @@ struct mddev {
>> >  #define MD_STILL_CLOSED	4	/* If set, then array has not been
>> opened since
>> >  				 * md_ioctl checked on it.
>> >  				 */
>> > +#define MD_JOURNAL_CLEAN 5	/* A raid with journal is already clean
>> */
>> >
>> >  	int				suspended;
>> >  	atomic_t			active_io;
>> > --
>> > 1.8.1
>> 
>> This looks right as far as it goes, but I don't think it goes far enough.
>> The particular scenario that bothers me is if the array is started without the
>> journal being present.
>> I cannot see anything to prevent that - is there?
>
> I have sent mdadm patches that prevent the array to assemble with missing 
> journal, for both --assemble and --incremental:
>
> http://marc.info/?l=linux-raid&m=144080445720123
> http://marc.info/?l=linux-raid&m=144080446120127 
>

That is probably good and useful, but the kernel cannot rely on mdadm to
protect it.  If the kernel is asked to start a journalled array when no
journal is present, it must do something that is safe.
That might be to refuse, it might be to force a resync, it might be to
somehow "know" if the journal is empty and only force a resync in that
case.

This should be treated much like a "dirty/degraded" start.  Currently
the kernel refuses to do that unless a module parameter is set.  When
mdadm is asked to --force start a dirty/degraded array, it modifies the
metadata so that it doesn't appear to be degraded.  Something similar
might be best for the missing-journal case.

I *think* the current kernel code will allow the array to be started
without a resync even if the journal was not empty, and that is not
safe.

> We can force start in both cases with --force or --run. 
>
> Currently, mdadm doesn't clear MD_FEATURE_JOURNAL bit, so we get warning 
> every time. 
>
> I did miss one step: to mark the missing journal as fault. 
>
>> 
>> In that case we need to assume that the array is not in-sync, and we need to
>> clear MD_FEATURE_JOURNAL so if it gets stopped and then assembled again
>> with the stale journal doesn't get used.
>> 
>> One unfortunate side effect of that is that you couldn't stop the array cleanly
>> (leaving the journal effectively empty) and then restart with no journal and
>> no resync.  Is that a problem I wonder?
>> 
>> I'm not sure what the best solution is here, but we need a clear
>> understanding of what happens if you try to assemble an array without the
>> journal where previously it had one, and I don't think the current code gets it
>> right.
>
> I just talked with Shaohua. We think we can probably do the following:
>
> 1. When assemble with cache device missing, show the warning (with --force 
> option). Run resync if previous shutdown is not clean (this is the default
> behavior.

Probably sensible.  But I think the current code never marks the array as
"not clean" when there is a journal.

>
> 2. When we get I/O error on the journal device at run time, make the whole
> array as read only.

Yes, I think that is best.

>
> 3. Add new operation that adds new journal device to an array with missing 
> journal. We can limit this option to inactive array only.

It would be great if we could add a journal to an active array...

>
> Would this follow cover all the cases?

Quite possibly.  I don't want to commit myself until I see the code :-)

Thanks,
NeilBrown
Attachment:
signature.asc

Description: PGP signature