> On Oct 13, 2016, at 11:13 PM, NeilBrown <neilb@xxxxxxxx> wrote:
>
> On Thu, Oct 13 2016, Song Liu wrote:
>
>>
>> /*
>>  * raid5 cache state machine
>>  *
>>  * The RAID cache works in two major states for each stripe: write state
>>  * and reclaim state. These states are controlled by flags STRIPE_R5C_FROZEN
>>  * and STRIPE_R5C_WRITTEN
>
> Hi,
> It would really help at this point to understand exactly what it is
> that is "FROZEN".  I guess the contents of the stripe_head are frozen?
> so that nothing changes between writing to the journal and writing to
> the RAID?  Making that explicit would help.

I guess I (wrongly) borrowed the "frozen" name from Shaohua's earlier
patch, where frozen means "no more writes until flush done". In this
implementation, STRIPE_R5C_FROZEN differentiates the write state from
the reclaim state (reclaim may not be a good name here, though).

In write state, data writes are handled as:

    write to journal, return IO
    write to journal, return IO
    etc.

In reclaim state, the state machine works as:

    calculate parity
    write parity (maybe also some data) to journal
    write data and parity to raid disks (maybe)
    return IO

The reclaim state is very similar to the current write-journal (or
write-through cache) behavior, while the write state only applies to the
write-back cache.

This particular patch does NOT add the detailed logic of the write-back
cache, so the stripe enters reclaim state (sets STRIPE_R5C_FROZEN)
immediately on every write (in handle_stripe_dirtying).

> Only... the stripe_head gets "frozen" before that.  It gets frozen
> before the parity is calculated, and not released until the data and
> parity is written to the RAID.  This new "FROZEN" state is a subset of
> that - yes?

Yes, the stripe gets frozen before parity is calculated, and released
after data and parity are written to the RAID. The timeline is like:

    write data to journal, return IO
    write data to journal, return IO
    ...
    frozen
    calculate parity
    write parity to journal
    write data and parity to RAID
    release (or un-freeze, or melt)

> Given that STRIPE_R5C_FROZEN is set when the stripe is in "reclaim
> state", I wonder why you didn't call it "STRIPE_R5C_RECLAIM"
> .... though I'm not sure "reclaim" is a good word.  Reclaim is
> something that happens to the journal after the data has been written
> to the RAID.

Reclaim might be a good name here. It is the flag indicating that the
state machine is working towards "write all data and parity to RAID".

>>
>> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
>> index 7557791b..9e05850 100644
>> --- a/drivers/md/raid5-cache.c
>> +++ b/drivers/md/raid5-cache.c
>> @@ -40,6 +40,47 @@
>>   */
>>  #define R5L_POOL_SIZE 4
>>
>> +enum r5c_state {
>> +	R5C_STATE_NO_CACHE = 0,
>> +	R5C_STATE_WRITE_THROUGH = 1,
>> +	R5C_STATE_WRITE_BACK = 2,
>> +	R5C_STATE_CACHE_BROKEN = 3,
>
> How is CACHE_BROKEN different from NO_CACHE??
> Maybe a later patch will tell me, but I'd rather know now :-)

Yeah, a later patch (the sysfs entry r5c_state) has more details. In the
current write-journal implementation, we force the array read-only when
the journal device is broken. The array stays read-only even after
re-assembly, because it still has the JOURNAL bit in its feature map.
That is CACHE_BROKEN.

> As this state is stored in the r5l_log, and as we don't allocate that
> when there is no cache, why have a NO_CACHE state?

We only store "write through" or "write back" in r5l_log. However, the
r5c_state sysfs entry has two functions:
1. switching between "write through" and "write back";
2. taking the array from CACHE_BROKEN to NO_CACHE (removing the JOURNAL
   bit from the feature map).

>>
>> +/*
>> + * Freeze the stripe, thus send the stripe into reclaim path.
>> + *
>> + * In current implementation, STRIPE_R5C_FROZEN is also set in write through
>> + * mode (in r5c_handle_stripe_dirtying). This does not change the behavior
>> + * of write through mode.
>> + */
>> +void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh)
>
> "freeze_stripe_for_reclaim" seems an odd name.  How does "reclaim" fit?
> You are freezing the stripe so it can be written to the journal, and
> then eventually to the RAID, are you not?
> Or are you freezing it so the space used in the journal cannot be
> reclaimed?
>
> Maybe I misunderstand the "FROZEN" concept still.

It is more like the stripe is "frozen" so we can kick off the reclaim
process.

>
>> +{
>> +	struct r5conf *conf = sh->raid_conf;
>> +	struct r5l_log *log = conf->log;
>> +
>> +	if (!log)
>> +		return;
>> +	WARN_ON(test_bit(STRIPE_R5C_FROZEN, &sh->state));
>> +	set_bit(STRIPE_R5C_FROZEN, &sh->state);
>> +}
>> +
>> +static void r5c_finish_cache_stripe(struct stripe_head *sh)
>
> A comment just before this function saying when it is called would help.
> "Call when a stripe_head is safe in the journal, but has not yet been
> written to the RAID" ??

Yes, exactly: "Call when a stripe_head is safe in the journal, but has
not yet been written to the RAID".

>> +{
>> +	struct r5l_log *log = sh->raid_conf->log;
>> +
>> +	if (log->r5c_state == R5C_STATE_WRITE_THROUGH) {
>> +		BUG_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
>> +		set_bit(STRIPE_R5C_WRITTEN, &sh->state);
>> +	} else
>> +		BUG(); /* write back logic in next patch */
>> +}
>> +
>>  static void r5l_io_run_stripes(struct r5l_io_unit *io)
>>  {
>>  	struct stripe_head *sh, *next;
>>
>>  	list_for_each_entry_safe(sh, next, &io->stripe_list, log_list) {
>>  		list_del_init(&sh->log_list);
>> +
>> +		r5c_finish_cache_stripe(sh);
>> +
>>  		set_bit(STRIPE_HANDLE, &sh->state);
>>  		raid5_release_stripe(sh);
>>  	}
>> @@ -412,18 +493,19 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>>  		r5l_append_payload_page(log, sh->dev[i].page);
>>  	}
>>
>> -	if (sh->qd_idx >= 0) {
>> +	if (parity_pages == 2) {
>>  		r5l_append_payload_meta(log, R5LOG_PAYLOAD_PARITY,
>>  					sh->sector, sh->dev[sh->pd_idx].log_checksum,
>>  					sh->dev[sh->qd_idx].log_checksum, true);
>>  		r5l_append_payload_page(log, sh->dev[sh->pd_idx].page);
>>  		r5l_append_payload_page(log, sh->dev[sh->qd_idx].page);
>> -	} else {
>> +	} else if (parity_pages == 1) {
>>  		r5l_append_payload_meta(log, R5LOG_PAYLOAD_PARITY,
>>  					sh->sector, sh->dev[sh->pd_idx].log_checksum,
>>  					0, false);
>>  		r5l_append_payload_page(log, sh->dev[sh->pd_idx].page);
>> -	}
>> +	} else
>> +		BUG_ON(parity_pages != 0);
>
> Why this change?  I don't think it is wrong, but I don't see why it
> belongs here.  How does it help?

In the journal-only implementation, parity_pages can only be 1 or 2.
With the write-back cache, parity_pages can also be 0 (data only), so
that is a new case to handle here. The BUG_ON part is not really
necessary.

>
>> +
>> +	if (!log || test_bit(STRIPE_R5C_FROZEN, &sh->state))
>> +		return -EAGAIN;
>> +
>> +	if (conf->log->r5c_state == R5C_STATE_WRITE_THROUGH ||
>> +	    conf->mddev->degraded != 0) {
>
> Why do you disable write-back when the array is degraded?

I don't think this is really necessary. Our original thought was to
simplify the case: since the array is already much slower in degraded
mode, a write-back cache cannot make it fast anyway. I think we can
enable write-back in degraded mode later on.

>>
>> +	/*
>> +	 * Now to consider new write requests, cache write back and what else,
>> +	 * if anything should be read.  We do not handle new writes when:
>>  	 * 1/ A 'write' operation (copy+xor) is already in flight.
>>  	 * 2/ A 'check' operation is in flight, as it may clobber the parity
>>  	 *    block.
>> +	 * 3/ A r5c cache log write is in flight.
>>  	 */
>> -	if (s.to_write && !sh->reconstruct_state && !sh->check_state)
>> +	if ((s.to_write || test_bit(STRIPE_R5C_FROZEN, &sh->state)) &&
>> +	    !sh->reconstruct_state && !sh->check_state && !sh->log_io)
>>  		handle_stripe_dirtying(conf, sh, &s, disks);
>
> OK, I'm definitely confused now.
> If the stripe is FROZEN, surely you don't want to call
> handle_stripe_dirtying().
handle_stripe_dirtying() is the beginning of the original write path:
shall we do rcw or rmw? Once frozen, we need to calculate parity and
write it to the RAID, so we need handle_stripe_dirtying() here.

Thanks,
Song

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html