Re: [PATCH v5 4/8] md/r5cache: write part of r5cache

> On Oct 13, 2016, at 11:53 PM, NeilBrown <neilb@xxxxxxxx> wrote:
> 
> On Thu, Oct 13 2016, Song Liu wrote:
>> 
>> For RMW, the code allocates an extra page for each data block
>> being updated.  This is stored in r5dev->page and the old data
>> is read into it.  Then the prexor calculation subtracts ->page
>> from the parity block, and the reconstruct calculation adds the
>> ->orig_page data back into the parity block.
> 
> What happens if the alloc_page() fails?

That will be tough, but solvable. We can:
    read old data to page
    do prexor 
    read new data from journal device to page
    do xor 
    do the rest of the work. 

Or we can force the code to rcw, which does not need the extra page. 
But rcw does not always work in degraded mode. So this is a good 
reason not to do write-back in degraded mode...

>> 
>> This patch includes a fix proposed by ZhengYuan Liu
>> <liuzhengyuan@xxxxxxxxxx>
>> 
>> Signed-off-by: Song Liu <songliubraving@xxxxxx>
> 
> You introduce a new flag:
>> +	R5_InCache,	/* Data in cache */
> 
> but don't document it at all (apart from those three words).
> I must confess that I found it confusing when you talk about the
> "journal" as the "cache".  I understand why, but in my mind, the "cache"
> is the in-memory cache of stripe_heads.
> We now have something that we sometimes call a "journal", sometimes call
> a "log" and sometimes call a "cache".  That is a recipe for confusion.
> 
> A sentence or two explaining how this flag will be used would help in
> reading the rest of the code.

I will try to unify the names here, and add more comments. 

> 
>> +
>> +void r5c_handle_cached_data_endio(struct r5conf *conf,
>> +	  struct stripe_head *sh, int disks, struct bio_list *return_bi)
>> +{
>> +	int i;
>> +
>> +	for (i = sh->disks; i--; ) {
>> +		if (test_bit(R5_InCache, &sh->dev[i].flags) &&
>> +		    sh->dev[i].written) {
> 
> Is it possible for R5_InCache to be set, but 'written' to be NULL ???

Yes, it is possible. A stripe may go through "write data to journal, return IO" multiple
times before parity calculation. When it comes here the second time, a dev written in the 
first pass will still have R5_InCache set, but its ->written will be NULL. 
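The two-pass state can be sketched in a few lines (a hypothetical user-space
model, not the md code; `fake_dev`, `pass_complete`, and the `written`
pointer stand in for `r5dev` and its bio list):

```c
#include <assert.h>
#include <stddef.h>

#define R5_InCache (1 << 0)

struct fake_dev {
	unsigned long flags;
	void *written;		/* stands in for the ->written bio list */
};

/* journal-write completion for one pass: mark cached, return the IO */
static void pass_complete(struct fake_dev *dev)
{
	if (dev->written) {
		dev->flags |= R5_InCache;
		dev->written = NULL;	/* bios handed back to the caller */
	}
}
```

After the first pass the dev keeps R5_InCache with ->written NULL, which is
exactly the combination the endio loop has to tolerate.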

> 
>> 
>> static void r5c_finish_cache_stripe(struct stripe_head *sh)
>> @@ -242,8 +323,10 @@ static void r5c_finish_cache_stripe(struct stripe_head *sh)
>> 	if (log->r5c_state == R5C_STATE_WRITE_THROUGH) {
>> 		BUG_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
>> 		set_bit(STRIPE_R5C_WRITTEN, &sh->state);
>> -	} else
>> -		BUG(); /* write back logic in next patch */
>> +	} else if (test_bit(STRIPE_R5C_FROZEN, &sh->state))
>> +		r5c_handle_parity_cached(sh);
>> +	else
>> +		r5c_handle_data_cached(sh);
>> }
>> 
>> static void r5l_io_run_stripes(struct r5l_io_unit *io)
>> @@ -483,7 +566,8 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>> 	io = log->current_io;
>> 
>> 	for (i = 0; i < sh->disks; i++) {
>> -		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
>> +		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags) &&
>> +		    !test_bit(R5_Wantcache, &sh->dev[i].flags))
>> 			continue;
> 
> If changed R5_Wantcache to R5_Wantjournal, and always set it on blocks
> that were destined for the journal, then this would just be
> 
> 		if (!test_bit(R5_Wantjournal, &sh->dev[i].flags))
> 
> which would make lots of sense...  Just a thought.

We set R5_Wantwrite in multiple places. If we simplify the code here, we will need to make
those places aware of the journal. I guess that is not ideal either? 

 
> 
>> }
>> 
>> -static void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
>> +void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
> 
> Why are you removing the 'static' here?  You don't call it from any
> other file.

In the next patch, it is called from raid5.c. 

> 
>> 
>> 
>> static int r5l_load_log(struct r5l_log *log)
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 2e3e61a..0539f34 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -245,8 +245,25 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
>> 			    < IO_THRESHOLD)
>> 				md_wakeup_thread(conf->mddev->thread);
>> 		atomic_dec(&conf->active_stripes);
>> -		if (!test_bit(STRIPE_EXPANDING, &sh->state))
>> -			list_add_tail(&sh->lru, temp_inactive_list);
>> +		if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
>> +			if (atomic_read(&sh->dev_in_cache) == 0) {
>> +				list_add_tail(&sh->lru, temp_inactive_list);
>> +			} else if (atomic_read(&sh->dev_in_cache) ==
>> +				   conf->raid_disks - conf->max_degraded) {
> 
> Is this really the test that you want?
> Presumably "full stripes" are those that can be written out without
> reading anything, while "partial stripes" will need something read in
> first.
> It is possible that the stripe head contains data for some devices which
> is not in the cache, possibly because a READ request filled it in.
> So we should really be counting the number of devices with R5_UPTODATE.
> ??

Yes, counting R5_UPTODATE would be better. Let me check whether there are
corner cases we need to cover. 
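The suggested check can be sketched as follows (a simplified user-space
model, not the md code; `fake_dev` and `stripe_is_full` are hypothetical,
and the parity device is ignored for brevity, so the count is compared
against the number of data disks just as the original dev_in_cache test was):

```c
#include <assert.h>

#define R5_UPTODATE (1 << 0)

struct fake_dev {
	unsigned long flags;
};

/*
 * A "full stripe" can be written out without reading anything: every
 * data device is up to date, whether the data came from the cache or
 * from a READ that filled it in.  Counting R5_UPTODATE captures both
 * cases; dev_in_cache alone does not.
 */
static int stripe_is_full(struct fake_dev *devs, int raid_disks,
			  int max_degraded)
{
	int i, uptodate = 0;

	for (i = 0; i < raid_disks; i++)
		if (devs[i].flags & R5_UPTODATE)
			uptodate++;
	return uptodate == raid_disks - max_degraded;
}
```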

Thanks,
Song

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
