Thanks Shaohua! I am working on adding more information in the comments and
commit notes.

>>
>> +/*
>> + * Freeze the stripe, thus send the stripe into reclaim path.
>> + *
>> + * This function should only be called from raid5d that handling this stripe,
>> + * or when holds conf->device_lock
>> + */
>
> Do you mean if this called in raid5d, the lock isn't required? Please note we
> could have several threads (like raid5d) handle stripes. Is there any problem
> here?

This requirement is not necessary anymore. I will update the comment here.

>
>> +void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh)
>> +{
>> +	struct r5conf *conf = sh->raid_conf;
>> +
>> +	if (!conf->log)
>> +		return;

>> +static void r5c_handle_parity_cached(struct stripe_head *sh)
>> +{
>> +	int i;
>> +
>> +	for (i = sh->disks; i--; )
>> +		if (test_bit(R5_InCache, &sh->dev[i].flags))
>> +			set_bit(R5_Wantwrite, &sh->dev[i].flags);
>> +	set_bit(STRIPE_R5C_WRITTEN, &sh->state);
>> +}
>> +
>> +static void r5c_finish_cache_stripe(struct stripe_head *sh)
>> +{
>
> I'm hoping this one has a
>	if (not in writeback mode)
>		return
> so it's clearly this is only for writeback mode.
>
>> +	if (test_bit(STRIPE_R5C_FROZEN, &sh->state))
>> +		r5c_handle_parity_cached(sh);
>> +	else
>> +		r5c_handle_data_cached(sh);
>> +}
>> +
>>
>>  	return 0;
>>  }
>>
>> -static void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
>>  /*
>>   * running in raid5d, where reclaim could wait for raid5d too (when it flushes
>>   * data from log to raid disks), so we shouldn't wait for reclaim here
>> @@ -456,11 +541,17 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
>>  		return -EAGAIN;
>>  	}
>>
>> +	WARN_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
>
> is this set even for write through?

In the current implementation, the code still sets STRIPE_R5C_FROZEN in
write-through mode (in r5c_handle_stripe_dirtying), so the reclaim path
behaves the same in write-through mode as the original journal code did.
>>
>> +
>> +	if (!log || test_bit(STRIPE_R5C_FROZEN, &sh->state))
>> +		return -EAGAIN;
>> +
>> +	if (conf->log->r5c_state == R5C_STATE_WRITE_THROUGH ||
>> +	    conf->quiesce != 0 || conf->mddev->degraded != 0) {
>> +		/* write through mode */
>> +		r5c_freeze_stripe_for_reclaim(sh);
>> +		return -EAGAIN;
>
> will this change behavior of R5C_STATE_WRITE_THROUGH?
> r5c_freeze_stripe_for_reclaim does change something not related to writethrough
> mode.

The PREREAD_ACTIVE part of the freeze is not necessary for write-through; I
will fix that. Other than that, the reclaim path works the same way in
write-through mode as the original journal code. I will double-check and make
it clear.

> The quiesce check sounds not necessary. The patch flush all caches in quiesce,
> so no IO is running in quiesce state.

Great catch. The other checks already cover all cases in quiesce.

>>
>> @@ -901,6 +918,13 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
>>
>>  	might_sleep();
>>
>> +	if (s->to_cache) {
>> +		if (r5c_cache_data(conf->log, sh, s) == 0)
>> +			return;
>
> At this stage we fallback to no cache. But I don't see R5_Wantcache is cleared,
> is it a problem?

We did a similar check for the journal part. I think it will be easier if we
add a check in r5l_init_log() and never create the journal/cache when the
array is too big?

>
>> @@ -1543,9 +1570,18 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu)
>>  static void ops_complete_prexor(void *stripe_head_ref)
>>  {
>>  	struct stripe_head *sh = stripe_head_ref;
>> +	int i;
>>
>>  	pr_debug("%s: stripe %llu\n", __func__,
>>  		(unsigned long long)sh->sector);
>> +
>> +	for (i = sh->disks; i--; )
>> +		if (sh->dev[i].page != sh->dev[i].orig_page) {
>> +			struct page *p = sh->dev[i].page;
>> +
>> +			sh->dev[i].page = sh->dev[i].orig_page;
>> +			put_page(p);
>
> What if array hasn't log but skipcopy is enabled?

In the case of skip copy, page and orig_page are the same during
ops_complete_prexor: page == orig_page == old data.
bio_page is the new data. After prexor is done, we run biodrain, where page is
set to bio_page to skip the copy.

>>
>> @@ -4416,7 +4538,7 @@ static void handle_stripe(struct stripe_head *sh)
>>  			struct r5dev *dev = &sh->dev[i];
>>  			if (test_bit(R5_LOCKED, &dev->flags) &&
>>  			    (i == sh->pd_idx || i == sh->qd_idx ||
>> -			     dev->written)) {
>> +			     dev->written || test_bit(R5_InCache, &dev->flags))) {
>>  				pr_debug("Writing block %d\n", i);
>>  				set_bit(R5_Wantwrite, &dev->flags);
>>  				if (prexor)
>> @@ -4456,6 +4578,12 @@ static void handle_stripe(struct stripe_head *sh)
>>  	     test_bit(R5_Discard, &qdev->flags)))))))
>>  		handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
>>
>> +	if (s.just_cached)
>> +		r5c_handle_cached_data_endio(conf, sh, disks, &s.return_bi);
>> +
>> +	if (test_bit(STRIPE_R5C_FROZEN, &sh->state))
>> +		r5l_stripe_write_finished(sh);
>
> what's this for?

r5l_stripe_write_finished() removes sh->log_io. It is actually not necessary
to check for R5C_FROZEN here; I will fix that.

>
>> +
>>  	/* Now we might consider reading some blocks, either to check/generate
>>  	 * parity, or to satisfy requests
>>  	 * or to load a block that is being partially written.
>> @@ -4467,13 +4595,17 @@ static void handle_stripe(struct stripe_head *sh)
>>  	    || s.expanding)
>>  		handle_stripe_fill(sh, &s, disks);
>>
>> -	/* Now to consider new write requests and what else, if anything
>> -	 * should be read. We do not handle new writes when:
>> +	r5c_handle_stripe_written(conf, sh);
>
> please explain why this is required?

I added the following comment:

	/*
	 * If the stripe finishes the full journal write cycle (write to
	 * journal and raid disk), this is the clean up procedure so it is
	 * ready for the next operation.
	 */

>>
>>  	/* maybe we need to check and possibly fix the parity for this stripe
>> @@ -5192,7 +5324,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
>>  	 * later we might have to read it again in order to reconstruct
>>  	 * data on failed drives.
>>  	 */
>> -	if (rw == READ && mddev->degraded == 0 &&
>> +	if (rw == READ && mddev->degraded == 0 && conf->log == NULL &&
>>  	    mddev->reshape_position == MaxSector) {
>>  		bi = chunk_aligned_read(mddev, bi);
>>  		if (!bi)
>
> This should be only for R5C_STATE_WRITE_BACK.

Will fix this.

>>
>> @@ -7662,8 +7800,11 @@ static void raid5_quiesce(struct mddev *mddev, int state)
>>  		/* '2' tells resync/reshape to pause so that all
>>  		 * active stripes can drain
>>  		 */
>> +		r5c_flush_cache(conf, 0);
>>  		conf->quiesce = 2;
>>  		wait_event_cmd(conf->wait_for_quiescent,
>> +			       atomic_read(&conf->r5c_cached_partial_stripes) == 0 &&
>> +			       atomic_read(&conf->r5c_cached_full_stripes) == 0 &&
>
> I don't see a wakeup for this after the check condition is met.

The following in release_inactive_stripe_list() will wake this up:

	if (atomic_read(&conf->active_stripes) == 0)
		wake_up(&conf->wait_for_quiescent);

This is OK because cached stripes (full or partial) have to become active
before they can become inactive.

Thanks,
Song
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html