> On Sep 27, 2016, at 5:34 PM, Shaohua Li <shli@xxxxxxxxxx> wrote:
>
>>
>> +        if (!conf->log)
>> +                return;
>> +        spin_lock(&conf->device_lock);
>> +        if (r5c_total_cached_stripes(conf) > conf->max_nr_stripes * 3 / 4 ||
>> +            atomic_read(&conf->empty_inactive_list_nr) > 0)
>> +                r5c_flush_cache(conf, R5C_RECLAIM_STRIPE_GROUP);
>
> I still worry about the max_nr_stripes usage. It can be changed at runtime.
> If there are not enough stripes, should we just allocate more stripes or
> reclaim the stripe cache? If the memory system tries to shrink stripes (eg,
> decrease max_nr_stripes), will it cause a deadlock for r5cache?
>

The write cache will not deadlock because of stripe cache usage: a cached
stripe keeps holding its stripe_head until the data is flushed to the raid
disks.

>> +        else if (r5c_total_cached_stripes(conf) >
>> +                 conf->max_nr_stripes * 1 / 2)
>> +                r5c_flush_cache(conf, 1);
>
> This one is a defensive reclaim. It should always reclaim stripes with full
> data. If there are not enough such stripes, do nothing. Flushing 1 stripe
> would always be wrong unless we are in a critical stripe space shortage; as
> reclaim involves a disk cache flush and is slow, we should do as much
> aggregation as possible.

I will remove the defensive reclaim.

>
>>  *
>> @@ -198,10 +260,9 @@ void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh)
>>  {
>>          struct r5conf *conf = sh->raid_conf;
>>
>> -        if (!conf->log)
>> +        if (!conf->log || test_bit(STRIPE_R5C_FROZEN, &sh->state))
>>                  return;
>>
>> -        WARN_ON(test_bit(STRIPE_R5C_FROZEN, &sh->state));
>>          set_bit(STRIPE_R5C_FROZEN, &sh->state);
>
> This is confusing. The WARN_ON suggests the STRIPE_R5C_FROZEN isn't set for
> sh, but the change suggests it's possible the bit is set. Which one is
> correct?
>

Will fix this.

>>
>> @@ -518,6 +583,14 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>>          atomic_inc(&io->pending_stripe);
>>          sh->log_io = io;
>>
>> +        if (sh->log_start == MaxSector) {
>> +                BUG_ON(!list_empty(&sh->r5c));
>> +                sh->log_start = io->log_start;
>> +                spin_lock_irqsave(&log->stripe_in_cache_lock, flags);
>> +                list_add_tail(&sh->r5c,
>> +                              &log->stripe_in_cache);
>> +                spin_unlock_irqrestore(&log->stripe_in_cache_lock, flags);
>> +        }
>
> what if it's in writethrough mode?

This does not affect write-through mode, but let me add an explicit check just
in case.

>
>> +        last_checkpoint = (list_first_entry(&log->stripe_in_cache,
>> +                                            struct stripe_head, r5c))->log_start;
>> +        spin_unlock(&log->stripe_in_cache_lock);
>> +        if (sh->log_start != last_checkpoint) {
>> +                spin_lock(&log->no_space_stripes_lock);
>> +                list_add_tail(&sh->log_list, &log->no_space_stripes);
>> +                spin_unlock(&log->no_space_stripes_lock);
>> +                mutex_unlock(&log->io_mutex);
>> +                return -ENOSPC;
>
> So if a stripe is in cache, we try to reclaim it. We should have some
> mechanism to guarantee there is enough space for reclaim (eg for parity).
> Otherwise there could be a deadlock, because the space allocation in the
> reclaim path is there to free space. Could you please explain how this is
> done?
>

I redefined this part in the newer version, which should be clearer.

>> +        } else if (!r5l_has_free_space(log, reserve)) {
>> +                WARN(1, "%s: run out of journal space\n", __func__);
>> +                BUG();
>
> that's scary, why does it happen?
>

I rewrote some of this code in the newer version. If something similar still
happens, it is a bug in reclaim.

>>
>> +                if (test_bit(R5C_LOG_CRITICAL, &conf->cache_state) &&
>> +                    sh->log_start != last_checkpoint)
>> +                        continue;
>
> what's this check for?

With this check, the code only flushes stripes whose data sits at
last_checkpoint in the journal. This is no longer needed in the newer version;
I will remove it.
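To illustrate what that gate was doing before it goes away, here is a toy
user-space model, not kernel code; the struct, flush_pass() and the numbers
are all invented for this sketch. The idea is that when journal space is
critical, only the stripe whose data sits at the last checkpoint gets flushed,
because flushing that stripe is what allows the checkpoint, and therefore the
reclaimable space, to move forward:

/*
 * Toy user-space model of the R5C_LOG_CRITICAL gate discussed above.
 * Not kernel code; all names here are made up for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_stripe {
        unsigned long log_start;   /* journal position of this stripe's data */
        bool          needs_flush;
};

static void flush_pass(struct toy_stripe *stripes, int n,
                       unsigned long last_checkpoint, bool log_critical)
{
        for (int i = 0; i < n; i++) {
                if (!stripes[i].needs_flush)
                        continue;
                /* mirrors: if (critical && sh->log_start != last_checkpoint) continue; */
                if (log_critical && stripes[i].log_start != last_checkpoint)
                        continue;
                printf("flushing stripe at log_start %lu\n", stripes[i].log_start);
                stripes[i].needs_flush = false;
        }
}

int main(void)
{
        struct toy_stripe stripes[] = {
                { .log_start = 100, .needs_flush = true },
                { .log_start = 108, .needs_flush = true },
                { .log_start = 116, .needs_flush = true },
        };

        /* Space is critical: only the stripe at the checkpoint (100) is flushed. */
        flush_pass(stripes, 3, 100, true);
        /* Space recovered: the remaining stripes are flushed. */
        flush_pass(stripes, 3, 108, false);
        return 0;
}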
>
>>          list_del_init(&sh->log_list);
>> +
>>  static sector_t r5l_reclaimable_space(struct r5l_log *log)
>>  {
>> +        struct r5conf *conf = log->rdev->mddev->private;
>> +
>>          return r5l_ring_distance(log, log->last_checkpoint,
>> -                                 log->next_checkpoint);
>> +                                 r5c_calculate_last_cp(conf));
>
> will this work for writethrough?

r5c_calculate_last_cp() returns next_checkpoint in write-through mode.

>
>>  }
>>
>>
>>          if (reclaimable == 0)
>>                  return;
>> -
>>          /*
>>           * write_super will flush cache of each raid disk. We must write super
>>           * here, because the log area might be reused soon and we don't want to
>> @@ -877,10 +995,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
>>
>>          mutex_lock(&log->io_mutex);
>>          log->last_checkpoint = next_checkpoint;
>> -        log->last_cp_seq = next_cp_seq;
>
> why not update last_cp_seq?

Currently we don't need last_cp_seq anywhere in the code. To keep track of it,
we would have to add the field to stripe_head, and that does not seem
necessary.

>>          mutex_unlock(&log->io_mutex);
>> -
>> -        r5l_run_no_space_stripes(log);
>
> I don't understand why you moved r5l_run_no_space_stripes to r5c_flush_cache.
> It's natural to run this after some space is reclaimed.

I will move it back.

>>  }
>>
>>  static void r5l_reclaim_thread(struct md_thread *thread)
>> @@ -891,7 +1006,9 @@ static void r5l_reclaim_thread(struct md_thread *thread)
>>
>>          if (!log)
>>                  return;
>> +        r5c_do_reclaim(conf);
>>          r5l_do_reclaim(log);
>> +        md_wakeup_thread(mddev->thread);
>
> this wakeup is a bit strange. After we reclaim some space, we will rerun
> pending stripes, which will wake up mddev->thread. Do we miss some wakeups in
> other reclaim places?

This one is not really necessary. I will remove it.

>>  }
>>
>>  void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
>>
>>  }
>>
>> -static void r5c_flush_stripe(struct r5conf *conf, struct stripe_head *sh)
>> +/*
>> + * r5c_flush_cache will move stripe from cached list to handle_list or
>> + * r5c_priority_list
>
> What's the r5c_priority_list for? If you want to make sure reclaim makes
> progress, I think it's the wrong way. If there is no space, handling other
> normal stripes just means moving them to the no_space list and doing nothing
> else. Then the reclaim stripes will get their turn to run. There is no extra
> list required.

I will try to remove it.

>
>>
>> +        BUG_ON(test_bit(STRIPE_R5C_PRIORITY, &sh->state) &&
>> +               !test_bit(STRIPE_HANDLE, &sh->state));
>> +
>> +        if (test_bit(STRIPE_R5C_PRIORITY, &sh->state))
>> +                return 0;
>> +        if (test_bit(STRIPE_HANDLE, &sh->state) && !priority)
>> +                return 0;
>> +
>>          r5c_freeze_stripe_for_reclaim(sh);
>> -        atomic_inc(&conf->active_stripes);
>> +        if (!test_and_set_bit(STRIPE_HANDLE, &sh->state)) {
>> +                atomic_inc(&conf->active_stripes);
>> +        }
>
> shouldn't the stripe always be accounted as active before it's reclaimed? Do
> we decrease the count before the stripe is reclaimed? sounds like a bug.
>

We decrease the active count when the stripe is on the r5c cached full-stripe
or cached partial-stripe list. These two lists are variations of inactive_list.

>>
>> +
>> +                mutex_unlock(&log->io_mutex);
>> +                return -ENOSPC;
>>          }
>> +        pr_debug("%s: write sh %lu to free log space\n", __func__, sh->sector);
>> +        }
>> +        if (!r5l_has_free_space(log, reserve)) {
>> +                pr_err("%s: cannot reserve space %d\n", __func__, reserve);
>> +                BUG();
>
> same here. we should put the stripe into no_space list. If we can't allocate
> space eventually, it indicates reclaim has a bug.

BUG() here already indicates a bug in reclaim. I will test this thoroughly
with the new version.
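For anyone skimming the thread, the park-and-retry pattern behind the no_space
list (quoted further up in this mail) can be sketched as a toy user-space
program. None of this is the actual md code; every name in it (toy_log,
toy_reserve, toy_run_no_space) is invented for the illustration:

/*
 * Toy user-space model of "park on a no_space list and re-run after reclaim".
 * Not kernel code; all names are made up.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_PARKED 16

struct toy_log {
        unsigned long free_space;          /* journal space still available */
        int           parked[MAX_PARKED];  /* stripe ids waiting for space  */
        int           nr_parked;
};

/* Try to reserve space for one stripe; park it instead of failing hard. */
static bool toy_reserve(struct toy_log *log, int stripe_id, unsigned long need)
{
        if (log->free_space >= need) {
                log->free_space -= need;
                printf("stripe %d: reserved %lu\n", stripe_id, need);
                return true;
        }
        if (log->nr_parked < MAX_PARKED)
                log->parked[log->nr_parked++] = stripe_id;  /* "no_space list" */
        printf("stripe %d: parked until reclaim frees space\n", stripe_id);
        return false;
}

/* Called after reclaim frees space: retry everything that was parked. */
static void toy_run_no_space(struct toy_log *log, unsigned long reclaimed,
                             unsigned long need)
{
        int n = log->nr_parked;

        log->free_space += reclaimed;
        log->nr_parked = 0;
        for (int i = 0; i < n; i++)
                toy_reserve(log, log->parked[i], need);
}

int main(void)
{
        struct toy_log log = { .free_space = 8 };

        toy_reserve(&log, 1, 8);      /* succeeds, uses all space            */
        toy_reserve(&log, 2, 8);      /* parked, no space left               */
        toy_run_no_space(&log, 8, 8); /* reclaim frees 8, stripe 2 retried   */
        return 0;
}

Here toy_run_no_space() plays roughly the role r5l_run_no_space_stripes() plays
in the driver: once reclaim has returned some space, everything that was parked
gets another chance to reserve.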
>>
>> +        if (before_jiffies > 20)
>> +                pr_debug("%s: wait for sh takes %lu jiffies\n", __func__, before_jiffies);
>
> please remove the debug code.
>
> Thanks,
> Shaohua