Re: [PATCH v5 5/8] md/r5cache: reclaim support


 



On Thu, Oct 13 2016, Song Liu wrote:

> There are two limited resources, stripe cache and journal disk space.
> For better performance, we prioritize reclaim of full stripe writes.
> To free up more journal space, we free earliest data on the journal.
>
> In current implementation, reclaim happens when:
> 1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 5 seconds) reclaim
>    if there is no reclaim in the past 5 seconds.

5 seconds is an arbitrary number.  I don't think it needs to be made
configurable, but it might help to justify it.
I would probably make it closer to 30 seconds.  There really isn't any
rush.  If there is much load, it will never wait this long.  If there is
not much load ... then it probably doesn't matter.


> 2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (32) cached full stripes
>    (r5c_check_cached_full_stripe)

This is another arbitrary number, but I think it needs more
justification.  Why 32?  That's 128K per device, which isn't very much.
It might make sense to set the batch size based on the stripe size
(e.g. at least one stripe?) or on the cache size.
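To make the suggestion concrete, here is one illustrative way the batch
could scale with the cache size rather than being a fixed 32.  This is
only a sketch; `full_stripe_flush_batch` and the 1/8 fraction are
hypothetical, not anything in the patch:

```c
/* Hypothetical sizing sketch: scale the full-stripe flush batch with
 * the stripe cache so the batch is never a trivial fraction of it,
 * while keeping the patch's current value (32) as a floor. */
static int full_stripe_flush_batch(int min_nr_stripes)
{
	int batch = min_nr_stripes / 8;	/* e.g. 1/8 of the cache */

	return batch < 32 ? 32 : batch;
}
```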

> 3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
> 4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
>
> r5c_do_reclaim() contains new logic of reclaim.
>
> For stripe cache:
>
> When stripe cache pressure is high (more than 3/4 stripes are cached,

You say "3/4" here but I think the code says "1/2"
> +	if (total_cached > conf->min_nr_stripes * 1 / 2 ||
But there is also
> +	if (total_cached > conf->min_nr_stripes * 3 / 4 ||
> +	    atomic_read(&conf->empty_inactive_list_nr) > 0)

so maybe you do mean 3/4.  Maybe some comments next to the different
fractions would help.

> @@ -141,6 +150,12 @@ struct r5l_log {
>  
>  	/* for r5c_cache */
>  	enum r5c_state r5c_state;
> +
> +	/* all stripes in r5cache, in the order of seq at sh->log_start */
> +	struct list_head stripe_in_cache_list;
> +
> +	spinlock_t stripe_in_cache_lock;

You use the _irqsave / _irqrestore versions of spinlock for this lock,
but I cannot see where it is taken from interrupt context.
What did I miss?

> +/* evaluate log space usage and update R5C_LOG_TIGHT and R5C_LOG_CRITICAL */
> +static inline void r5c_update_log_state(struct r5l_log *log)
> +{
> +	struct r5conf *conf = log->rdev->mddev->private;
> +	sector_t free_space = r5l_ring_distance(log, log->log_start,
> +						log->last_checkpoint);
> +	sector_t reclaim_space = r5c_log_required_to_flush_cache(conf);
> +
> +	if (!r5c_is_writeback(log))
> +		return;
> +	else if (free_space < 2 * reclaim_space) {
> +		set_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +	} else if (free_space < 3 * reclaim_space) {
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +	} else if (free_space > 4 * reclaim_space) {
> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +	} else
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);

Hmmm...
We set LOG_CRITICAL if free_space < 2*reclaim_space, else we clear it.
We set LOG_TIGHT if free_space < 3*reclaim_space and clear it if > 4*reclaim_space

Would the code be clearer if you just made that explicit?

At the very least, change
> +	} else if (free_space > 4 * reclaim_space) {
> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +	} else
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);

to

> +	} else if (free_space < 4 * reclaim_space) {
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +	} else {
> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +	}

so the order of the branches is consistent and all the tests are '<'.
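Written out as a stand-alone model, the suggested branch order makes the
hysteresis explicit.  This is only an illustrative user-space sketch of
the threshold logic in r5c_update_log_state(); the flag macros and the
plain bitmask stand in for the kernel's cache_state bitops:

```c
#define R5C_LOG_TIGHT		(1u << 0)
#define R5C_LOG_CRITICAL	(1u << 1)

/* Model of the suggested structure: every test is '<', and the band
 * between 3x and 4x reclaim_space leaves TIGHT unchanged, which is
 * the hysteresis hidden in the original trailing "else". */
static unsigned int update_log_state(unsigned long free_space,
				     unsigned long reclaim_space,
				     unsigned int state)
{
	if (free_space < 2 * reclaim_space) {
		state |= R5C_LOG_CRITICAL;
		state |= R5C_LOG_TIGHT;
	} else if (free_space < 3 * reclaim_space) {
		state &= ~R5C_LOG_CRITICAL;
		state |= R5C_LOG_TIGHT;
	} else if (free_space < 4 * reclaim_space) {
		state &= ~R5C_LOG_CRITICAL;
		/* TIGHT deliberately left as-is */
	} else {
		state &= ~R5C_LOG_TIGHT;
		state &= ~R5C_LOG_CRITICAL;
	}
	return state;
}
```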

>  
> +/*
> + * do not release a stripe to cached lists in quiesce
> + */

You tell me what this function doesn't do, but not what it does.

This function is only called once, and I think the code would be
clearer if it were just included in-line in that one place.

> +void r5c_prepare_stripe_for_release_in_quiesce(struct stripe_head *sh)
> +{
> +	if (!test_bit(STRIPE_HANDLE, &sh->state) &&
> +	    atomic_read(&sh->dev_in_cache) != 0) {
> +		if (!test_bit(STRIPE_R5C_FROZEN, &sh->state))
> +			r5c_freeze_stripe_for_reclaim(sh);
> +		set_bit(STRIPE_HANDLE, &sh->state);
> +	}
> +}
> +
>  
> @@ -709,6 +867,7 @@ static void r5l_run_no_space_stripes(struct r5l_log *log)
>  	while (!list_empty(&log->no_space_stripes)) {
>  		sh = list_first_entry(&log->no_space_stripes,
>  				      struct stripe_head, log_list);
> +		set_bit(STRIPE_PREREAD_ACTIVE, &sh->state);
>  		list_del_init(&sh->log_list);
>  		set_bit(STRIPE_HANDLE, &sh->state);
>  		raid5_release_stripe(sh);

This chunk doesn't seem to belong here.  It looks like it might be a
bugfix that is quite independent of the rest of the code.  Is it?
If it is, it probably belongs in a separate patch.


>  static void r5l_do_reclaim(struct r5l_log *log)
>  {
> +	struct r5conf *conf = log->rdev->mddev->private;
>  	sector_t reclaim_target = xchg(&log->reclaim_target, 0);
>  	sector_t reclaimable;
>  	sector_t next_checkpoint;
> -	u64 next_cp_seq;
>  
>  	spin_lock_irq(&log->io_list_lock);
>  	/*
> @@ -924,8 +1115,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
>  				    log->io_list_lock);
>  	}
>  
> -	next_checkpoint = log->next_checkpoint;
> -	next_cp_seq = log->next_cp_seq;
> +	next_checkpoint = r5c_calculate_new_cp(conf);
>  	spin_unlock_irq(&log->io_list_lock);
>  
>  	BUG_ON(reclaimable < 0);
> @@ -941,7 +1131,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
>  
>  	mutex_lock(&log->io_mutex);
>  	log->last_checkpoint = next_checkpoint;
> -	log->last_cp_seq = next_cp_seq;

Why don't we update last_cp_seq here any more?


> +
> +/*
> + * if num < 0, flush all stripes
> + * if num == 0, flush all full stripes
> + * if num > 0, flush all full stripes. If less than num full stripes are
> + *             flushed, flush some partial stripes until totally num stripes are
> + *             flushed or there is no more cached stripes.
> + */

I'm not very keen on the idea of having "num < 0".

I'd rather the meaning was:
  flush all full stripes, and a total of at least num stripes.

To flush all stripes, pass INT_MAX for num (or similar).

> +void r5c_flush_cache(struct r5conf *conf, int num)
> +{
> +	int count;
> +	struct stripe_head *sh, *next;
> +
> +	assert_spin_locked(&conf->device_lock);
> +	if (!conf->log)
> +		return;
> +
> +	count = 0;
> +	list_for_each_entry_safe(sh, next, &conf->r5c_full_stripe_list, lru) {
> +		r5c_flush_stripe(conf, sh);
> +		count++;
> +	}
> +
> +	if (num == 0 || (num > 0 && count >= num))
> +		return;
> +	list_for_each_entry_safe(sh, next,
> +				 &conf->r5c_partial_stripe_list, lru) {
> +		r5c_flush_stripe(conf, sh);
> +		if (num > 0 && count++ >= num)
> +			break;
> +	}
> +}
> +
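The suggested contract can be modeled in a few lines of user-space C.
This is only a sketch of the semantics, with list lengths standing in
for the full/partial stripe lists; `flush_count` is a hypothetical
helper, not part of the patch:

```c
#include <limits.h>

/* Model of the suggested r5c_flush_cache() contract:
 * flush every full stripe, then flush partial stripes until at least
 * num stripes have been flushed in total (or the partial list runs
 * out).  Passing INT_MAX flushes everything, so no num < 0 special
 * case is needed.  Returns the number of stripes flushed. */
static int flush_count(int full, int partial, int num)
{
	int count = full;		/* all full stripes always go */

	while (count < num && partial > 0) {
		partial--;
		count++;		/* flush one partial stripe */
	}
	return count;
}
```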


Thanks,
NeilBrown


