Re: [PATCH v5 5/8] md/r5cache: reclaim support

> On Oct 18, 2016, at 7:03 PM, NeilBrown <neilb@xxxxxxxx> wrote:
> 
>> AL, 5 seconds) reclaim
>>   if there is no reclaim in the past 5 seconds.
> 
> 5 seconds is an arbitrary number.  I don't think it needs to be made
> configurable, but it might help to justify it.
> I would probably make it closer to 30 seconds.  There really isn't any
> rush.  If there is much load, it will never wait this long.  If there is
> not much load ... then it probably doesn't matter.

Yeah, 30 seconds should work here.
> 
> 
>> 2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (32) cached full stripes
>>   (r5c_check_cached_full_stripe)
> 
> This is another arbitrary number, but I think it needs more
> justification.  Why 32?  That's 128K per device, which isn't very much.
> It might make sense to set the batch size based on the stripe size
> (e.g. at least one stripe?) or on the cache size.

How about we use something like min(stripe_cache_size / 8, 256)? Journal space
pressure will still trigger reclaim through the other conditions (#4 below).
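A userspace sketch of the proposed formula (the function name and plain-int
types are hypothetical; the kernel code would use min_t() on the real fields):

```c
/* Sketch of the proposed batch-size formula: scale the full-stripe flush
 * batch with the stripe cache size, capped at 256, so a larger cache
 * flushes full stripes in proportionally larger batches. */
static int r5c_full_stripe_flush_batch(int stripe_cache_size)
{
	int batch = stripe_cache_size / 8;

	return batch < 256 ? batch : 256;
}
```

With the default stripe_cache_size of 256 this gives a batch of 32, matching
the current R5C_FULL_STRIPE_FLUSH_BATCH.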

> 
>> 3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
>> 4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
>> 
>> r5c_do_reclaim() contains new logic of reclaim.
>> 
>> For stripe cache:
>> 
>> When stripe cache pressure is high (more than 3/4 stripes are cached,
> 
> You say "3/4" here but I think the code says "1/2"
>> +	if (total_cached > conf->min_nr_stripes * 1 / 2 ||
> But there is also
>> +	if (total_cached > conf->min_nr_stripes * 3 / 4 ||
>> +	    atomic_read(&conf->empty_inactive_list_nr) > 0)
> 
> so maybe you do mean 3/4..  Maybe some comments closer to the different
> fractions would help.

There are two levels of cache pressure:
high pressure: total_cached > 3/4 || empty_inactive_list_nr > 0
moderate pressure: total_cached > 1/2.

The condition:
+	if (total_cached > conf->min_nr_stripes * 1 / 2 ||
+	    atomic_read(&conf->empty_inactive_list_nr) > 0)

covers "either high or moderate pressure". 
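The two tiers can be sketched in userspace C as follows (the enum and function
name are hypothetical; min_nr_stripes and empty_inactive_list_nr come from the
patch, here as plain ints instead of the real conf fields):

```c
/* Sketch of the two cache-pressure tiers described above. */
enum r5c_pressure {
	R5C_PRESSURE_NONE,
	R5C_PRESSURE_MODERATE,	/* total_cached > 1/2 of min_nr_stripes */
	R5C_PRESSURE_HIGH,	/* total_cached > 3/4, or an inactive list is empty */
};

static enum r5c_pressure r5c_cache_pressure(int total_cached,
					    int min_nr_stripes,
					    int empty_inactive_list_nr)
{
	if (total_cached > min_nr_stripes * 3 / 4 ||
	    empty_inactive_list_nr > 0)
		return R5C_PRESSURE_HIGH;
	if (total_cached > min_nr_stripes * 1 / 2)
		return R5C_PRESSURE_MODERATE;
	return R5C_PRESSURE_NONE;
}
```

The condition quoted above (1/2 threshold OR empty inactive list) then reads as
"pressure is at least moderate".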


>> @@ -141,6 +150,12 @@ struct r5l_log {
>> 
>> 	/* for r5c_cache */
>> 	enum r5c_state r5c_state;
>> +
>> +	/* all stripes in r5cache, in the order of seq at sh->log_start */
>> +	struct list_head stripe_in_cache_list;
>> +
>> +	spinlock_t stripe_in_cache_lock;
> 
> You use the _irqsave / _irqrestore versions of spinlock for this lock,
> but I cannot see where it is taken from interrupt context.
> What did I miss?

Let me fix them. 

> 
>> +/* evaluate log space usage and update R5C_LOG_TIGHT and R5C_LOG_CRITICAL */
>> +static inline void r5c_update_log_state(struct r5l_log *log)
>> +{
>> +	struct r5conf *conf = log->rdev->mddev->private;
>> +	sector_t free_space = r5l_ring_distance(log, log->log_start,
>> +						log->last_checkpoint);
>> +	sector_t reclaim_space = r5c_log_required_to_flush_cache(conf);
>> +
>> +	if (!r5c_is_writeback(log))
>> +		return;
>> +	else if (free_space < 2 * reclaim_space) {
>> +		set_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
>> +	} else if (free_space < 3 * reclaim_space) {
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
>> +	} else if (free_space > 4 * reclaim_space) {
>> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +	} else
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> 
> Hmmm...
> We set LOG_CRITICAL if free_space < 2*reclaim_space, else we clear it.
> We set LOG_TIGHT if free_space < 3*reclaim_space and clear it if > 4*reclaim_space
> 
> Would the code be clearer if you just made that explicit.
> 
> At the very least, change
>> +	} else if (free_space > 4 * reclaim_space) {
>> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +	} else
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> 
> to
> 
>> +	} else if (free_space < 4 * reclaim_space) {
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +	} else {
>> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +     }
> 
> so the order of the branches is consistent and all the tests a '<'.

I will revise the code and the comments. 

> 
>> 
>> 	BUG_ON(reclaimable < 0);
>> @@ -941,7 +1131,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
>> 
>> 	mutex_lock(&log->io_mutex);
>> 	log->last_checkpoint = next_checkpoint;
>> -	log->last_cp_seq = next_cp_seq;
> 
> Why don't we update last_cp_seq here any more?

There are two reasons:
1. We are not using last_cp_seq after recovery is done, so it is not necessary
    to update it. Shaohua had some idea to use it for reclaim (to keep the
    ordering of stripes), but my initial implementation uses sh->log_start and
    r5l_ring_distance() instead.
2. With the write-back cache, last_cp_seq would be the seq of the first stripe
    in stripe_in_cache_list. To track last_cp_seq, we would need an extra
    first_seq_in_cache field in stripe_head, which I feel is a waste of memory.
   
> 
>> +
>> +/*
>> + * if num < 0, flush all stripes
>> + * if num == 0, flush all full stripes
>> + * if num > 0, flush all full stripes. If less than num full stripes are
>> + *             flushed, flush some partial stripes until totally num stripes are
>> + *             flushed or there are no more cached stripes.
>> + */
> 
> I'm not very keen on the idea of having "num < 0".
> 
> I'd rather the meaning was:
>  flush all full stripes, and a total of at least num stripes.
> 
> To flush all stripes, pass MAX_INT for num (or similar).

This is a great idea. I will fix it. 

Thanks,
Song
