Re: [PATCH v1 12/15] md/raid5-cache: Add RCU protection to conf->log accesses

Christoph Hellwig <hch@xxxxxxxxxxxxx> · Sat, 21 May 2022 04:50:47 -0700

On Thu, May 19, 2022 at 01:13:08PM -0600, Logan Gunthorpe wrote:
> The mdadm test 21raid5cache randomly fails with NULL pointer accesses
> conf->log when run repeatedly. conf->log was sort of protected with
> a RCU, but most dereferences were not done with the correct functions.
> 
> Add rcu_read_locks() and rcu_access_pointers() to the appropriate
> places.
> 
> Signed-off-by: Logan Gunthorpe <logang@xxxxxxxxxxxx>
> ---
>  drivers/md/raid5-cache.c | 135 +++++++++++++++++++++++++++------------
>  drivers/md/raid5-log.h   |  14 ++--
>  drivers/md/raid5.c       |   4 +-
>  drivers/md/raid5.h       |   2 +-
>  4 files changed, 104 insertions(+), 51 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index f7b402138d16..1dbc7c4b9a15 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -254,7 +254,14 @@ static bool __r5c_is_writeback(struct r5l_log *log)
>  
>  bool r5c_is_writeback(struct r5conf *conf)
>  {
> -	return __r5c_is_writeback(conf->log);
> +	struct r5l_log *log;
> +	bool ret;
> +
> +	rcu_read_lock();
> +	log = rcu_dereference(conf->log);
> +	ret = __r5c_is_writeback(log);

Nit: I'd do away with the local variable

	ret = __r5c_is_writeback(rcu_dereference(conf->log));

> +static struct r5l_log *get_log_for_io(struct r5conf *conf)
> +{
> +	/*
> +	 * rcu_dereference_protected is safe because the array will be
> +	 * quiesced before log_exit() so it can't be called while
> +	 * an IO is in progress.
> +	 */
> +	return rcu_dereference_protected(conf->log, 1);
> +}

The hardcoded one (shouldn't that be a true, btw?) kinda defeats the
purpose of rcu_dereference_protected.  But I can't really think of any
good runtime assert that we could use here.
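
For illustration only, a userspace sketch of why a constant condition makes the check vacuous. The macro below is a simplified stand-in for the kernel's rcu_dereference_protected(), and the update_lock_held flag and both get_log_*() helpers are hypothetical names, not anything in the patch:

```c
#include <stdbool.h>

/*
 * Simplified stand-in for the kernel macro: count a warning whenever the
 * caller's claim of update-side protection is false, then return the pointer.
 */
static int rcu_warnings;
#define rcu_dereference_protected(p, c) \
	(((c) ? (void)0 : (void)rcu_warnings++), (p))

struct r5l_log { int dummy; };

struct r5conf {
	struct r5l_log *log;
	bool update_lock_held;	/* hypothetical stand-in for lockdep_is_held() */
};

/* Checked variant: the condition names what actually protects conf->log. */
static struct r5l_log *get_log_checked(struct r5conf *conf)
{
	return rcu_dereference_protected(conf->log, conf->update_lock_held);
}

/* Unchecked variant: a constant true can never trigger the warning. */
static struct r5l_log *get_log_unchecked(struct r5conf *conf)
{
	return rcu_dereference_protected(conf->log, true);
}
```

With the flag cleared, only the checked variant bumps rcu_warnings; the unchecked one stays silent no matter what the caller holds.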

>  void r5c_check_stripe_cache_usage(struct r5conf *conf)
>  {
> +	struct r5l_log *log = get_log_for_io(conf);
>  	int total_cached;
>  
> -	if (!r5c_is_writeback(conf))
> +	if (!__r5c_is_writeback(log))

This mostly just undoes earlier changes.  Maybe we should have just left
r5c_is_writeback as-is and added a r5c_conf_is_writeback helper on top to
avoid this churn?  In general it would also be nice to have all these
newly added or removed local variables in place before the big fixup.
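
Roughly what I have in mind, as an untested userspace sketch. The RCU macros are no-op stand-ins so it compiles outside the kernel, the structs are cut down to the one field that matters, and the body of r5c_is_writeback is assumed to be the usual NULL check plus journal mode comparison:

```c
#include <stdbool.h>
#include <stddef.h>

/* No-op userspace stand-ins for the kernel RCU primitives. */
#define rcu_read_lock()		do { } while (0)
#define rcu_read_unlock()	do { } while (0)
#define rcu_dereference(p)	(p)

enum r5c_journal_mode {
	R5C_JOURNAL_MODE_WRITE_THROUGH = 0,
	R5C_JOURNAL_MODE_WRITE_BACK = 1,
};

/* Cut-down versions of the real structures. */
struct r5l_log { enum r5c_journal_mode r5c_journal_mode; };
struct r5conf { struct r5l_log *log; };

/* Stays log-based, so callers that already hold a log pointer don't change. */
static bool r5c_is_writeback(struct r5l_log *log)
{
	return log != NULL &&
		log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_BACK;
}

/* New helper on top for callers that only have the conf. */
static bool r5c_conf_is_writeback(struct r5conf *conf)
{
	bool ret;

	rcu_read_lock();
	ret = r5c_is_writeback(rcu_dereference(conf->log));
	rcu_read_unlock();

	return ret;
}
```

Callers in the I/O path that already have the log keep calling r5c_is_writeback(log) and only the conf-based callers move to the new helper.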

>  void r5c_check_cached_full_stripe(struct r5conf *conf)
>  {
> -	if (!r5c_is_writeback(conf))
> -		return;
> +	struct r5l_log *log = get_log_for_io(conf);

This looks odd.