Re: [PATCH v7 1/9] bcache: fix cached_dev->count usage for bch_cache_set_error()

Michael Lyle <mlyle@xxxxxxxx> · Tue, 27 Feb 2018 10:01:16 -0800

Hi Coly Li---

Thanks for this.  I've been uncomfortable with the interaction between
the dirty status and the refcount (even aside from this issue), and I
believe you've resolved it.  I'm sorry for the slow review-- it's taken
me some time to convince myself that this is safe.

I'm getting closer to be able to apply these for 4.17.

Reviewed-by: Michael Lyle <mlyle@xxxxxxxx>

Mike

On 02/27/2018 08:55 AM, Coly Li wrote:
> When bcache metadata I/O fails, bcache will call bch_cache_set_error()
> to retire the whole cache set. The expected behavior to retire a cache
> set is to unregister the cache set, and unregister all backing device
> attached to this cache set, then remove sysfs entries of the cache set
> and all attached backing devices, finally release memory of structs
> cache_set, cache, cached_dev and bcache_device.
> 
> In my testing when journal I/O failure triggered by disconnected cache
> device, sometimes the cache set cannot be retired, and its sysfs
> entry /sys/fs/bcache/<uuid> still exits and the backing device also
> references it. This is not expected behavior.
> 
> When metadata I/O failes, the call senquence to retire whole cache set is,
>         bch_cache_set_error()
>         bch_cache_set_unregister()
>         bch_cache_set_stop()
>         __cache_set_unregister()     <- called as callback by calling
>                                         clousre_queue(&c->caching)
>         cache_set_flush()            <- called as a callback when refcount
>                                         of cache_set->caching is 0
>         cache_set_free()             <- called as a callback when refcount
>                                         of catch_set->cl is 0
>         bch_cache_set_release()      <- called as a callback when refcount
>                                         of catch_set->kobj is 0
> 
> I find if kernel thread bch_writeback_thread() quits while-loop when
> kthread_should_stop() is true and searched_full_index is false, clousre
> callback cache_set_flush() set by continue_at() will never be called. The
> result is, bcache fails to retire whole cache set.
> 
> cache_set_flush() will be called when refcount of closure c->caching is 0,
> and in function bcache_device_detach() refcount of closure c->caching is
> released to 0 by clousre_put(). In metadata error code path, function
> bcache_device_detach() is called by cached_dev_detach_finish(). This is a
> callback routine being called when cached_dev->count is 0. This refcount
> is decreased by cached_dev_put().
> 
> The above dependence indicates, cache_set_flush() will be called when
> refcount of cache_set->cl is 0, and refcount of cache_set->cl to be 0
> when refcount of cache_dev->count is 0.
> 
> The reason why sometimes cache_dev->count is not 0 (when metadata I/O fails
> and bch_cache_set_error() called) is, in bch_writeback_thread(), refcount
> of cache_dev is not decreased properly.
> 
> In bch_writeback_thread(), cached_dev_put() is called only when
> searched_full_index is true and cached_dev->writeback_keys is empty, a.k.a
> there is no dirty data on cache. In most of run time it is correct, but
> when bch_writeback_thread() quits the while-loop while cache is still
> dirty, current code forget to call cached_dev_put() before this kernel
> thread exits. This is why sometimes cache_set_flush() is not executed and
> cache set fails to be retired.
> 
> The reason to call cached_dev_put() in bch_writeback_rate() is, when the
> cache device changes from clean to dirty, cached_dev_get() is called, to
> make sure during writeback operatiions both backing and cache devices
> won't be released.
> 
> Adding following code in bch_writeback_thread() does not work,
>    static int bch_writeback_thread(void *arg)
>         }
> 
> +       if (atomic_read(&dc->has_dirty))
> +               cached_dev_put()
> +
>         return 0;
>  }
> because writeback kernel thread can be waken up and start via sysfs entry:
>         echo 1 > /sys/block/bcache<N>/bcache/writeback_running
> It is difficult to check whether backing device is dirty without race and
> extra lock. So the above modification will introduce potential refcount
> underflow in some conditions.
> 
> The correct fix is, to take cached dev refcount when creating the kernel
> thread, and put it before the kernel thread exits. Then bcache does not
> need to take a cached dev refcount when cache turns from clean to dirty,
> or to put a cached dev refcount when cache turns from ditry to clean. The
> writeback kernel thread is alwasy safe to reference data structure from
> cache set, cache and cached device (because a refcount of cache device is
> taken for it already), and no matter the kernel thread is stopped by I/O
> errors or system reboot, cached_dev->count can always be used correctly.
> 
> The patch is simple, but understanding how it works is quite complicated.
> 
> Changelog:
> v2: set dc->writeback_thread to NULL in this patch, as suggested by Hannes.
> v1: initial version for review.
> 
> Signed-off-by: Coly Li <colyli@xxxxxxx>
> Reviewed-by: Hannes Reinecke <hare@xxxxxxxx>
> Cc: Michael Lyle <mlyle@xxxxxxxx>
> Cc: Junhui Tang <tang.junhui@xxxxxxxxxx>
> ---
>  drivers/md/bcache/super.c     |  1 -
>  drivers/md/bcache/writeback.c | 11 ++++++++---
>  drivers/md/bcache/writeback.h |  2 --
>  3 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 312895788036..9b745c5c1980 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1054,7 +1054,6 @@ int bch_cached_dev_attach(struct cached_dev *dc, struct cache_set *c,
>  	if (BDEV_STATE(&dc->sb) == BDEV_STATE_DIRTY) {
>  		bch_sectors_dirty_init(&dc->disk);
>  		atomic_set(&dc->has_dirty, 1);
> -		refcount_inc(&dc->count);
>  		bch_writeback_queue(dc);
>  	}
>  
> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> index f1d2fc15abcc..b280c134dd4d 100644
> --- a/drivers/md/bcache/writeback.c
> +++ b/drivers/md/bcache/writeback.c
> @@ -572,7 +572,7 @@ static int bch_writeback_thread(void *arg)
>  
>  			if (kthread_should_stop()) {
>  				set_current_state(TASK_RUNNING);
> -				return 0;
> +				break;
>  			}
>  
>  			schedule();
> @@ -585,7 +585,6 @@ static int bch_writeback_thread(void *arg)
>  		if (searched_full_index &&
>  		    RB_EMPTY_ROOT(&dc->writeback_keys.keys)) {
>  			atomic_set(&dc->has_dirty, 0);
> -			cached_dev_put(dc);
>  			SET_BDEV_STATE(&dc->sb, BDEV_STATE_CLEAN);
>  			bch_write_bdev_super(dc, NULL);
>  		}
> @@ -606,6 +605,9 @@ static int bch_writeback_thread(void *arg)
>  		}
>  	}
>  
> +	dc->writeback_thread = NULL;
> +	cached_dev_put(dc);
> +
>  	return 0;
>  }
>  
> @@ -669,10 +671,13 @@ int bch_cached_dev_writeback_start(struct cached_dev *dc)
>  	if (!dc->writeback_write_wq)
>  		return -ENOMEM;
>  
> +	cached_dev_get(dc);
>  	dc->writeback_thread = kthread_create(bch_writeback_thread, dc,
>  					      "bcache_writeback");
> -	if (IS_ERR(dc->writeback_thread))
> +	if (IS_ERR(dc->writeback_thread)) {
> +		cached_dev_put(dc);
>  		return PTR_ERR(dc->writeback_thread);
> +	}
>  
>  	schedule_delayed_work(&dc->writeback_rate_update,
>  			      dc->writeback_rate_update_seconds * HZ);
> diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
> index 587b25599856..0bba8f1c6cdf 100644
> --- a/drivers/md/bcache/writeback.h
> +++ b/drivers/md/bcache/writeback.h
> @@ -105,8 +105,6 @@ static inline void bch_writeback_add(struct cached_dev *dc)
>  {
>  	if (!atomic_read(&dc->has_dirty) &&
>  	    !atomic_xchg(&dc->has_dirty, 1)) {
> -		refcount_inc(&dc->count);
> -
>  		if (BDEV_STATE(&dc->sb) != BDEV_STATE_DIRTY) {
>  			SET_BDEV_STATE(&dc->sb, BDEV_STATE_DIRTY);
>  			/* XXX: should do this synchronously */
>