Re: [RFC] Live resize of backing device

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, 27 Jan 2023, Andrea Tomassetti wrote:

> From 83f490ec8e81c840bdaf69e66021d661751975f2 Mon Sep 17 00:00:00 2001
> From: Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx>
> Date: Thu, 8 Sep 2022 09:47:55 +0200
> Subject: [PATCH v2] bcache: Add support for live resize of backing devices
> 
> Signed-off-by: Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx>
> ---
> Hi Coly,
> this is the second version of the patch. As you correctly pointed out,
> I implemented roll-back functionalities in case of error.
> I'm testing this funcionality using QEMU/KVM vm via libvirt.
> Here the steps:
>   1. make-bcache --writeback -B /dev/vdb -C /dev/vdc
>   2. mkfs.xfs /dev/bcache0
>   3. mount /dev/bcache0 /mnt
>   3. dd if=/dev/random of=/mnt/random0 bs=1M count=1000
>   4. md5sum /mnt/random0 | tee /mnt/random0.md5
>   5. [HOST] virsh blockresize <vm-name> --path <disk-path> --size 
> <new-size>
>   6. xfs_growfs /dev/bcache0
>   6. Repeat steps 3 and 4 with a different file name (e.g. random1.md5)
>   7. umount/reboot/remount and check that the md5 hashes are correct with
>         md5sum -c /mnt/random?.md5
> 
>  drivers/md/bcache/super.c | 84 ++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 83 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index ba3909bb6bea..1435a3f605f8 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -2443,6 +2443,85 @@ static bool bch_is_open(dev_t dev)
>  	return bch_is_open_cache(dev) || bch_is_open_backing(dev);
>  }
> 
> +static bool bch_update_capacity(dev_t dev)
> +{
> +	const size_t max_stripes = min_t(size_t, INT_MAX,
> +					 SIZE_MAX / sizeof(atomic_t));
> +
> +	uint64_t n, n_old, orig_cached_sectors = 0;
> +	void *tmp_realloc;
> +
> +	int nr_stripes_old;
> +	bool res = false;
> +
> +	struct bcache_device *d;
> +	struct cache_set *c, *tc;
> +	struct cached_dev *dcp, *t, *dc = NULL;
> +
> +	uint64_t parent_nr_sectors;
> +
> +	list_for_each_entry_safe(c, tc, &bch_cache_sets, list)
> +		list_for_each_entry_safe(dcp, t, &c->cached_devs, list)
> +			if (dcp->bdev->bd_dev == dev) {
> +				dc = dcp;
> +				goto dc_found;
> +			}
> +
> +dc_found:
> +	if (!dc)
> +		return false;
> +
> +	parent_nr_sectors = bdev_nr_sectors(dc->bdev) - dc->sb.data_offset;
> +
> +	if (parent_nr_sectors == bdev_nr_sectors(dc->disk.disk->part0))
> +		return false;
> +
> +	d = &dc->disk;
> +	orig_cached_sectors = d->c->cached_dev_sectors;
> +
> +	/* Force cached device sectors re-calc */
> +	calc_cached_dev_sectors(d->c);
> +
> +	/* Block writeback thread */
> +	down_write(&dc->writeback_lock);
> +	nr_stripes_old = d->nr_stripes;
> +	n = DIV_ROUND_UP_ULL(parent_nr_sectors, d->stripe_size);
> +	if (!n || n > max_stripes) {
> +		pr_err("nr_stripes too large or invalid: %llu (start sector
> beyond end of disk?)\n",
> +			n);
> +		goto restore_dev_sectors;
> +	}
> +	d->nr_stripes = n;

It looks like there is some overlap between the new bch_update_capacity() 
function and the existing bcache_device_init() function:

	https://github.com/torvalds/linux/blob/master/drivers/md/bcache/super.c#L909

IMHO, it would be ideal if bch_update_capacity() could also handle the 
uninitialized state of full_dirty_stripes/stripe_sectors_dirty (and any 
related bits) at bdev registration time so bcache_device_init() can call 
bch_update_capacity().  Thus, bch_update_capacity() would replace the 
set_capacity() call in bcache_device_init().

If a call to bch_update_capacity() can set the size at registration or 
resize as necessary then it would remove duplicate code and keep 
initialization in only one place.

I'll defer to what Coly thinks is best, just my $0.02 

Cheers,


--
Eric Wheeler



> +
> +	n = d->nr_stripes * sizeof(atomic_t);
> +	n_old = nr_stripes_old * sizeof(atomic_t);
> +	tmp_realloc = kvrealloc(d->stripe_sectors_dirty, n_old,
> +		n, GFP_KERNEL);
> +	if (!tmp_realloc)
> +		goto restore_nr_stripes;
> +
> +	d->stripe_sectors_dirty = (atomic_t *) tmp_realloc;
> +
> +	n = BITS_TO_LONGS(d->nr_stripes) * sizeof(unsigned long);
> +	n_old = BITS_TO_LONGS(nr_stripes_old) * sizeof(unsigned long);
> +	tmp_realloc = kvrealloc(d->full_dirty_stripes, n_old, n, GFP_KERNEL);
> +	if (!tmp_realloc)
> +		goto restore_nr_stripes;
> +
> +	d->full_dirty_stripes = (unsigned long *) tmp_realloc;
> +
> +	if ((res = set_capacity_and_notify(dc->disk.disk, parent_nr_sectors)))
> +		goto unblock_and_exit;
> +
> +restore_nr_stripes:
> +	d->nr_stripes = nr_stripes_old;
> +restore_dev_sectors:
> +	d->c->cached_dev_sectors = orig_cached_sectors;
> +unblock_and_exit:
> +	up_write(&dc->writeback_lock);
> +	return res;
> +}
> +
>  struct async_reg_args {
>  	struct delayed_work reg_work;
>  	char *path;
> @@ -2569,7 +2648,10 @@ static ssize_t register_bcache(struct kobject *k,
> struct kobj_attribute *attr,
>  			mutex_lock(&bch_register_lock);
>  			if (lookup_bdev(strim(path), &dev) == 0 &&
>  			    bch_is_open(dev))
> -				err = "device already registered";
> +				if (bch_update_capacity(dev))
> +					err = "capacity changed";
> +				else
> +					err = "device already registered";
>  			else
>  				err = "device busy";
>  			mutex_unlock(&bch_register_lock);
> --
> 2.39.0
> 
> 
> 
> On 25/1/23 18:59, Coly Li wrote:
> > 
> > 
> >> 2023年1月25日 18:07,Andrea Tomassetti
> >> <andrea.tomassetti-opensource@xxxxxxxx> 写道:
> >>
> >> On Tue, Jan 17, 2023 at 5:18 PM Coly Li <colyli@xxxxxxx> wrote:
> >>>>
> > 
> >>>>>
> >>>>>> struct async_reg_args {
> >>>>>>     struct delayed_work reg_work;
> >>>>>>     char *path;
> >>>>>> @@ -2569,7 +2639,10 @@ static ssize_t register_bcache(struct kobject
> >>>>>> *k, struct kobj_attribute *attr,
> >>>>>>             mutex_lock(&bch_register_lock);
> >>>>>>             if (lookup_bdev(strim(path), &dev) == 0 &&
> >>>>>>                 bch_is_open(dev))
> >>>>>> -                err = "device already registered";
> >>>>>> +                if (bch_update_capacity(dev))
> >>>>>> +                    err = "capacity changed";
> >>>>>> +                else
> >>>>>> +                    err = "device already registered";
> >>>>>
> >>>>>
> >>>>> As I said, it should be a separated write-only sysfile under the cache
> >>>>> device's directory.
> >>>> Can I ask why you don't like the automatic resize way? Why should the
> >>>> resize be manual?
> >>>
> >>> Most of system administrators don’t like such silently automatic things.
> >>> They want to extend the size explicitly, especially when there is other
> >>> dependences in their configurations.
> >>>
> >> What I was trying to say is that, in order to resize a block device, a
> >> manual command should be executed. So, this is already a "non-silent"
> >> automatic thing.
> >> Moreover, if the block device has a FS on it, the FS needs to be
> >> manually grown with some special utilities, e.g. xfs_growfs. So,
> >> again, another non-silent automatic step. Don't you agree?
> >> For example, to resize a qcow device attached to a VM I'm manually
> >> doing a `virsh blockresize`. As soon as I issue that command, the
> >> virtio_blk driver inside the VM detects the disk size change and calls
> >> the `set_capacity_and_notify` function. Why then should bcache behave
> >> differently?
> > 
> > The above VM example makes sense, I am almost convinced.
> > 
> >>
> >> If you're concerned that this can somehow break the
> >> behaviour-compatibility with older versions of the driver, can we
> >> protect this automatic discovery with an optional parameter? Will this
> >> be an option you will take into account?
> > 
> > Then let’s forget the option sysfs at this moment. Once you feel the patch
> > is ready for me to testing, please notice me with detailed steps to redo
> > your testing.
> > At that time during my testing, let’s discuss whether an extra option is
> > necesssary, for now just keep your idea as automatically resize the cached
> > device.
> > 
> > Thanks for your detailed explanation.
> > 
> > Coly Li
> > 
> 
> 

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Linux ARM Kernel]     [Linux Filesystem Development]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux