Re: [RFC] Live resize of backing device

Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx> · Wed, 25 Jan 2023 11:07:07 +0100

On Tue, Jan 17, 2023 at 5:18 PM Coly Li <colyli@xxxxxxx> wrote:
>
>
>
> > 2023年1月12日 00:01，Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx> 写道：
> >
> > Hi Coly,
> > thank you for taking the time to reply, I really hope you are feeling
> > better now.
>
> Thanks. But the recovery is really slow, and my response cannot be in time. I was told it might be better after 1 month, hope it is true.
>
> >
> > On Fri, Dec 30, 2022 at 11:41 AM Coly Li <colyli@xxxxxxx> wrote:
> >>
> >> On 9/8/22 4:32 PM, Andrea Tomassetti wrote:
> >>> From 59787372cf21af0b79e895578ae05b6586dfeb09 Mon Sep 17 00:00:00 2001
> >>> From: Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx>
> >>> Date: Thu, 8 Sep 2022 09:47:55 +0200
> >>> Subject: [PATCH] bcache: Add support for live resize of backing devices
> >>>
> >>> Signed-off-by: Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx>
> >>
> >> Hi Andrea,
> >>
> >> I am just recovered from Omicron, and able to reply email. Let me place
> >> my comments inline.
> >>
> >>
> >>> ---
> >>> Hi Coly,
> >>> Here is the first version of the patch. There are some points I noted
> >>> down
> >>> that I would like to discuss with you:
> >>> - I found it pretty convenient to hook the call of the new added
> >>> function
> >>>   inside the `register_bcache`. In fact, every time (at least from my
> >>>   understandings) a disk changes size, it will trigger a new probe and,
> >>>   thus, `register_bcache` will be triggered. The only inconvenient
> >>>   is that, in case of success, the function will output
> >>
> >> The resize should be triggered manually, and not to do it automatically.
> >>
> >> You may create a sysfs file under the cached device's directory, name it
> >> as "extend_size" or something else you think better.
> >>
> >> Then the sysadmin may extend the cached device size explicitly on a
> >> predictable time.
> >>
> >>> `error: capacity changed` even if it's not really an error.
> >>> - I'm using `kvrealloc`, introduced in kernel version 5.15, to resize
> >>>   `stripe_sectors_dirty` and `full_dirty_stripes`. It shouldn't be a
> >>>   problem, right?
> >>> - There is some reused code between this new function and
> >>>   `bcache_device_init`. Maybe I can move `const size_t max_stripes` to
> >>>   a broader scope or make it a macro, what do you think?
> >>>
> >>> Thank you very much,
> >>> Andrea
> >>>
> >>> drivers/md/bcache/super.c | 75 ++++++++++++++++++++++++++++++++++++++-
> >>> 1 file changed, 74 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> >>> index ba3909bb6bea..9a77caf2a18f 100644
> >>> --- a/drivers/md/bcache/super.c
> >>> +++ b/drivers/md/bcache/super.c
> >>> @@ -2443,6 +2443,76 @@ static bool bch_is_open(dev_t dev)
> >>>     return bch_is_open_cache(dev) || bch_is_open_backing(dev);
> >>> }
> >>>
> >>> +static bool bch_update_capacity(dev_t dev)
> >>> +{
> >>> +    const size_t max_stripes = min_t(size_t, INT_MAX,
> >>> +                     SIZE_MAX / sizeof(atomic_t));
> >>> +
> >>> +    uint64_t n, n_old;
> >>> +    int nr_stripes_old;
> >>> +    bool res = false;
> >>> +
> >>> +    struct bcache_device *d;
> >>> +    struct cache_set *c, *tc;
> >>> +    struct cached_dev *dcp, *t, *dc = NULL;
> >>> +
> >>> +    uint64_t parent_nr_sectors;
> >>> +
> >>> +    list_for_each_entry_safe(c, tc, &bch_cache_sets, list)
> >>> +        list_for_each_entry_safe(dcp, t, &c->cached_devs, list)
> >>> +            if (dcp->bdev->bd_dev == dev) {
> >>> +                dc = dcp;
> >>> +                goto dc_found;
> >>> +            }
> >>> +
> >>> +dc_found:
> >>> +    if (!dc)
> >>> +        return false;
> >>> +
> >>> +    parent_nr_sectors = bdev_nr_sectors(dc->bdev) - dc->sb.data_offset;
> >>> +
> >>> +    if (parent_nr_sectors == bdev_nr_sectors(dc->disk.disk->part0))
> >>> +        return false;
> >>> +
> >>
> >> The above code only handles whole disk using as cached device. If a
> >> partition of a hard drive is used as cache device, and there are other
> >> data after this partition, such condition should be handled as well. So
> >> far I am fine to only extend size when using the whole hard drive as
> >> cached device, but more code is necessary to check and only permits
> >> size-extend for such condition.
> > I don't understand if there's a misalignment here so let me be more
> > clear: this patch is intended to support the live resize of *backing
> > devices*, is this what you mean with "cached device"?
>
> Yes, backing device is cached device.
>
>
> > When I was working on the patch I didn't consider the possibility of
> > live resizing the cache devices, but it should be trivial.
> > So, as far as I understand, a partition cannot be used as a backing
> > device, correct? The whole disk should be used as a backing device,
> > that's why I'm checking and that's why this check should be correct.
> > Am I wrong?
>
> Yes a partition can be used as a backing (cached) device.
> That is to say, if a file system is format on top of the cached device, this file system cannot be directly accessed via the partition. It has to be accessed via the bcache device e.g. /dev/bcache0.
>
>
> >
> >>
> >>> +    if (!set_capacity_and_notify(dc->disk.disk, parent_nr_sectors))
> >>> +        return false;
> >>
> >> The above code should be done when all rested are set.
> >>
> >>
> >>> +
> >>> +    d = &dc->disk;
> >>> +
> >>> +    /* Force cached device sectors re-calc */
> >>> +    calc_cached_dev_sectors(d->c);
> >>
> >> Here c->cached_dev_sectors might be changed, if any of the following
> >> steps fails, it should be restored to previous value.
> >>
> >>
> >>> +
> >>> +    /* Block writeback thread */
> >>> +    down_write(&dc->writeback_lock);
> >>> +    nr_stripes_old = d->nr_stripes;
> >>> +    n = DIV_ROUND_UP_ULL(parent_nr_sectors, d->stripe_size);
> >>> +    if (!n || n > max_stripes) {
> >>> +        pr_err("nr_stripes too large or invalid: %llu (start sector
> >>> beyond end of disk?)\n",
> >>> +            n);
> >>> +        goto unblock_and_exit;
> >>> +    }
> >>> +    d->nr_stripes = n;
> >>> +
> >>> +    n = d->nr_stripes * sizeof(atomic_t);
> >>> +    n_old = nr_stripes_old * sizeof(atomic_t);
> >>> +    d->stripe_sectors_dirty = kvrealloc(d->stripe_sectors_dirty, n_old,
> >>> +        n, GFP_KERNEL);
> >>> +    if (!d->stripe_sectors_dirty)
> >>> +        goto unblock_and_exit;
> >>> +
> >>> +    n = BITS_TO_LONGS(d->nr_stripes) * sizeof(unsigned long);
> >>> +    n_old = BITS_TO_LONGS(nr_stripes_old) * sizeof(unsigned long);
> >>> +    d->full_dirty_stripes = kvrealloc(d->full_dirty_stripes, n_old,
> >>> n, GFP_KERNEL);
> >>> +    if (!d->full_dirty_stripes)
> >>> +        goto unblock_and_exit;
> >>
> >> If kvrealloc() fails and NULL set to d->full_dirty_sripes, the previous
> >> array should be restored.
> >>
> >>> +
> >>> +    res = true;
> >>> +
> >>> +unblock_and_exit:
> >>> +    up_write(&dc->writeback_lock);
> >>> +    return res;
> >>> +}
> >>> +
> >>
> >> I didn't test the patch, from the first glance, I feel the failure
> >> handling should restore all previous values, otherwise the cache device
> >> may be in a non-consistent state when extend size fails.
> >>
> > Completely agree with you on this point. I changed it locally and, as
> > soon as we agree on the other points I'll send a newer version of the
> > patch.
> >>
> >>> struct async_reg_args {
> >>>     struct delayed_work reg_work;
> >>>     char *path;
> >>> @@ -2569,7 +2639,10 @@ static ssize_t register_bcache(struct kobject
> >>> *k, struct kobj_attribute *attr,
> >>>             mutex_lock(&bch_register_lock);
> >>>             if (lookup_bdev(strim(path), &dev) == 0 &&
> >>>                 bch_is_open(dev))
> >>> -                err = "device already registered";
> >>> +                if (bch_update_capacity(dev))
> >>> +                    err = "capacity changed";
> >>> +                else
> >>> +                    err = "device already registered";
> >>
> >>
> >> As I said, it should be a separated write-only sysfile under the cache
> >> device's directory.
> > Can I ask why you don't like the automatic resize way? Why should the
> > resize be manual?
>
> Most of system administrators don’t like such silently automatic things. They want to extend the size explicitly, especially when there is other dependences in their configurations.
>
What I was trying to say is that, in order to resize a block device, a
manual command should be executed. So, this is already a "non-silent"
automatic thing.
Moreover, if the block device has a FS on it, the FS needs to be
manually grown with some special utilities, e.g. xfs_growfs. So,
again, another non-silent automatic step. Don't you agree?
For example, to resize a qcow device attached to a VM I'm manually
doing a `virsh blockresize`. As soon as I issue that command, the
virtio_blk driver inside the VM detects the disk size change and calls
the `set_capacity_and_notify` function. Why then should bcache behave
differently?

If you're concerned that this can somehow break the
behaviour-compatibility with older versions of the driver, can we
protect this automatic discovery with an optional parameter? Will this
be an option you will take into account?

Thank you very much,
Andrea
>
> > Someone needs to trigger the block device resize, so shouldn't that be enough?
>
> Yes, an explicit write operation on a sysfs file to trigger the resize. Then we can permit people to do the size extend explicit and avoid to change current behavior.
>
> Thanks.
>
> Coly Li
>
> >
> > Andrea
> >>
> >>
> >>> else
> >>>                 err = "device busy";
> >>>             mutex_unlock(&bch_register_lock);
> >>> --
> >>> 2.37.3
> >>>
> >>
>