Re: [RFC] Live resize of backing device

Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx> · Wed, 11 Jan 2023 17:01:12 +0100

Hi Coly,
thank you for taking the time to reply, I really hope you are feeling
better now.

On Fri, Dec 30, 2022 at 11:41 AM Coly Li <colyli@xxxxxxx> wrote:
>
> On 9/8/22 4:32 PM, Andrea Tomassetti wrote:
> > From 59787372cf21af0b79e895578ae05b6586dfeb09 Mon Sep 17 00:00:00 2001
> > From: Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx>
> > Date: Thu, 8 Sep 2022 09:47:55 +0200
> > Subject: [PATCH] bcache: Add support for live resize of backing devices
> >
> > Signed-off-by: Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx>
>
> Hi Andrea,
>
> I am just recovered from Omicron, and able to reply email. Let me place
> my comments inline.
>
>
> > ---
> > Hi Coly,
> > Here is the first version of the patch. There are some points I noted
> > down
> > that I would like to discuss with you:
> >  - I found it pretty convenient to hook the call of the new added
> > function
> >    inside the `register_bcache`. In fact, every time (at least from my
> >    understandings) a disk changes size, it will trigger a new probe and,
> >    thus, `register_bcache` will be triggered. The only inconvenient
> >    is that, in case of success, the function will output
>
> The resize should be triggered manually, and not to do it automatically.
>
> You may create a sysfs file under the cached device's directory, name it
> as "extend_size" or something else you think better.
>
> Then the sysadmin may extend the cached device size explicitly on a
> predictable time.
>
> > `error: capacity changed` even if it's not really an error.
> >  - I'm using `kvrealloc`, introduced in kernel version 5.15, to resize
> >    `stripe_sectors_dirty` and `full_dirty_stripes`. It shouldn't be a
> >    problem, right?
> >  - There is some reused code between this new function and
> >    `bcache_device_init`. Maybe I can move `const size_t max_stripes` to
> >    a broader scope or make it a macro, what do you think?
> >
> > Thank you very much,
> > Andrea
> >
> >  drivers/md/bcache/super.c | 75 ++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 74 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> > index ba3909bb6bea..9a77caf2a18f 100644
> > --- a/drivers/md/bcache/super.c
> > +++ b/drivers/md/bcache/super.c
> > @@ -2443,6 +2443,76 @@ static bool bch_is_open(dev_t dev)
> >      return bch_is_open_cache(dev) || bch_is_open_backing(dev);
> >  }
> >
> > +static bool bch_update_capacity(dev_t dev)
> > +{
> > +    const size_t max_stripes = min_t(size_t, INT_MAX,
> > +                     SIZE_MAX / sizeof(atomic_t));
> > +
> > +    uint64_t n, n_old;
> > +    int nr_stripes_old;
> > +    bool res = false;
> > +
> > +    struct bcache_device *d;
> > +    struct cache_set *c, *tc;
> > +    struct cached_dev *dcp, *t, *dc = NULL;
> > +
> > +    uint64_t parent_nr_sectors;
> > +
> > +    list_for_each_entry_safe(c, tc, &bch_cache_sets, list)
> > +        list_for_each_entry_safe(dcp, t, &c->cached_devs, list)
> > +            if (dcp->bdev->bd_dev == dev) {
> > +                dc = dcp;
> > +                goto dc_found;
> > +            }
> > +
> > +dc_found:
> > +    if (!dc)
> > +        return false;
> > +
> > +    parent_nr_sectors = bdev_nr_sectors(dc->bdev) - dc->sb.data_offset;
> > +
> > +    if (parent_nr_sectors == bdev_nr_sectors(dc->disk.disk->part0))
> > +        return false;
> > +
>
> The above code only handles whole disk using as cached device. If a
> partition of a hard drive is used as cache device, and there are other
> data after this partition, such condition should be handled as well. So
> far I am fine to only extend size when using the whole hard drive as
> cached device, but more code is necessary to check and only permits
> size-extend for such condition.
I don't understand if there's a misalignment here so let me be more
clear: this patch is intended to support the live resize of *backing
devices*, is this what you mean with "cached device"?
When I was working on the patch I didn't consider the possibility of
live resizing the cache devices, but it should be trivial.
So, as far as I understand, a partition cannot be used as a backing
device, correct? The whole disk should be used as a backing device,
that's why I'm checking and that's why this check should be correct.
Am I wrong?

>
> > +    if (!set_capacity_and_notify(dc->disk.disk, parent_nr_sectors))
> > +        return false;
>
> The above code should be done when all rested are set.
>
>
> > +
> > +    d = &dc->disk;
> > +
> > +    /* Force cached device sectors re-calc */
> > +    calc_cached_dev_sectors(d->c);
>
> Here c->cached_dev_sectors might be changed, if any of the following
> steps fails, it should be restored to previous value.
>
>
> > +
> > +    /* Block writeback thread */
> > +    down_write(&dc->writeback_lock);
> > +    nr_stripes_old = d->nr_stripes;
> > +    n = DIV_ROUND_UP_ULL(parent_nr_sectors, d->stripe_size);
> > +    if (!n || n > max_stripes) {
> > +        pr_err("nr_stripes too large or invalid: %llu (start sector
> > beyond end of disk?)\n",
> > +            n);
> > +        goto unblock_and_exit;
> > +    }
> > +    d->nr_stripes = n;
> > +
> > +    n = d->nr_stripes * sizeof(atomic_t);
> > +    n_old = nr_stripes_old * sizeof(atomic_t);
> > +    d->stripe_sectors_dirty = kvrealloc(d->stripe_sectors_dirty, n_old,
> > +        n, GFP_KERNEL);
> > +    if (!d->stripe_sectors_dirty)
> > +        goto unblock_and_exit;
> > +
> > +    n = BITS_TO_LONGS(d->nr_stripes) * sizeof(unsigned long);
> > +    n_old = BITS_TO_LONGS(nr_stripes_old) * sizeof(unsigned long);
> > +    d->full_dirty_stripes = kvrealloc(d->full_dirty_stripes, n_old,
> > n, GFP_KERNEL);
> > +    if (!d->full_dirty_stripes)
> > +        goto unblock_and_exit;
>
> If kvrealloc() fails and NULL set to d->full_dirty_sripes, the previous
> array should be restored.
>
> > +
> > +    res = true;
> > +
> > +unblock_and_exit:
> > +    up_write(&dc->writeback_lock);
> > +    return res;
> > +}
> > +
>
> I didn't test the patch, from the first glance, I feel the failure
> handling should restore all previous values, otherwise the cache device
> may be in a non-consistent state when extend size fails.
>
Completely agree with you on this point. I changed it locally and, as
soon as we agree on the other points I'll send a newer version of the
patch.
>
> >  struct async_reg_args {
> >      struct delayed_work reg_work;
> >      char *path;
> > @@ -2569,7 +2639,10 @@ static ssize_t register_bcache(struct kobject
> > *k, struct kobj_attribute *attr,
> >              mutex_lock(&bch_register_lock);
> >              if (lookup_bdev(strim(path), &dev) == 0 &&
> >                  bch_is_open(dev))
> > -                err = "device already registered";
> > +                if (bch_update_capacity(dev))
> > +                    err = "capacity changed";
> > +                else
> > +                    err = "device already registered";
>
>
> As I said, it should be a separated write-only sysfile under the cache
> device's directory.
Can I ask why you don't like the automatic resize way? Why should the
resize be manual?
Someone needs to trigger the block device resize, so shouldn't that be enough?

Andrea
>
>
> > else
> >                  err = "device busy";
> >              mutex_unlock(&bch_register_lock);
> > --
> > 2.37.3
> >
>