Re: [RFC] Live resize of backing device

Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx> · Fri, 9 Dec 2022 09:57:45 +0100

Hi Coly,
just a kind reminder for this patch.

Thank you very much,
Andrea

On Mon, Sep 19, 2022 at 2:17 PM Coly Li <colyli@xxxxxxx> wrote:
>
>
>
> > 2022年9月19日 19:42，Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx> 写道：
> >
> > Hi Coly,
> > have you had any time to take a look at this? Do you prefer if I send the patch as a separate thread?
> >
> > Thank you very much,
> > Andrea
>
>
> Yes, it is on my queue, just after I finish my tasks on hand, I will take a look on it.
>
> Thanks.
>
> Coly Li
>
>
> >
> > On 8/9/22 10:32, Andrea Tomassetti wrote:
> >> From 59787372cf21af0b79e895578ae05b6586dfeb09 Mon Sep 17 00:00:00 2001
> >> From: Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx>
> >> Date: Thu, 8 Sep 2022 09:47:55 +0200
> >> Subject: [PATCH] bcache: Add support for live resize of backing devices
> >> Signed-off-by: Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx>
> >> ---
> >> Hi Coly,
> >> Here is the first version of the patch. There are some points I noted down
> >> that I would like to discuss with you:
> >>  - I found it pretty convenient to hook the call of the new added function
> >>    inside the `register_bcache`. In fact, every time (at least from my
> >>    understandings) a disk changes size, it will trigger a new probe and,
> >>    thus, `register_bcache` will be triggered. The only inconvenient
> >>    is that, in case of success, the function will output
> >>    `error: capacity changed` even if it's not really an error.
> >>  - I'm using `kvrealloc`, introduced in kernel version 5.15, to resize
> >>    `stripe_sectors_dirty` and `full_dirty_stripes`. It shouldn't be a
> >>    problem, right?
> >>  - There is some reused code between this new function and
> >>    `bcache_device_init`. Maybe I can move `const size_t max_stripes` to
> >>    a broader scope or make it a macro, what do you think?
> >> Thank you very much,
> >> Andrea
> >>  drivers/md/bcache/super.c | 75 ++++++++++++++++++++++++++++++++++++++-
> >>  1 file changed, 74 insertions(+), 1 deletion(-)
> >> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> >> index ba3909bb6bea..9a77caf2a18f 100644
> >> --- a/drivers/md/bcache/super.c
> >> +++ b/drivers/md/bcache/super.c
> >> @@ -2443,6 +2443,76 @@ static bool bch_is_open(dev_t dev)
> >>      return bch_is_open_cache(dev) || bch_is_open_backing(dev);
> >>  }
> >> +static bool bch_update_capacity(dev_t dev)
> >> +{
> >> +    const size_t max_stripes = min_t(size_t, INT_MAX,
> >> +                     SIZE_MAX / sizeof(atomic_t));
> >> +
> >> +    uint64_t n, n_old;
> >> +    int nr_stripes_old;
> >> +    bool res = false;
> >> +
> >> +    struct bcache_device *d;
> >> +    struct cache_set *c, *tc;
> >> +    struct cached_dev *dcp, *t, *dc = NULL;
> >> +
> >> +    uint64_t parent_nr_sectors;
> >> +
> >> +    list_for_each_entry_safe(c, tc, &bch_cache_sets, list)
> >> +        list_for_each_entry_safe(dcp, t, &c->cached_devs, list)
> >> +            if (dcp->bdev->bd_dev == dev) {
> >> +                dc = dcp;
> >> +                goto dc_found;
> >> +            }
> >> +
> >> +dc_found:
> >> +    if (!dc)
> >> +        return false;
> >> +
> >> +    parent_nr_sectors = bdev_nr_sectors(dc->bdev) - dc->sb.data_offset;
> >> +
> >> +    if (parent_nr_sectors == bdev_nr_sectors(dc->disk.disk->part0))
> >> +        return false;
> >> +
> >> +    if (!set_capacity_and_notify(dc->disk.disk, parent_nr_sectors))
> >> +        return false;
> >> +
> >> +    d = &dc->disk;
> >> +
> >> +    /* Force cached device sectors re-calc */
> >> +    calc_cached_dev_sectors(d->c);
> >> +
> >> +    /* Block writeback thread */
> >> +    down_write(&dc->writeback_lock);
> >> +    nr_stripes_old = d->nr_stripes;
> >> +    n = DIV_ROUND_UP_ULL(parent_nr_sectors, d->stripe_size);
> >> +    if (!n || n > max_stripes) {
> >> +        pr_err("nr_stripes too large or invalid: %llu (start sector beyond end of disk?)\n",
> >> +            n);
> >> +        goto unblock_and_exit;
> >> +    }
> >> +    d->nr_stripes = n;
> >> +
> >> +    n = d->nr_stripes * sizeof(atomic_t);
> >> +    n_old = nr_stripes_old * sizeof(atomic_t);
> >> +    d->stripe_sectors_dirty = kvrealloc(d->stripe_sectors_dirty, n_old,
> >> +        n, GFP_KERNEL);
> >> +    if (!d->stripe_sectors_dirty)
> >> +        goto unblock_and_exit;
> >> +
> >> +    n = BITS_TO_LONGS(d->nr_stripes) * sizeof(unsigned long);
> >> +    n_old = BITS_TO_LONGS(nr_stripes_old) * sizeof(unsigned long);
> >> +    d->full_dirty_stripes = kvrealloc(d->full_dirty_stripes, n_old, n, GFP_KERNEL);
> >> +    if (!d->full_dirty_stripes)
> >> +        goto unblock_and_exit;
> >> +
> >> +    res = true;
> >> +
> >> +unblock_and_exit:
> >> +    up_write(&dc->writeback_lock);
> >> +    return res;
> >> +}
> >> +
> >>  struct async_reg_args {
> >>      struct delayed_work reg_work;
> >>      char *path;
> >> @@ -2569,7 +2639,10 @@ static ssize_t register_bcache(struct kobject *k, struct kobj_attribute *attr,
> >>              mutex_lock(&bch_register_lock);
> >>              if (lookup_bdev(strim(path), &dev) == 0 &&
> >>                  bch_is_open(dev))
> >> -                err = "device already registered";
> >> +                if (bch_update_capacity(dev))
> >> +                    err = "capacity changed";
> >> +                else
> >> +                    err = "device already registered";
> >>              else
> >>                  err = "device busy";
> >>              mutex_unlock(&bch_register_lock);
> >> --
> >> 2.37.3
> >> On 6/9/22 15:22, Andrea Tomassetti wrote:
> >>> Hi Coly,
> >>> I have finally some time to work on the patch. I already have a first
> >>> prototype that seems to work but, before sending it, I would like to
> >>> ask you two questions:
> >>>    1. Inspecting the code, I found that the only lock mechanism is the
> >>> writeback_lock semaphore. Am I correct?
> >>>    2. How can I effectively test my patch? So far I'm doing something like this:
> >>>       a. make-bcache --writeback -B /dev/vdb -C /dev/vdc
> >>>       b. mkfs.xfs /dev/bcache0
> >>>       c. dd if=/dev/random of=/mnt/random bs=1M count=1000
> >>>       d. md5sum /mnt/random | tee /mnt/random.md5
> >>>       e. live resize the disk and repeat c. and d.
> >>>       f. umount/reboot/remount and check that the md5 hashes are correct
> >>>
> >>> Any suggestions?
> >>>
> >>> Thank you very much,
> >>> Andrea
> >>>
> >>> On Fri, Aug 5, 2022 at 9:38 PM Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>> On Wed, 3 Aug 2022, Andrea Tomassetti wrote:
> >>>>> Hi Coly,
> >>>>> In one of our previous emails you said that
> >>>>>> Currently bcache doesn’t support cache or backing device resize
> >>>>>
> >>>>> I was investigating this point and I actually found a solution. I
> >>>>> briefly tested it and it seems to work fine.
> >>>>> Basically what I'm doing is:
> >>>>>    1. Check if there's any discrepancy between the nr of sectors
> >>>>> reported by the bcache backing device (holder) and the nr of sectors
> >>>>> reported by its parent (slave).
> >>>>>    2. If the number of sectors of the two devices are not the same,
> >>>>> then call set_capacity_and_notify on the bcache device.
> >>>>>    3. From user space, depending on the fs used, grow the fs with some
> >>>>> utility (e.g. xfs_growfs)
> >>>>>
> >>>>> This works without any need of unmounting the mounted fs nor stopping
> >>>>> the bcache backing device.
> >>>>
> >>>> Well done! +1, would love to see a patch for this!
> >>>>
> >>>>
> >>>>> So my question is: am I missing something? Can this live resize cause
> >>>>> some problems (e.g. data loss)? Would it be useful if I send a patch on
> >>>>> this?
> >>>>
> >>>> A while a go we looked into doing this.  Here is the summary of our
> >>>> findings, not sure if there are any other considerations:
> >>>>
> >>>>    1. Create a sysfs file like /sys/block/bcache0/bcache/resize to trigger
> >>>>       resize on echo 1 >.
> >>>>    2. Refactor the set_capacity() bits from  bcache_device_init() into its
> >>>>       own function.
> >>>>    3. Put locks around bcache_device.full_dirty_stripes and
> >>>>       bcache_device.stripe_sectors_dirty.  Re-alloc+copy+free and zero the
> >>>>       new bytes at the end.  Grep where bcache_device.full_dirty_stripes is
> >>>>       used and make sure it is locked appropriately, probably in the
> >>>>       writeback thread, maybe other places too.
> >>>>
> >>>> The cachedev's don't know anything about the bdev size, so (according to
> >>>> Kent) they will "just work" by referencing new offsets that were
> >>>> previously beyond the disk. (This is basically the same as resizing the
> >>>> bdev and then unregister/re-register which is how we resize bdevs now.)
> >>>>
> >>>> As for resizing a cachedev, I've not looked at all---not sure about that
> >>>> one.  We always detach, resize, make-bcache and re-attach the new cache.
> >>>> Maybe it is similarly simple, but haven't looked.
> >>>>
> >>>>
> >>>> --
> >>>> Eric Wheeler
> >>>>
> >>>>
> >>>>
> >>>>>
> >>>>> Kind regards,
> >>>>> Andrea
> >>>>>
>