Re: [RFC] Live resize of backing device

Coly Li <colyli@xxxxxxx> · Fri, 30 Dec 2022 18:40:47 +0800

On 9/8/22 4:32 PM, Andrea Tomassetti wrote:
From 59787372cf21af0b79e895578ae05b6586dfeb09 Mon Sep 17 00:00:00 2001
From: Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx>
Date: Thu, 8 Sep 2022 09:47:55 +0200
Subject: [PATCH] bcache: Add support for live resize of backing devices

Signed-off-by: Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx>

Hi Andrea,

I am just recovered from Omicron, and able to reply email. Let me place 
my comments inline.


---
Hi Coly,
Here is the first version of the patch. There are some points I noted 
down
that I would like to discuss with you:
 - I found it pretty convenient to hook the call of the new added 
function
   inside the `register_bcache`. In fact, every time (at least from my
   understandings) a disk changes size, it will trigger a new probe and,
   thus, `register_bcache` will be triggered. The only inconvenient
   is that, in case of success, the function will output

The resize should be triggered manually, and not to do it automatically.

You may create a sysfs file under the cached device's directory, name it 
as "extend_size" or something else you think better.

Then the sysadmin may extend the cached device size explicitly on a 
predictable time.

`error: capacity changed` even if it's not really an error.
 - I'm using `kvrealloc`, introduced in kernel version 5.15, to resize
   `stripe_sectors_dirty` and `full_dirty_stripes`. It shouldn't be a
   problem, right?
 - There is some reused code between this new function and
   `bcache_device_init`. Maybe I can move `const size_t max_stripes` to
   a broader scope or make it a macro, what do you think?

Thank you very much,
Andrea

 drivers/md/bcache/super.c | 75 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 74 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index ba3909bb6bea..9a77caf2a18f 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2443,6 +2443,76 @@ static bool bch_is_open(dev_t dev)
     return bch_is_open_cache(dev) || bch_is_open_backing(dev);
 }

+static bool bch_update_capacity(dev_t dev)
+{
+    const size_t max_stripes = min_t(size_t, INT_MAX,
+                     SIZE_MAX / sizeof(atomic_t));
+
+    uint64_t n, n_old;
+    int nr_stripes_old;
+    bool res = false;
+
+    struct bcache_device *d;
+    struct cache_set *c, *tc;
+    struct cached_dev *dcp, *t, *dc = NULL;
+
+    uint64_t parent_nr_sectors;
+
+    list_for_each_entry_safe(c, tc, &bch_cache_sets, list)
+        list_for_each_entry_safe(dcp, t, &c->cached_devs, list)
+            if (dcp->bdev->bd_dev == dev) {
+                dc = dcp;
+                goto dc_found;
+            }
+
+dc_found:
+    if (!dc)
+        return false;
+
+    parent_nr_sectors = bdev_nr_sectors(dc->bdev) - dc->sb.data_offset;
+
+    if (parent_nr_sectors == bdev_nr_sectors(dc->disk.disk->part0))
+        return false;
+

The above code only handles whole disk using as cached device. If a 
partition of a hard drive is used as cache device, and there are other 
data after this partition, such condition should be handled as well. So 
far I am fine to only extend size when using the whole hard drive as 
cached device, but more code is necessary to check and only permits 
size-extend for such condition.

+    if (!set_capacity_and_notify(dc->disk.disk, parent_nr_sectors))
+        return false;

The above code should be done when all rested are set.


+
+    d = &dc->disk;
+
+    /* Force cached device sectors re-calc */
+    calc_cached_dev_sectors(d->c);

Here c->cached_dev_sectors might be changed, if any of the following 
steps fails, it should be restored to previous value.


+
+    /* Block writeback thread */
+    down_write(&dc->writeback_lock);
+    nr_stripes_old = d->nr_stripes;
+    n = DIV_ROUND_UP_ULL(parent_nr_sectors, d->stripe_size);
+    if (!n || n > max_stripes) {
+        pr_err("nr_stripes too large or invalid: %llu (start sector 
beyond end of disk?)\n",
+            n);
+        goto unblock_and_exit;
+    }
+    d->nr_stripes = n;
+
+    n = d->nr_stripes * sizeof(atomic_t);
+    n_old = nr_stripes_old * sizeof(atomic_t);
+    d->stripe_sectors_dirty = kvrealloc(d->stripe_sectors_dirty, n_old,
+        n, GFP_KERNEL);
+    if (!d->stripe_sectors_dirty)
+        goto unblock_and_exit;
+
+    n = BITS_TO_LONGS(d->nr_stripes) * sizeof(unsigned long);
+    n_old = BITS_TO_LONGS(nr_stripes_old) * sizeof(unsigned long);
+    d->full_dirty_stripes = kvrealloc(d->full_dirty_stripes, n_old, 
n, GFP_KERNEL);
+    if (!d->full_dirty_stripes)
+        goto unblock_and_exit;

If kvrealloc() fails and NULL set to d->full_dirty_sripes, the previous 
array should be restored.

+
+    res = true;
+
+unblock_and_exit:
+    up_write(&dc->writeback_lock);
+    return res;
+}
+

I didn't test the patch, from the first glance, I feel the failure 
handling should restore all previous values, otherwise the cache device 
may be in a non-consistent state when extend size fails.


 struct async_reg_args {
     struct delayed_work reg_work;
     char *path;
@@ -2569,7 +2639,10 @@ static ssize_t register_bcache(struct kobject 
*k, struct kobj_attribute *attr,
             mutex_lock(&bch_register_lock);
             if (lookup_bdev(strim(path), &dev) == 0 &&
                 bch_is_open(dev))
-                err = "device already registered";
+                if (bch_update_capacity(dev))
+                    err = "capacity changed";
+                else
+                    err = "device already registered";


As I said, it should be a separated write-only sysfile under the cache 
device's directory.


else
                 err = "device busy";
             mutex_unlock(&bch_register_lock);
--
2.37.3