Re: [PATCH] bcache: Use bcache without formatting existing device

On Tue, 5 Jul 2022, Coly Li wrote:
> > On 5 Jul 2022, at 16:48, Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx> wrote:
> > On Mon, Jul 4, 2022 at 5:29 PM Coly Li <colyli@xxxxxxx> wrote:
> >>> On 4 Jul 2022, at 23:13, Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx> wrote:
> >>> Introduce a bcache control device (/dev/bcache_ctrl) that allows 
> >>> communicating with the driver from user space via IOCTL. The only 
> >>> IOCTL command currently implemented receives a struct cache_sb and 
> >>> uses it to register the backing device without any need to format 
> >>> it.
> >>>
> >> Back to the patch, I don’t support this idea. For the problem you are 
> >> solving, people have indeed used the device mapper linear target for 
> >> many years, and it works perfectly without any code modification.
> >> 
> >> That is, create an 8K image and set it up as a loop device, then 
> >> write a dm table to combine it with the existing hard drive. Then you 
> >> run “bcache make -B <combined dm target>”, and you get a bcache 
> >> device whose first 8K (the super block) lives in the loop device, 
> >> with the existing content of the hard drive located at the expected 
> >> offset.
> >> 
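For anyone following along, my reading of that workaround is roughly the 
following (device names and sizes are illustrative, and the dm table is 
typed from memory):

  # 8K image to hold the bcache super block
  dd if=/dev/zero of=/var/tmp/sb.img bs=4k count=2
  LOOP=$(losetup -f --show /var/tmp/sb.img)
  SECTORS=$(blockdev --getsz /dev/sdb)

  # dm-linear table: first 16 sectors (8K) from the loop device, then
  # the whole existing drive starting at the expected 8K offset
  printf '0 16 linear %s 0\n16 %s linear /dev/sdb 0\n' "$LOOP" "$SECTORS" \
          | dmsetup create cached-sdb

  bcache make -B /dev/mapper/cached-sdb
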
> > We evaluated this option but we weren't satisfied with the outcome for 
> > at least two main reasons: complexity and operational drawbacks. On 
> > the complexity side: in production we use more than 20 HDDs that need 
> > to be cached. That means we need to create 20+ headers for them, 20+ 
> > loop devices and 20+ dm-linear devices. So, basically, there's a 3x 
> > factor for each HDD that we want to cache. Moreover, we're currently 
> > using ephemeral disks as cache devices. In case of a machine reboot, 
> > ephemeral devices can get lost; at that point, there will be some 
> > trouble mounting the dm-linear bcache backing device because there 
> > will be no cache device. 
>
> OK, I get your point. It might make sense for your situation; I know 
> some other people use a linear dm-table to solve similar situations, 
> but that may not be perfect for you. This patch may work in your 
> procedure, but the following things make me uncomfortable:

Coly is right: there are some issues to address.  

Coly, I do not wish to contradict you, only to add that we have had times 
when this would be useful. I like the idea, particularly because avoiding 
a dm-linear layer atop the bdev shortens the IO codepath.  Maybe there is 
an implementation that would fit everyone's needs.

For us, such a superblock-less registration could resolve two issues:

1. Most commonly we wish to add bcache to an existing device without
   re-formatting and without adding the dm-linear complexity.

2. Relatedly, IMHO, I've never liked the default data offset of 8k, 
   because we like to align our bdevs to the RAID stride width, and the 
   bdev array is rarely aligned at 8k (usually 64k or 256k for hardware 
   RAID).  If we forget to pass --data-offset at make-bcache time then we 
   are stuck with an 8k misalignment for the life of the device.

For #2 we usually reformat the volume to avoid dm-linear complexity (and 
in both cases we have wanted writeback cache support).  `dd`-ing ~30TB of 
bdev on spinning disks takes a while, and we have to find temporary disk 
space to move that data out and back in.
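
(As an aside, a hedged example of what we try to remember to do, assuming 
I have the units right and --data-offset takes 512-byte sectors:

  # align the bcache data offset to a 256k hardware RAID stride
  make-bcache -B --data-offset 512 /dev/sdc

Forgetting that one flag is what costs us the reformat later.)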

> - Copying user space memory and directly using it as a complicated in-kernel memory object.

In the ioctl changes for bch_write_bdev_super() there does not appear to 
be a way to handle attaching caches to the running bcache.  For example:

1. Add a cacheless bcache0 volume via ioctl
2. Attach a writeback cache, write some data.
3. Unregister everything
4. Re-register the _dirty_ bdev from #1

In this scenario bcache will start "running" immediately because it 
cannot update its cset.uuid (as reported by bcache-super-show), which I 
believe is stored in cache_sb.set_uuid.  

I suppose in step #2 you could update your userspace state with the 
cset.uuid so your userspace ioctl register tool would put the cset.uuid in 
place, but this is fragile without good userspace integration.
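
A sketch of what that integration might capture at attach time (the state 
file path is made up, and the awk match assumes bcache-super-show's 
current output format):

  bcache-super-show /dev/nvme0n1 | awk '/cset.uuid/ {print $2}' \
          > /var/lib/bcache/sdb.cset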

I could imagine something like /etc/bcachetab (like crypttab or 
mdadm.conf) that maps cset UUID's to bdev UUIDs.  Then your userspace 
ioctl tool could be included in a udev script to auto-load volumes as they 
become available.
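
Something like this, all hypothetical of course (the file format, tool 
name, and rule are invented for illustration):

  # /etc/bcachetab: <cset.uuid>  <bdev uuid>
  f82e5f05-0c51-4fb9-ac3a-01e2f1e54f2d  0781d1c4-51b9-4d3e-a61e-7bbd241f7c88

  # /etc/udev/rules.d/69-bcachetab.rules
  ACTION=="add", SUBSYSTEM=="block", RUN+="/usr/sbin/bcachetab-register %N"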

You want to make sure that when a dirty writeback-cached bdev is 
registered that it does not activate until either:

  1. The cache is attached
  2. echo 1 > /sys/block/bcache0/bcache/running
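
For example (sysfs paths from memory, so please verify; `state` is the 
existing attribute that reports clean/dirty/no cache):

  cat /sys/block/bcache0/bcache/state          # stays inactive while dirty
  echo 1 > /sys/block/bcache0/bcache/running   # explicit opt-in to run
                                               # without the dirty cache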

> - A new IOCTL interface is added, even though all the rest of the interfaces are sysfs based.

True.  Perhaps something like this would be better, and it would avoid 
sharing a userspace page for blindly populating `struct cache_sb`, too:

  echo '/dev/bdev [bdev_uuid] [/dev/cachedev|cset_uuid] [options]' > \
          /sys/fs/bcache/register_raw

Because of the writeback issue above, the cache and bdev either need to be 
registered simultaneously or the cset uuid needs to be specified.
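
For instance, both of these would be unambiguous about the cset 
(hypothetical, since register_raw does not exist today; '-' is a 
placeholder meaning "detect the bdev uuid"):

  echo '/dev/sdb - /dev/nvme0n1 writeback' > /sys/fs/bcache/register_raw
  echo '/dev/sdb - f82e5f05-0c51-4fb9-ac3a-01e2f1e54f2d' > \
          /sys/fs/bcache/register_raw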

> - Doing things in kernel space while there is a solution in user space.
> 
> All the above opinions are quite personal to me; I don’t say you are 
> wrong or that I am correct. If the patch works, that’s cool and you can 
> use it as you want, but I don’t support it going upstream.

An alternate implementation might be to create a dm-bcache target.  The 
core bcache code could be left alone except for a few EXPORT_SYMBOL()s so 
dm-bcache can reach the bcache registration bits.  

This would:
  * Minimally impact the bcache code base
  * Solve the blind-populating of `struct cache_sb` issue in the same way 
    as `register_raw` could work above.
  * Create a nicely segmented codebase (dm target) to upstream separately 
    through the DM team.
  * Could be maintained cleanly out-of-tree because the bcache 
    registration interfaces change very infrequently.
  * bdev resize could be done with a `dmsetup replace` but you'll need to 
    add resize support as below.
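
Usage might look something like this (the target name and argument order 
are made up; only dmsetup itself is real):

  SECTORS=$(blockdev --getsz /dev/sdb)
  dmsetup create bcached-sdb \
          --table "0 $SECTORS bcache /dev/sdb /dev/nvme0n1 writeback"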

> > For the operational drawbacks: from time to time, we exploit the
> > online filesystem resize capability of XFS to increase the volume
> > size. This would be at least tedious, if not nearly impossible, if the
> > volume is mapped inside a dm-linear.
>
> Currently bcache doesn’t support cache or backing device resize. I don’t 
> see the connection between the above statement and feeding an in-memory 
> super block via an IOCTL command.

Resize is perhaps unrelated, so if you decide to tackle bdev or cachedev 
resize then please start a new list thread.  Briefly: supporting bdev 
resize is probably easy.  I've looked through the code a few times with 
this in mind but haven't had time.

Here's a summary; I'm not sure if there are any other 
considerations:

  1. Create a sysfs file like /sys/block/bcache0/bcache/resize to trigger 
     resize on echo 1 >.
  2. Refactor the set_capacity() bits from bcache_device_init() into 
     their own function.
  3. Put locks around bcache_device.full_dirty_stripes and 
     bcache_device.stripe_sectors_dirty.  Re-alloc+copy+free and zero the 
     new bytes at the end.

The cachedevs don't know anything about the bdev size, so (according to 
Kent) they will "just work" by referencing new offsets that didn't exist 
before when IOs come through.  (This is basically the same as resizing the 
bdev and then unregistering/re-registering, which is how we resize bdevs 
now.)
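
That unregister/re-register flow, roughly (device names are illustrative, 
and flush or detach the writeback cache first):

  echo 1 > /sys/block/bcache0/bcache/stop          # unregister bcache0
  lvextend -L +1T /dev/vg0/bigdisk                 # grow the underlying bdev
  echo /dev/vg0/bigdisk > /sys/fs/bcache/register  # re-register at new size
  xfs_growfs /srv/bigdisk                          # then grow the fs online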

As for resizing a cachedev, I've not looked into that at all.  We always 
detach, resize, make-bcache, and re-attach the new cache.  Maybe it is 
similarly simple, but I haven't checked.
 
> >> It is unnecessary to create a new IOCTL interface, and I feel the way 
> >> it works is really unconvincing with regard to potential security 
> >> risks.
> >> 
> > Which potential security risks concern you?
> > 
> 
> Currently we don’t check all members of struct cache_sb_disk; what we do 
> is simply trust bcache-tools. Creating a new IOCTL interface that sends 
> user space memory into kernel space as a super block needs quite a lot 
> of checking code to make sure it won’t panic the kernel. It is possible, 
> but it is not worth adding so much code for this purpose, especially 
> adding it to upstream code.
> 
> I am not able to provide an explicit example of a security risk; the 
> newly added interface just makes me not so confident.

Maintaining a blind structure population from a userspace page would be 
difficult as well: even if the kernel validates _everything_ in 
cache_sb today, anyone extending `struct cache_sb` later needs to 
remember to add checks for the new fields. Failing to enforce validation 
inside the kernel could lead to kernel crashes or data corruption 
triggered from userspace, which is of course never good.

We always assume that, somehow, someone could leverage an insecure IOCTL 
call and crash the OS when they shouldn't be allowed to (even if they are 
not root, for example via sudo).  This is a security issue from an 
assurance and integrity point of view, even if there is neither an obvious 
privilege escalation nor a privacy concern.

-Eric

 
> Thanks.
> 
> Coly Li
> 
