> On 5 Jul 2022, at 16:48, Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx> wrote:
>
> On Mon, Jul 4, 2022 at 5:29 PM Coly Li <colyli@xxxxxxx> wrote:
>>
>>
>>
>>> On 4 Jul 2022, at 23:13, Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx> wrote:
>>>
>>> Introducing a bcache control device (/dev/bcache_ctrl)
>>> that allows communicating with the driver from user space
>>> via IOCTL. The only IOCTL command currently implemented
>>> receives a struct cache_sb and uses it to register the
>>> backing device without any need to format it.
>>>
>>> Signed-off-by: Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx>
>>> ---
>>> Hi all,
>>> At Devo we started to think about using bcache in our production systems
>>> to boost performance. But, at the very beginning, we were faced with
>>> one annoying issue, at least for our use case: bcache needs the backing
>>> devices to be formatted before it is able to use them. What is really
>>> needed is just to wipe any FS signature off them. This is definitely
>>> not an option we will consider for our production systems, because we
>>> would need to move hundreds of terabytes of data.
>>>
>>> To circumvent the "formatting" problem, in the past weeks I worked on
>>> some modifications to the actual bcache module. In particular, I added
>>> a bcache control device (exported as /dev/bcache_ctrl) that allows
>>> communicating with the driver from user space via IOCTL. One of the
>>> IOCTL commands that I implemented receives a struct cache_sb and uses
>>> it to register the backing device. The modifications are really small
>>> and backward compatible. To then re-create the same configuration after
>>> every boot (because the backing devices no longer carry the bcache
>>> super block), I created a udev rule that invokes a Python script that
>>> re-creates the same scenario based on a YAML configuration file.
>>>
>>> I'm re-sending this patch without any confidentiality footer, sorry for that.
>>
>> Thanks for removing that confidentiality and legal statement; it was the
>> reason I was not able to reply to your email.
>>
> Thank you for your patience, and sorry for the newbie mistake.
>
>> Back to the patch: I don't support this idea. For the problem you are
>> solving, people have indeed used the device mapper linear target for many
>> years, and it works perfectly without any code modification.
>>
>> That is, create an 8K image and set it up as a loop device, then write a
>> dm table to combine it with the existing hard drive. When you then run
>> "bcache make -B <combined dm target>", you get a bcache device whose
>> first 8K lives in the loop device, with the existing super block of the
>> hard drive located at the expected offset.
>>
> We evaluated this option but we weren't satisfied with the outcome for
> at least two main reasons: complexity and operational drawbacks.
> On the complexity side: in production we use more than 20 HDDs that
> need to be cached. That means we would need to create 20+ headers,
> 20+ loop devices and 20+ dm-linear devices: basically, a 3x factor
> for each HDD that we want to cache. Moreover, we're currently
> using ephemeral disks as cache devices. In case of a machine reboot,
> ephemeral devices can get lost; at that point, there will be some
> trouble mounting the dm-linear bcache backing device because there
> will be no cache device.

OK, I get your point. It might make sense for your situation; I know some other people use the linear dm table to solve a similar situation, but it may not be perfect for you.
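
For reference, a minimal sketch of the dm-linear workaround described above, assuming an existing backing disk at /dev/sdb (the device names, loop device and image path are illustrative, not prescribed):

    # Create an 8 KiB header image and attach it as a loop device
    dd if=/dev/zero of=/var/lib/bcache/sdb-header.img bs=8192 count=1
    losetup /dev/loop0 /var/lib/bcache/sdb-header.img

    # Concatenate header + data disk with a dm-linear table
    # (dm table units are 512-byte sectors; 8 KiB = 16 sectors)
    dmsetup create bcache-sdb <<EOF
    0 16 linear /dev/loop0 0
    16 $(blockdev --getsz /dev/sdb) linear /dev/sdb 0
    EOF

    # Format only the combined device: the bcache super block lands in
    # the loop image, and the existing data on /dev/sdb sits at the
    # expected 8 KiB offset
    bcache make -B /dev/mapper/bcache-sdb

This keeps the contents of /dev/sdb untouched, at the cost of one extra loop device and one extra dm target per disk, which is exactly the 3x multiplication objected to above.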
This patch may work in your procedure, but the following things make me uncomfortable:

- Copying user space memory and directly using it as a complicated in-kernel memory object.
- A new IOCTL interface is added, even though all the rest of the interfaces are sysfs based.
- Doing things in kernel space while there is a solution in user space.

All the above opinions are quite personal to me; I don't say you are wrong or I am correct. If the patch works, that's cool and you can use it as you want, but I don't support it going upstream.

>
> For the operational drawbacks: from time to time, we exploit the
> online filesystem resize capability of XFS to increase the volume
> size. This would be at least tedious, if not nearly impossible, if the
> volume were mapped inside a dm-linear device.

Currently bcache doesn't support cache or backing device resize. I don't see the connection between the above statement and feeding an in-memory super block via an IOCTL command.

>> It is unnecessary to create a new IOCTL interface, and I feel the way it
>> works is really unconvincing because of the potential security risk.
>>
> Which potential security risks concern you?
>

Currently we don't check all members of struct cache_sb_disk; what we do is simply trust bcache-tools. Creating a new IOCTL interface that sends user space memory into kernel space to be used as a super block requires quite a lot of checking code to make sure it won't panic the kernel. That is possible, but it is not worth adding so much code for this purpose, especially adding it to upstream code. I am not able to provide an explicit example of a security risk; the newly added interface just does not make me confident. (A sketch contrasting the two registration paths follows this message.)

Thanks.

Coly Li
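
For contrast with the IOCTL approach, a rough sketch of the existing sysfs-based registration path referred to above, where the super block makes a round trip through the media (reusing the illustrative /dev/mapper/bcache-sdb path from the earlier sketch):

    # bcache-tools writes a checksummed super block to the device...
    bcache make -B /dev/mapper/bcache-sdb

    # ...and the kernel re-reads and validates it (offset, magic and
    # checksum) when the device is registered through sysfs
    echo /dev/mapper/bcache-sdb > /sys/fs/bcache/register

    # A struct cache_sb handed directly to an IOCTL skips this on-media
    # round trip, so the handler would have to range-check every field
    # itself to avoid crashing the kernel on a malformed value.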