Re: [PATCH] bcache: Use bcache without formatting existing device

On Wed, Jul 20, 2022 at 3:31 PM Coly Li <colyli@xxxxxxx> wrote:
>
>
>
> > 2022年7月20日 16:06,Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx> 写道:
> >
> > On Tue, Jul 12, 2022 at 10:29 PM Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx> wrote:
> >>
> >> On Thu, 7 Jul 2022, Andrea Tomassetti wrote:
> >>> On Wed, Jul 6, 2022 at 12:03 AM Eric Wheeler <bcache@xxxxxxxxxxxxxxxxxx> wrote:
> >>>> On Tue, 5 Jul 2022, Coly Li wrote:
> >>>>>> 2022年7月5日 16:48,Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx> 写道:
> >>>>>> On Mon, Jul 4, 2022 at 5:29 PM Coly Li <colyli@xxxxxxx> wrote:
> >>>>>>>> 2022年7月4日 23:13,Andrea Tomassetti <andrea.tomassetti-opensource@xxxxxxxx> 写道:
> >>>>>>>> Introducing a bcache control device (/dev/bcache_ctrl) that allows
> >>>>>>>> communicating with the driver from user space via IOCTL. The only
> >>>>>>>> IOCTL command currently implemented receives a struct cache_sb and
> >>>>>>>> uses it to register the backing device without any need to
> >>>>>>>> format it.
> >>>>>>>>
> >>>>>>> Back to the patch, I don’t support this idea. For the problem you are
> >>>>>>> solving, people have indeed used the device mapper linear target for
> >>>>>>> many years, and it works perfectly without any code modification.
> >>>>>>>
> >>>>>>> That is, create an 8K image and set it up as a loop device, then write
> >>>>>>> a dm table to combine it with the existing hard drive. When you run
> >>>>>>> “bcache make -B <combined dm target>”, you get a bcache device whose
> >>>>>>> first 8K lives in the loop device, while the existing super block of
> >>>>>>> the hard drive stays at the expected offset.
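> >>>>>>>
> >>>>>>> As a rough, untested sketch (the device and file names below are only
> >>>>>>> placeholders), the whole setup is something like:
> >>>>>>>
> >>>>>>>   # 8K header image, attached as a loop device
> >>>>>>>   dd if=/dev/zero of=/tmp/sdb-header.img bs=4096 count=2
> >>>>>>>   losetup /dev/loop0 /tmp/sdb-header.img
> >>>>>>>
> >>>>>>>   # dm table: first 16 sectors (8K) from the loop device, the rest
> >>>>>>>   # from the existing disk starting at its sector 0
> >>>>>>>   SIZE=$(blockdev --getsz /dev/sdb)
> >>>>>>>   printf '0 16 linear /dev/loop0 0\n16 %s linear /dev/sdb 0\n' "$SIZE" |
> >>>>>>>       dmsetup create sdb-cached
> >>>>>>>
> >>>>>>>   # the bcache super block lands in the loop image; the data area
> >>>>>>>   # starts at the default 8K offset, i.e. at the old disk content
> >>>>>>>   bcache make -B /dev/mapper/sdb-cached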
> >>>>>>>
> >>>>>> We evaluated this option but we weren't satisfied with the outcome,
> >>>>>> for at least two main reasons: complexity and operational drawbacks.
> >>>>>> On the complexity side: in production we use more than 20 HDs that
> >>>>>> need to be cached. It means we need to create 20+ headers for them,
> >>>>>> 20+ loop devices and 20+ dm linear devices. So, basically, there's a
> >>>>>> 3x factor for each HD that we want to cache. Moreover, we're currently
> >>>>>> using ephemeral disks as cache devices. In case of a machine reboot,
> >>>>>> ephemeral devices can get lost; at that point, there will be some
> >>>>>> trouble mounting the dm-linear bcache backing device because there
> >>>>>> will be no cache device.
> >>>>>
> >>>>> OK, I get your point. It might make sense for your situation, although I
> >>>>> know other people use a linear dm-table to solve a similar situation;
> >>>>> that may not be perfect for you. This patch may work in your procedure,
> >>>>> but the following things make me uncomfortable:
> >>>>
> >>>> Coly is right: there are some issues to address.
> >>>>
> >>>> Coly, I do not wish to contradict you, only to add that we have had times
> >>>> where this would be useful. I like the idea, particularly avoiding placing
> >>>> dm-linear atop the bdev, which reduces the IO codepath. Maybe there is
> >>>> an implementation that would fit everyone's needs.
> >>>>
> >>>> For us, such a superblock-less registration could resolve two issues:
> >>>>
> >>>> 1. Most commonly we wish to add bcache to an existing device without
> >>>> re-formatting and without adding the dm-linear complexity.
> >>>
> >>> That's exactly what was preventing us from using bcache in production
> >>> prior to this patch.
> >>
> >> Ok, but we always use writeback...and others may wish to, too. I think
> >> any patch that introduces a feature needs to support existing features
> >> without introducing limitations on the behavior.
> >>
> > Totally agree. My only point was that I extensively tested this patch
> > with wt mode. It works in wb mode as well, for sure, because the
> > backing device's header is almost never used. The only issue I can
> > foresee in wb mode is in case of a machine reboot: the backing device
> > will lose the virtual header and, at boot time, another one will be
> > generated. It will get attached again to its cache device with a new
> > UUID and I'm not sure if this implies the loss of the data that was
> > not previously written to it, but was only present on the cache
> > device. But I think that losing data is a well-known risk of wb
> > mode.
>
> NO, losing dirty data on the cache device is unacceptable. If the previously attached cache device is not ready, the backing device will be suspended and its bcache device won’t show up in /dev/.
>
>
> >
> >>>> 2. Relatedly, IMHO, I've never liked the default location at 8k because we
> >>>>    like to align our bdevs to the RAID stride width, and rarely is the
> >>>>    bdev array aligned at 8k (usually 64k or 256k for hardware RAID). If
> >>>>    we forget to pass --data-offset at make-bcache time then we are
> >>>>    stuck with an 8k misalignment for the life of the device.
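> >>>>
> >>>>    For example (if I recall correctly --data-offset is given in 512-byte
> >>>>    sectors, so 512 sectors = 256k; the device name is just a placeholder):
> >>>>
> >>>>      # align the data area to a 256k RAID stripe instead of the default 8k
> >>>>      make-bcache -B --data-offset 512 /dev/sdX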
> >>>>
> >>>> In #2 we usually reformat the volume to avoid dm-linear complexity (and in
> >>>> both cases we have wanted writeback cache support). This process can take
> >>>> a while to `dd` ~30TB of bdev on spinning disks, and we have to find
> >>>> temporary disk space to move that data out and back in.
> >>>>
> >>>>> - Copying user space memory and directly using it as a complicated in-kernel memory object.
> >>>>
> >>>> In the ioctl changes for bch_write_bdev_super() there does not appear to
> >>>> be a way to handle attaching caches to the running bcache. For example:
> >>>>
> >>>> 1. Add a cacheless bcache0 volume via ioctl
> >>>> 2. Attach a writeback cache, write some data.
> >>>> 3. Unregister everything
> >>>> 4. Re-register the _dirty_ bdev from #1
> >>>>
> >>>> In this scenario bcache will start "running" immediately because it
> >>>> cannot update its cset.uuid (as reported by bcache-super-show) which I
> >>>> believe is stored in cache_sb.set_uuid.
> >>>>
> >>>> I suppose in step #2 you could update your userspace state with the
> >>>> cset.uuid so your userspace ioctl register tool would put the cset.uuid in
> >>>> place, but this is fragile without good userspace integration.
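> >>>>
> >>>> Roughly, the userspace tooling would have to stash something like this
> >>>> away at attach time (just a sketch; the field name is whatever
> >>>> bcache-super-show prints):
> >>>>
> >>>>   bcache-super-show /dev/sdb | awk '/cset.uuid/ {print $2}'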
> >>>>
> >>>> I could imagine something like /etc/bcachetab (like crypttab or
> >>>> mdadm.conf) that maps cset UUID's to bdev UUIDs. Then your userspace
> >>>> ioctl tool could be included in a udev script to auto-load volumes as they
> >>>> become available.
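> >>>>
> >>>> Something along these lines, with a format made up here purely for
> >>>> illustration:
> >>>>
> >>>>   # /etc/bcachetab: <cset uuid>  <backing bdev uuid>  [options]
> >>>>   aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee  11111111-2222-3333-4444-555555555555  writeback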
> >>> Yes, in conjunction with this patch, I developed a python udev script
> >>> that manages device registration based on a YAML configuration file. I
> >>> even patched bcache-tools to support the new IOCTL registration. You
> >>> will find a link to the Github project at the end of this message.
> >>
> >> Fewer dependencies are better: There are still python2 vs python3
> >> conflicts out there---and loading python and its dependencies into an
> >> initrd/initramfs for early bcache loading could be very messy, indeed!
> >>
> >> You've already put some work into make-bcache so creating a bcache_udev
> >> function and a bcache-udev.c file (like make-bcache.c) is probably easy
> >> enough. IMHO, a single-line grepable format (instead of YAML) could be
> >> used to minimize dependencies so that it stays clean in an initramfs, in
> >> the same way that mdadm.conf and crypttab already are. You could then
> >> parse it with C or bash pretty easily...
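> >>
> >> A rough sketch of the parsing side, assuming the made-up one-line format
> >> above (the actual registration step is left as a placeholder echo):
> >>
> >>   # skip comments and blank lines, handle one backing device per line
> >>   grep -vE '^(#|$)' /etc/bcachetab | while read -r cset_uuid bdev_uuid opts; do
> >>       echo "register $bdev_uuid against cache set $cset_uuid ($opts)"
> >>   done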
> >>
> > I will be really glad to rework the patch if we can agree on some
> > modifications that will make it suitable to be merged upstream.
>
Hi Coly,
thank you very much for your time.

>
> I don’t support the idea of copying a block of memory from user space to kernel space and referencing it as a super block, via either an IOCTL or a sysfs interface.
> It is very easy to generate corrupted super block memory, send it into kernel space, and panic the whole system; I will not take this potential risk.
>
I think I'm missing something here because I cannot see the difference
between passing the structure through the sysfs interface and reading
it from the header of the block device. In both cases the source of
the structure will be the same: the user, via the make-bcache command.
My understanding of the parts involved is:
    udev_rule -> bcache-register.c -> sysfs/register ->
bcache_module/register_bcache -> read_super
So, in read_super the bcache module will read what the userspace
utility make-bcache wrote as a super block. Correct?
If I'm correct, a big if, I cannot see why it should be "easier" to
generate a corrupted super block with this patch. Can you please
elaborate on this?
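
For comparison, the path above eventually boils down to the usual

    echo /dev/sdX > /sys/fs/bcache/register

(where /dev/sdX is whatever device make-bcache formatted), so in both
cases the kernel ends up parsing a super block that a userspace tool
produced.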

Thank you in advance,
Andrea

> >
> >>>> You want to make sure that when a dirty writeback-cached bdev is
> >>>> registered that it does not activate until either:
> >>>>
> >>>> 1. The cache is attached
> >>>> 2. echo 1 > /sys/block/bcache0/bcache/running
> >>>>
> >>>>> - A new IOCTL interface is added, even though all the rest of the interfaces are sysfs based.
> >>>>
> >>>> True. Perhaps something like this would be better, and it would avoid
> >>>> sharing a userspace page for blindly populating `struct cache_sb`, too:
> >>>>
> >>>> echo '/dev/bdev [bdev_uuid] [/dev/cachedev|cset_uuid] [options]' > \
> >>>> /sys/fs/bcache/register_raw
> >>>
> >>> I thought that introducing an IOCTL interface could be a stopper for
> >>> this patch, so I'm more than welcome to refactor it using sysfs
> >>> entries only. But, I will do so only if there's even the slightest
> >>> chance that this patch can be merged.
> >>
> >> I'm for the patch with sysfs, but I'm not the decision maker either.
> >>
> >
> > As Eric already asked,
> >> Coly: what would your requirements be to accept this upstream?
> >>>>>
>
> You may use it as you want, but I won’t accept it into upstream via my path.
>
> Coly Li
>
>



