Hi everyone, chiming in to clear this up a bit.

> Got it, looks a good use case for compression, but not has to be
> QCOW2.
>
> > The network boot infrastructure is based on a classical PXE network
> > boot to load the Linux kernel and the initramfs. In the initramfs,
> > the compressed QCOW2 image is fetched via nfs or cifs or something
> > else. The fetched QCOW2 image is now decompressed and read in the
> > kernel. Compared to a decompression and read in the user space,
> > like qemu-nbd does, this approach does not need any user space
> > process, is faster and avoids switchroot problems.
>
> This image can be compressed via xz, and fetched via wget or what
> ever. 'xz' could have better compression ratio than qcow2, I guess.

"Fetch" was probably a bit ambiguous: the image isn't downloaded, but mounted directly over the network (streamed, if you will), so we can benefit from the per-cluster compression of qcow2, similar to squashfs but on the block layer. A typical image is between 3 and 10 GB with qcow2 compression, so downloading it entirely at boot just to be able to decompress it is not feasible.

> As I mentioned above, seems not necessary to introduce loop-qcow2.

Yes, there are many ways to achieve this. The basic concept of network booting the workstations has been practiced here for almost 15 years now, using very different approaches: plain old NFS mounts for the root filesystem, or squashfs containers that are either downloaded or streamed over the network. But since our requirement is a stateless system, we need a copy-on-write layer on top of this. In the beginning we did this with unionfs and later aufs, but as these operate on the file-system layer, they have several drawbacks and relatively high complexity compared to block-layer CoW, so we switched to a block-based approach about four years ago.

For the reasons stated before, we wanted to keep some form of compression, as was possible with squashfs, and after some experimenting, qcow2 proved to be a good fit. However, adding user-space tools like qemu-nbd or xmount into the mix cost too much performance and, initially, also caused some problems during the switchroot from the initrd to the actual root file system.

So the current process looks as follows (see the P.S. at the end for a rough sketch):

1. Kernel and initrd are loaded via iPXE.
2. The initrd sets up the network, then either mounts an NFS share or connects to a server via NBD to access the qcow2 image.
3. A modified losetup sets up access to the qcow2 image, either from the NFS share or directly from /dev/nbd0.
4. Finally, mount /dev/loop0pXX and switch to the new root.

Manuel's implementation has so far proven to be very reliable and brought noticeable performance improvements compared to having a user-space process do the qcow2 handling, so we would really have liked to see his changes upstreamed. I think he did a very good job designing a plugin infrastructure for the loop device and making the qcow2 plugin a separate module.

We knew about the concerns of handling a file format in kernel code, and were hoping that an acceptable compromise might be to merge his changes minus the actual qcow2 plugin, so that what remains is mostly a refactoring of the old loop device that (hopefully) doesn't add too much complexity. But if our use case really is such an oddball here that this won't be of interest to anybody else, we will just have to keep maintaining it out of tree entirely.

Thanks for your time,
Simon
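
P.S. For the curious, here is a rough sketch of the initramfs steps above, with hypothetical host, share, and image names. The '-t qcow2' option comes from Manuel's patched losetup and does not exist in stock util-linux; everything else is standard tooling. The commented qemu-nbd lines at the end show the user-space path we moved away from, for comparison.

  #!/bin/sh
  # Variant A: the qcow2 image sits on an NFS share.
  mount -o ro,nolock nfs-server:/export/images /images

  # Variant B: the raw qcow2 file is exported by an NBD server instead.
  # nbd-client nbd-server-host 10809 /dev/nbd0

  # Attach the loop device to the compressed image, read-only.
  # '-t qcow2' selects the qcow2 plugin (patched losetup only).
  losetup -r -t qcow2 /dev/loop0 /images/stateless-ws.qcow2

  # Mount the root partition from the loop device and switch root.
  mount -o ro /dev/loop0p1 /sysroot
  exec switch_root /sysroot /sbin/init

  # For comparison, the user-space qemu-nbd path would look like:
  #   modprobe nbd
  #   qemu-nbd --read-only --connect=/dev/nbd1 /images/stateless-ws.qcow2
  #   mount -o ro /dev/nbd1p1 /sysroot
  # i.e. a qemu-nbd process has to survive the switch_root, which is where
  # we saw the performance penalty and the initial switchroot problems.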