Re: LVM performance vs direct dm-thin

On 30. 01. 22 at 17:45, Demi Marie Obenour wrote:
On Sun, Jan 30, 2022 at 11:52:52AM +0100, Zdenek Kabelac wrote:
On 30. 01. 22 at 1:32, Demi Marie Obenour wrote:
On Sat, Jan 29, 2022 at 10:32:52PM +0100, Zdenek Kabelac wrote:
On 29. 01. 22 at 21:34, Demi Marie Obenour wrote:
How much slower are operations on an LVM2 thin pool compared to manually
managing a dm-thin target via ioctls?  I am mostly concerned about
volume snapshot, creation, and destruction.  Data integrity is very
important, so taking shortcuts that risk data loss is out of the
question.  However, the application may have some additional information
that LVM2 does not have.  For instance, it may know that the volume that
it is snapshotting is not in use, or that a certain volume it is
creating will never be used after power-off.
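
For concreteness, here is a rough sketch (in Python, driving dmsetup; the pool
name, device IDs and sizes are made up for illustration, and it assumes root
and an already-active dm-thin pool) of the raw device-mapper operations such a
hand-rolled manager would perform for create/snapshot/destroy:

  import subprocess

  POOL = "/dev/mapper/vg-vmpool-tpool"   # assumed name of the active thin-pool target

  def dm(*args):
      subprocess.run(["dmsetup", *args], check=True)

  def create_thin(dev_id):
      # Allocate a new thin device inside the pool's metadata.
      dm("message", POOL, "0", f"create_thin {dev_id}")

  def snapshot(new_id, origin_id, origin_name):
      # The kernel requires the origin to be suspended (or inactive) while the
      # snapshot is taken, so flush and suspend it first.
      dm("suspend", origin_name)
      dm("message", POOL, "0", f"create_snap {new_id} {origin_id}")
      dm("resume", origin_name)

  def activate(name, dev_id, sectors):
      # Expose the thin device as /dev/mapper/<name>.
      dm("create", name, "--table", f"0 {sectors} thin {POOL} {dev_id}")

  def destroy(name, dev_id):
      dm("remove", name)                              # deactivate
      dm("message", POOL, "0", f"delete {dev_id}")    # drop it from the pool

This is roughly the fast path that lvm2 wraps with locking, its own metadata
commits and udev synchronisation.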


So brave developers can always write their own management tools for their
constrained environment, and those will be significantly faster in terms of
how many thins you can create per minute (btw, you will also need to consider
dropping the use of udev on such a system).

What kind of constraints are you referring to?  Is it possible and safe
to have udev running, but told to ignore the thins in question?

Lvm2 is oriented more towards managing a set of different disks,
where the user is adding/removing/replacing them.  So it's more about
recoverability, good support for manual repair (ASCII metadata),
tracking the history of changes, backward compatibility, support
for conversion to different volume types (e.g. caching of thins, pvmove...),
support for running with or without udev & systemd, clusters, and nearly
every Linux distro available... So there is a lot - and all of this adds
quite some complexity.

I am certain it does, and that makes a lot of sense.  Thanks for the
hard work!  Those features are all useful for Qubes OS, too — just not
in the VM startup/shutdown path.

So once you scrap all this - and you say you only care about a single disk -
then you are able to use more efficient metadata formats, which you could
even keep permanently in memory for the whole lifetime - all of this brings
great performance gains.

But it all depends on how much you can constrain your environment.

It's worth mentioning that there is lvm2 support for 'external' thin volume
creators - lvm2 then only maintains the thin-pool data & metadata LVs, while
creation, activation and deactivation of thin volumes is left to an external
tool.  This was used by Docker for a while - later on they switched to
OverlayFS, I believe...
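
As a rough illustration of that split (VG/LV names and sizes are made up, and
this is not a statement about how Docker did it): lvm2 creates and activates
the pool once, and the external tool then talks to the pool's dm device
directly, e.g. with the dmsetup messages sketched earlier:

  import subprocess

  def run(*cmd):
      subprocess.run(cmd, check=True)

  # lvm2 side: create the thin-pool LV (data + metadata) once...
  run("lvcreate", "--type", "thin-pool", "-L", "500G", "-n", "vmpool", "vg")
  # ... and make sure it is active.
  run("lvchange", "-ay", "vg/vmpool")

  # External-tool side: thin volumes are then created, activated and deleted
  # directly against the active pool target (typically vg-vmpool-tpool),
  # bypassing lvm2 on the per-VM fast path.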

That indeed sounds like a good choice for Qubes OS.  It would allow the
data and metadata LVs to be any volume type that lvm2 supports, and to be
managed using all of lvm2’s features.  So one could still put the
metadata on a RAID-10 volume while everything else is RAID-6, or set up
a dm-cache volume to store the data (please correct me if I am wrong).
Qubes OS has already moved to using a separate thin pool for virtual
machines, as it prevents dom0 (the privileged management VM) from being run
out of disk space (by accident or malice).  That means that the thin
pool used for guests is managed only by Qubes OS, and so the standard
lvm2 tools do not need to touch it.
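
For what it's worth, a sketch of that layout with stock lvm2 commands (wrapped
in Python only for consistency with the other sketches; VG/LV names and sizes
are illustrative, and the exact RAID options depend on the PVs available):

  import subprocess

  def run(*cmd):
      subprocess.run(cmd, check=True)

  # Data on RAID-6, metadata on RAID-10 (both are ordinary LVs at this point).
  run("lvcreate", "--type", "raid6",  "-L", "1T",  "-n", "vmdata", "vg")
  run("lvcreate", "--type", "raid10", "-L", "16G", "-n", "vmmeta", "vg")
  # Glue them together into a thin pool; vg/vmdata becomes the pool's data LV
  # and vg/vmmeta its metadata LV.
  run("lvconvert", "-y", "--type", "thin-pool",
      "--poolmetadata", "vg/vmmeta", "vg/vmdata")
  # The pool's data LV can later be cached (lvconvert --type cache) if desired.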

Is this a setup that you would recommend, and would be comfortable using
in production?  As far as metadata is concerned, Qubes OS has its own
XML file containing metadata about all qubes, which should suffice for
this purpose.  To prevent races during updates and ensure automatic
crash recovery, is it sufficient to store metadata for both new and old
transaction IDs, and pick the correct one based on the device-mapper
status line?  I have seen lvm2 get into an inconsistent state (transaction
ID off by one) that required manual repair, which is quite unnerving for a
desktop OS.
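
To make the two-copies idea concrete, a minimal sketch (the pool name is
illustrative; the status layout is the one documented for the thin-pool
target) of picking the surviving metadata copy from the pool's reported
transaction ID:

  import subprocess

  def pool_transaction_id(pool_dm_name):
      # "dmsetup status <pool>" prints:
      #   <start> <len> thin-pool <transaction_id> <meta used/total> <data used/total> ...
      out = subprocess.run(["dmsetup", "status", pool_dm_name],
                           capture_output=True, text=True, check=True).stdout
      fields = out.split()
      assert fields[2] == "thin-pool"
      return int(fields[3])

  def pick_metadata(pool_dm_name, old_copy, new_copy, new_txid):
      # After a crash, either the set_transaction_id commit happened (the pool
      # reports new_txid) or it did not (it still reports the old ID); load
      # the matching copy of the application's own metadata.
      return new_copy if pool_transaction_id(pool_dm_name) == new_txid else old_copy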

My biased advice would be to stay with lvm2.  There is a lot of work involved, many things are not well documented, and getting everything running correctly will take a lot of effort (Docker in fact did not manage to do it well and was incapable of providing any recoverability).

One feature that would be nice is to be able to import an
externally-provided mapping of thin pool device numbers to LV names, so
that lvm2 could provide a (read-only, and not guaranteed fresh) view of
system state for reporting purposes.

Once you have evidence that it's lvm2 causing a major issue - then you could consider whether it's worth stepping into a separate project.


It's worth mentioning - the more bullet-proof you want to make your
project, the closer you will get to the extra processing done by lvm2.

Why is this?  How does lvm2 compare to Stratis, for example?

Stratis is yet another volume manager, written in Rust and combined with XFS
for an easier user experience.  That's all I'd probably say about it...

That’s fine.  I guess my question is why making lvm2 bullet-proof needs
so much overhead.

It's difficult - if you were distributing lvm2 with an exact kernel version & udev & systemd on a single Linux distro, it would reduce a huge set of troubles...

However, before you step into these waters, you should probably
evaluate whether a thin-pool actually meets your needs, given your high
expectations for the number of supported volumes - so you do not end up with
hyper-fast snapshot creation while the actual usage then fails to meet your
needs...

What needs are you thinking of specifically?  Qubes OS needs block
devices, so filesystem-backed storage would require the use of loop
devices unless I use ZFS zvols.  Do you have any specific
recommendations?

As long as you live in a world without crashes, buggy kernels, buggy apps and
failing hard drives, everything looks very simple.

Would you mind explaining further?  LVM2 RAID and cache volumes should
provide most of the benefits that Qubes OS desires, unless I am missing
something.

I'm not familiar with Qubes OS - but in many real-world cases we can't push the latest & greatest to our users - so we need to live with bugs and add workarounds...

And every development costs quite some time & money.

That it does.

Since you mentioned ZFS - you might want to focus on a 'ZFS-only' solution.
Combining ZFS or Btrfs with lvm2 is always going to be painful, as
those filesystems have their own volume management.

Absolutely!  That said, I do wonder what your thoughts on using loop
devices for VM storage are.  I know they are slower than thin volumes,
but they are also much easier to manage, since they are just ordinary
disk files.  Any filesystem with reflink can provide the needed
copy-on-write support.
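
In case it helps, a minimal sketch of that approach (paths are illustrative;
it needs a reflink-capable filesystem such as XFS or Btrfs underneath):

  import subprocess

  def clone_and_attach(template, clone):
      # Reflink copy: shares extents with the template instantly, CoW on write.
      subprocess.run(["cp", "--reflink=always", template, clone], check=True)
      # Attach the image file to a free loop device and return its path.
      out = subprocess.run(["losetup", "--find", "--show", clone],
                           capture_output=True, text=True, check=True).stdout
      return out.strip()

  vm_disk = clone_and_attach("/var/lib/vm/template.img", "/var/lib/vm/vm1.img")
  print(vm_disk)   # e.g. /dev/loop0, handed to the VM as its virtual disk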

A chain of filesystem->block_layer->filesystem->block_layer is something you most likely do not want to use for any well-performing solution...
But it's OK for testing...

Regards

Zdenek



_______________________________________________
linux-lvm mailing list
linux-lvm@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/



