On 12. 04. 22 at 0:30, Demi Marie Obenour wrote:
On Mon, Apr 11, 2022 at 10:16:43PM +0200, Zdenek Kabelac wrote:
On 11. 04. 22 at 19:22, Demi Marie Obenour wrote:
On Mon, Apr 11, 2022 at 10:16:02AM +0200, Zdenek Kabelac wrote:
On 11. 04. 22 at 0:03, Demi Marie Obenour wrote:
Your proposal actually breaks this sequence and would move things to the
state of 'guess which state we are in now'. (And IMHO it presents much more
risk than the virtual problem of suspend from user-space - which is only a
problem if you are using the suspended device as 'swap' or 'rootfs' - so there
are very easy ways to orchestrate your LVs to avoid such problems.)
The intent is less “guess which state we are in now” and more “it looks
like dm-thin already has the data structures needed to store some
per-thin metadata, and that could make writing a simple userspace volume
manager FAR FAR easier”. It appears to me that the only change needed
I won't spend hours explaining all the details - but even running the
suspend alone may result in many different problems, where running the
thin-pool out of data space is one of the easiest to hit.
Basically, each step must be designed with a 'power-off' happening during the
operation in mind. For each step you need to know what the recovery step looks
like and how the lvm2 & kernel metadata could/would match together.
That is absolutely the case, and is in fact the reason I proposed this
change to begin with. By having dm-thin store a small amount of
userspace-provided metadata for each thin volume, and by providing an
API to enumerate the thin volumes in a pool, I can store all of the
metadata I need in the thin pool itself. This is much simpler than
having to store metadata outside of the pool.
Hi
Here is actually the fundamental problem with your proposal - our design was
about a careful split between user-space and kernel as to 'who is the
owner/holder of information'. Your proposal unfortunately does not fit the
model where lvm2 is the authoritative owner of info about devices. Note - we
also tried the 'model' where the info is held within the target - our mdraid
dm wrapper - but it has more troubles compared with the very clear thin logic.
So from the lvm2 position - we do not have any plans to change this proven model.
What you are asking for is that the 'kernel' module does all the work - lvm2
would be obtaining info from the kernel metadata - and eventually you would be
able to command everything through the ioctl() interface, letting the
complexity sit completely in the kernel. But as explained, our design is
heading in the opposite direction - what can be done in user-space stays in
user-space, and the kernel does the necessary minimum, which can then be much
more easily developed and traced.
Combining many
steps together into a single 'kernel' call just increases the already large
range of errors. So in many cases we simply favour keeping operations more
'low-level-atomic', even at a slightly higher performance price (as said -
we've never seen the creation of a snapshot to be a 'msec'-critical operation,
as the 'suspend' with its implicit flush & fsfreeze might itself be a far more
expensive operation).
Qubes OS should never be snapshotting an in-use volume of any kind.
Right now, there is one case where it does so, but that is a bug, and I
am working on fixing it. A future API might support snapshotting to an
in-use volume, but that would likely require a way to tell the VM to
freeze its own filesystem.
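Such a freeze-before-snapshot sequence is conventionally done with fsfreeze around the lvcreate. A minimal dry-run sketch of that ordering - the mount point and VG/LV names here are hypothetical placeholders, and the commands are echoed rather than executed:

```shell
#!/bin/sh
# Dry run: print the freeze -> snapshot -> thaw sequence instead of executing it.
# /mnt/data and vg/data are placeholder names, not Qubes OS specifics.
snapshot_frozen() {
    mnt=$1; origin=$2; snap=$3
    echo "fsfreeze --freeze $mnt"                      # quiesce the filesystem
    echo "lvcreate --snapshot --name $snap $origin"    # take the thin snapshot
    echo "fsfreeze --unfreeze $mnt"                    # resume writes
}
snapshot_frozen /mnt/data vg/data data-snap
```

For a VM-internal filesystem the freeze step would have to run inside the guest (e.g. via a guest agent), but the ordering is the same.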
Yeah - you have a very unusual use case - in fact the lvm2 goal is usually to
support as many things as we can while devices are in use, so the user does
not need to take them offline - which surely complicates everything a lot.
Also, there was basically never any user demand to operate on offline devices
in a very quick way - so admittedly this is not a focus area of development.
But IMHO the creation and removal of thousands of devices in a very short
period of time rather suggests there is something sub-optimal in your original
software design, as I'm really having a hard time imagining why you would
need this?
There very well could be (suggestions for improvement welcome).
If you wish to operate lots of devices - simply keep them created and ready,
and eventually blkdiscard them for reuse as the next device.
That would work for volatile volumes, but those are only about 1/3 of
the volumes in a Qubes OS system. The other 2/3 are writable snapshots.
Also, Qubes OS has found blkdiscard on thins to be a performance
problem. It used to lock up entire pools until Qubes OS moved to doing
the blkdiscard in chunks.
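The chunked-discard workaround described above amounts to splitting one large discard into bounded steps so the pool's metadata lock is released between them. A dry-run sketch of the offset arithmetic - the device path and sizes are arbitrary assumptions, and the commands are printed rather than run:

```shell
#!/bin/sh
# Dry run: print a sequence of bounded blkdiscard calls covering the device,
# instead of one pool-stalling discard. Device and sizes are placeholders.
chunked_discard() {  # chunked_discard DEV SIZE_BYTES STEP_BYTES
    dev=$1; size=$2; step=$3; off=0
    while [ "$off" -lt "$size" ]; do
        len=$step
        [ $((off + len)) -gt "$size" ] && len=$((size - off))   # clamp last chunk
        echo "blkdiscard --offset $off --length $len $dev"
        off=$((off + len))
    done
}
chunked_discard /dev/pool/thinvol $((1024 * 1024 * 1024)) $((256 * 1024 * 1024))
```

Recent util-linux blkdiscard also has a --step option that performs this splitting internally.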
Always make sure you use recent Linux kernels.
Should the 5.16 series be recent enough?
Blkdiscard should not differ from lvremove too much - also experiment with how
'lvchange --discards passdown|nopassdown poolLV' works.
I believe this was with passdown on, which is the default in Qubes OS.
The bug was tracked down by Jinoh Kang in
https://github.com/QubesOS/qubes-issues/issues/5426#issuecomment-761595524
and found to be due to dm-thin deleting B-tree nodes one at a time,
causing large amounts of time to be wasted on btree rebalancing and node
locking.
I'm also unsure where any special need to instantiate that many snapshots
would arise - if there is some valid & logical purpose, lvm2 could perhaps
have an extended user-space API to create multiple snapshots at once
(i.e. create 10 snapshots named name-%d of a single thinLV).
This would be amazing, and Qubes OS should be able to use it. That
said, Qubes OS would prefer to be able to choose the name of each volume
separately. Could there be a more general batching operation? Just
supporting ‘lvm lvcreate’ and ‘lvm lvs’ would be great, but support for
‘lvm lvremove’, ‘lvm lvrename’, ‘lvm lvextend’, and ‘lvm lvchange
--activate=y’ as well would be even better.
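Until such a batching interface exists, the per-name creation can at least be scripted; each snapshot is still a separate lvcreate invocation, so the userspace metadata and udev cost is paid per call. A dry-run sketch (the origin and names are hypothetical):

```shell
#!/bin/sh
# Dry run: emit one lvcreate per requested snapshot name.
# vg/origin and the snapshot names are illustrative placeholders.
batch_snapshot() {
    origin=$1; shift
    for name in "$@"; do
        echo "lvcreate --snapshot --name $name $origin"
    done
}
batch_snapshot vg/origin vm1-root vm2-root vm3-root
```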
There is a kind of 'hidden' plan inside the command-line processing to allow
'grouped' processing:
lvcreate --snapshot --name lv1 --snapshot --name lv2 vg/origin
However, there is currently no manpower to proceed further with this part, as
other parts of the code need enhancements.
But we may put this on our TODO list...
That would be great, thanks!
Although the main reason to support this kind of API was a request to
support an atomic snapshot of multiple LVs at once - but so far it is not a
high priority.
Not to mention that operating that many thin volumes from a single thin-pool
is also nothing close to the high-performance goal you are trying to reach...
Would you mind explaining? My understanding, and the basis of
essentially all my feature requests in this area, was that virtually all
of the cost of LVM is the userspace metadata operations, udev syncing,
and device scanning. I have been assuming that the kernel does not have
performance problems with large numbers of thin volumes.
The main idea behind the comment is: when disk usage increases, the
manipulation of thin-pool metadata and its locking will soon start to be a
considerable performance problem.
So while it's easy to have 1000 thinLVs from a single thin-pool active while
they are UNUSED, the situation is dramatically different when those LVs are
under some heavy use load. There you should keep the number of active thinLVs
in the low tens, especially if you are performance oriented. Lighter usage,
less provisioning, and especially a bigger block size all improve the situation.
I can try to modify the storage pool so that LVs are not activated by
default. That said, Qubes OS will always be provisioning-heavy. With
the notable exception of volatile volumes, qubesd always snapshots a
You definitely should keep active ONLY the LVs you need to have active -
keeping 'unused' LVs active impacts many other kernel areas and consumes
system resources.
volume at startup and then provides the snapshot to the VM. After
shutdown, the original volume is renamed to be a backup, and the
snapshot gets the name of the original volume. Bigger block sizes would
substantially increase write amplification, as turning off zeroing is
not an option for security reasons.
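The advice above about keeping only needed LVs active comes down to running lvchange -an on everything not currently in use. A dry-run sketch of that filtering - the VG name and LV lists are hypothetical, and commands are printed rather than executed:

```shell
#!/bin/sh
# Dry run: print 'lvchange -an' for every listed LV that is not in use.
# vg0 and the LV names are illustrative placeholders.
deactivate_unused() {  # deactivate_unused VG "IN-USE LIST" LV...
    vg=$1; inuse=$2; shift 2
    for lv in "$@"; do
        case " $inuse " in
            *" $lv "*) ;;                        # still needed: leave active
            *) echo "lvchange -an $vg/$lv" ;;    # not needed: deactivate
        esac
    done
}
deactivate_unused vg0 "vm1-root" vm1-root vm2-root vm3-backup
```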
For 'snapshot'-heavy loads, smaller chunks are usually better - but that comes
with a price.
Is this just a workload that dm-thin is ill-suited for? Qubes OS does
support storing VM images on either BTRFS or XFS files, and it could be
that this is a better plan going forward.
Not knowing the details - but as mentioned, 'zeroing' is not needed for
'filesystem' security - a modern filesystem will never let you read unwritten
data, as it keeps its own map of written data - but of course, a user with
root access to the device could read some 'unwritten' data on that device
with 'dd'...
How much of a performance win can I expect from only activating the
subset of volumes I actually use?
I can only advise benchmarking with some good approximation of your expected
workload.
That’s already on the Qubes OS team’s (very long) to-do list.
I'd prioritize this - to get the best balance of performance. I.e. slightly
bigger chunks could give you much better numbers if your 'snapshot' workload
is focused on small 'areas', so you know exactly where the focus should go
(too many cooks spoil the broth)...
So even a jump from 64k -> 256k can be significant.
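The metadata-pressure side of that trade-off can be made concrete: the number of mappings the pool must track for a fully provisioned LV shrinks linearly with chunk size, so 64k -> 256k means 4x fewer B-tree entries per LV. A small sketch (the 10 GiB volume size is an arbitrary assumption):

```shell
#!/bin/sh
# How many thin-pool chunk mappings a fully provisioned LV needs
# at a given chunk size. The 10 GiB size is just an example.
chunks() {  # chunks SIZE_BYTES CHUNK_BYTES
    echo $(( ($1 + $2 - 1) / $2 ))   # ceiling division
}
size=$((10 * 1024 * 1024 * 1024))
echo "64k chunks:  $(chunks $size 65536)"
echo "256k chunks: $(chunks $size 262144)"
```

The chunk size itself is set at pool creation time (e.g. lvcreate's --chunksize option for a thin-pool), so comparing candidate sizes on paper first is cheap.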
Regards
Zdenek
--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/dm-devel