On 11. 04. 22 at 19:22, Demi Marie Obenour wrote:
On Mon, Apr 11, 2022 at 10:16:02AM +0200, Zdenek Kabelac wrote:
On 11. 04. 22 at 0:03, Demi Marie Obenour wrote:
Your proposal actually breaks this sequence and would move things to a
state of 'guess which state we are in now' (and IMHO it presents much more
risk than the virtual problem with suspend from user space, which is only a
problem if you are using the suspended device as 'swap' or 'rootfs', so there
are very easy ways to orchestrate your LVs to avoid such problems).
The intent is less “guess what states we are now” and more “It looks
like dm-thin already has the data structures needed to store some
per-thin metadata, and that could make writing a simple userspace volume
manager FAR FAR easier”. It appears to me that the only change needed
I will not spend hours explaining all the details, but running just the
suspend alone may result in many different problems, of which running the
thin-pool out of data space is one of the easiest.
Basically, each step must be designed with a 'power-off' happening during the
operation in mind. For each step you need to know what the recovery step looks
like and how the lvm2 & kernel metadata could/would match together. Combining
many steps together into a single 'kernel' call just increases the already
large range of errors. So in many cases we simply favour keeping operations
more 'low-level-atomic', even at a slightly higher performance price (as said,
we have never seen the creation of a snapshot be a 'msec'-critical operation,
as the 'suspend' with implicit flush & fsfreeze itself might be a far more
expensive operation).
But IMHO the creation and removal of thousands of devices in a very short
period of time rather suggests there is something sub-optimal in your original
software design, as I'm really having a hard time imagining why you would
need this?
There very well could be (suggestions for improvement welcome).
If you wish to operate lots of devices, simply keep them created and ready,
and eventually blkdiscard them for reuse as the next device.
That would work for volatile volumes, but those are only about 1/3 of
the volumes in a Qubes OS system. The other 2/3 are writable snapshots.
Also, Qubes OS has found blkdiscard on thins to be a performance
problem. It used to lock up entire pools until Qubes OS moved to doing
the blkdiscard in chunks.
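
Chunked discarding of this kind could be sketched roughly as below. This is
only an illustration, not the actual Qubes OS code: the function name
discard_in_chunks and the DRY_RUN switch are made up for this example, and
offsets and lengths are assumed to be multiples of the device sector size,
which blkdiscard requires.

```shell
#!/bin/sh
# Sketch: discard a device in fixed-size chunks instead of one huge request.
# discard_in_chunks and DRY_RUN are hypothetical names used for illustration.
# With DRY_RUN=1 (the default here) the commands are only printed.

# usage: discard_in_chunks DEVICE SIZE_BYTES CHUNK_BYTES
discard_in_chunks() {
    dev=$1
    size=$2
    chunk=$3
    offset=0
    while [ "$offset" -lt "$size" ]; do
        len=$chunk
        # Shorten the final chunk so it does not run past the end of the device.
        if [ $((offset + len)) -gt "$size" ]; then
            len=$((size - offset))
        fi
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "blkdiscard --offset $offset --length $len $dev"
        else
            blkdiscard --offset "$offset" --length "$len" "$dev" || return 1
        fi
        offset=$((offset + len))
    done
}
```

The device size can be obtained with 'blockdev --getsize64 DEVICE'. Recent
util-linux versions of blkdiscard also have a --step option that splits the
discard into iterations internally, which may make such a loop unnecessary.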
Always make sure you use recent Linux kernels.
Blkdiscard should not differ from lvremove too much; also experiment with how
'lvchange --discards passdown|nopassdown poolLV' works.
I'm also unsure where any special need to instantiate that many snapshots
would arise. If there is some valid & logical purpose, lvm2 could maybe have
an extended user-space API to create multiple snapshots at once (e.g.
create 10 snapshots named name-%d of a single thinLV).
This would be amazing, and Qubes OS should be able to use it. That
said, Qubes OS would prefer to be able to choose the name of each volume
separately. Could there be a more general batching operation? Just
supporting ‘lvm lvcreate’ and ‘lvm lvs’ would be great, but support for
‘lvm lvremove’, ‘lvm lvrename’, ‘lvm lvextend’, and ‘lvm lvchange
--activate=y’ as well would be even better.
There is a kind of 'hidden' plan inside the command line processing to allow
'grouped' processing:
lvcreate --snapshot --name lv1 --snapshot --name lv2 vg/origin
However, there is currently no manpower to proceed further on this part, as
other parts of the code need enhancements.
But we may put this on our TODO plans...
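
Until such grouped processing exists, creating several individually named thin
snapshots means one lvcreate invocation per snapshot, each paying the full
per-command cost. A minimal sketch of that loop follows; the helper name
snapshot_many, the DRY_RUN switch and the vg/origin name are placeholders for
this example only.

```shell
#!/bin/sh
# Sketch: create one thin snapshot per given name, one lvcreate call each.
# snapshot_many and DRY_RUN are hypothetical names used for illustration.
# With DRY_RUN=1 (the default here) the commands are only printed.

# usage: snapshot_many VG/ORIGIN NAME...
snapshot_many() {
    origin=$1
    shift
    for name in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "lvcreate --snapshot --name $name $origin"
        else
            # Stop at the first failure so the caller can decide how to recover.
            lvcreate --snapshot --name "$name" "$origin" || return 1
        fi
    done
}
```

For example, 'snapshot_many vg/origin work-snap backup-snap' prints two
lvcreate commands, one per chosen name.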
Not to mention that operating that many thin volumes from a single thin-pool
is also nowhere close to the high-performance goal you are trying to reach...
Would you mind explaining? My understanding, and the basis of
essentially all my feature requests in this area, was that virtually all
of the cost of LVM is the userspace metadata operations, udev syncing,
and device scanning. I have been assuming that the kernel does not have
performance problems with large numbers of thin volumes.
The main idea behind the comment is: when disk usage increases, the
manipulation of thin-pool metadata and locking will soon start to be a
considerable performance problem.
So while it's easy to have 1000 active thinLVs from a single thin-pool that
are UNUSED, the situation is dramatically different when these LVs are under
some heavy load. There you should keep the number of active thinLVs low, in
the tens, especially if you are performance oriented. Lighter usage, less
provisioning and especially a bigger block size improve
Right now, my machine has 334 active thin volumes, split between one
pool on an NVMe drive and one on a spinning hard drive. The pool on an
NVMe drive has 312 active thin volumes, of which I believe 64 are in use.
Are these numbers high enough to cause significant performance
penalties for dm-thin v1, and would they cause problems for dm-thin v2?
There are not yet any numbers for v2.
For v1, 64 thins might eventually experience some congestion under heavy load
(compared with a 'native' raw spindle).
How much of a performance win can I expect from only activating the
subset of volumes I actually use?
I can only advise benchmarking with a good approximation of your expected
workload.
In some cases it may turn out that your workload is not too sensitive to the
various locking limitations.
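
One way to prepare such a comparison is to run the same workload twice, once
with everything active and once with only the volumes in use active. The
sketch below only prints the lvchange commands that would deactivate the
unused volumes; plan_deactivation is a made-up helper name, the volume names
are placeholders, and the printed list should be reviewed before running it
(for instance, it must not contain the pool itself).

```shell
#!/bin/sh
# Sketch: print lvchange commands deactivating every LV that is not in use.
# plan_deactivation is a hypothetical helper; it never runs lvchange itself.

# usage: plan_deactivation VG "IN_USE_NAMES" "ALL_NAMES"
# Both name lists are whitespace-separated.
plan_deactivation() {
    vg=$1
    keep=$2
    all=$3
    for lv in $all; do
        case " $keep " in
            *" $lv "*) ;;  # listed as in use: leave it active
            *) echo "lvchange --activate n $vg/$lv" ;;
        esac
    done
}
```

The full list of names can be taken from 'lvs --noheadings -o lv_name VG'.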
Regards
Zdenek
--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/dm-devel