On 4/12/22 05:32, Zdenek Kabelac wrote:
> Dne 12. 04. 22 v 0:30 Demi Marie Obenour napsal(a):
>> On Mon, Apr 11, 2022 at 10:16:43PM +0200, Zdenek Kabelac wrote:
>>> Dne 11. 04. 22 v 19:22 Demi Marie Obenour napsal(a):
>>>> On Mon, Apr 11, 2022 at 10:16:02AM +0200, Zdenek Kabelac wrote:
>>>>> Dne 11. 04. 22 v 0:03 Demi Marie Obenour napsal(a):
>>>>>
>>>>> Your proposal actually breaks this sequence and would move things to the state of 'guess at which states we are now'. (and IMHO presents much more risk than virtual problem with suspend from user-space - which is only a problem if you are using suspended device as 'swap' and 'rootfs' - so there are very easy ways how to orchestrate your LVs to avoid such problems).
>>>>
>>>> The intent is less “guess what states we are now” and more “It looks like dm-thin already has the data structures needed to store some per-thin metadata, and that could make writing a simple userspace volume manager FAR FAR easier”. It appears to me that the only change needed
>>>
>>> I do not spend hours explaining all the details - but running just the suspend alone may result in many different problems where the things like running thin-pool out-of-data space is one of the easiest.
>>>
>>> Basically each step must be designed with 'power-off' happen during the operation. For each step you need to know how the recovery step looks like and how the lvm2 & kernel metadata c/would match together.
>>
>> That is absolutely the case, and is in fact the reason I proposed this change to begin with. By having dm-thin store a small amount of userspace-provided metadata for each thin volume, and by providing an API to enumerate the thin volumes in a pool, I can store all of the metadata I need in the thin pool itself. This is much simpler than having to store metadata outside of the pool.
>
> Hi
>
> Here is actually the fundamental problem with your proposal - our design was about careful split between user-space and kernel 'who is the owner/holder of information' - your proposal unfortunately does not fit the model where lvm2 is the authoritative owner of info about devices - note - we also tried the 'model' where the info is held within target - our mdraid dm wrapper - but it has more troubles compared with very clear thin logic. So from the lvm2 position - we do not have any plans to change this proven model.

This does not surprise me. lvm2 already has the infrastructure to store its own metadata and update it in a crash-safe way, so having the kernel be able to store additional metadata would be of no benefit to lvm2. The intended use case for this feature is tools that are dm-thin specific and do not already have such infrastructure.

> What you are asking for is - that 'kernel' module is doing all the job - and lvm2 would be obtaining info from the kernel metadata - and eventually you would be able to command everything with ioctl() interface and letting the complexity sit completely in kernel - but as explained our design is heading in opposite direction - what can be done in user-space stays in user space and kernel does the necessary minimum, which can be then much easier developed and traced.

Not at all. I just want userspace to be able to stash some data in each thin and retrieve it later. The complex logic would still remain in userspace.

That’s why I dropped the “lookup thin by blob” functionality: it requires a new data structure in the kernel, and userspace can achieve almost the same effect with a cache. Qubes OS has a persistent daemon that has exclusive ownership of the storage, so there are no cache invalidation problems. The existing thin pool already has a B-tree that could store the blob, so no new data structures are required on the kernel side.
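For concreteness, the blob I have in mind is tiny. Here is a purely illustrative sketch of the kind of record a dm-thin-specific tool could serialize into such a per-thin slot; the field names and the 256-byte budget are invented for the example, since no such kernel interface exists today:

    # Hypothetical example only: neither this record format nor the 256-byte
    # limit comes from any existing dm-thin interface.
    import json

    def encode_thin_blob(name, revision, ephemeral):
        blob = json.dumps({
            "name": name,            # human-readable volume name
            "rev": revision,         # which snapshot generation this is
            "ephemeral": ephemeral,  # discard this thin on next daemon start?
        }).encode()
        assert len(blob) <= 256      # assume a small fixed per-thin budget
        return blob

    # e.g. stored when a thin device is created, and read back on startup to
    # rebuild the daemon's view of the pool without any external metadata.
    print(encode_thin_blob("vm-work-private", 7, False))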
>>> Combining many steps together into a single 'kernel' call just increases already large range of errors. So in many case we simply do favour to keep operation more 'low-level-atomic' even at slight higher performance price (as said - we've never seen a creation of snapshot to be 'msec' critical operation - as the 'suspend' with implicit flush & fsfreeze itself might be far more expensive operation.
>>
>> Qubes OS should never be snapshotting an in-use volume of any kind. Right now, there is one case where it does so, but that is a bug, and I am working on fixing it. A future API might support snapshotting an in-use volume, but that would likely require a way to tell the VM to freeze its own filesystem.
>
> Yeah - you have very unusual use case - in fact lvm2 goal is usually to support as much things as we can while devices are in-use so user does not need to take them offline - which surely complicates everything a lot - also there was basically never any user demand to operate with offline device in very quick way - so admittedly not the focused area of development.

I wouldn’t exactly say my use case is *that* unusual. My understanding is that Docker has the same one, and they too resorted to bypassing lvm2 and using raw dm-thin. Docker does have the advantage that the filesystem is assumed to be trusted, so they do not need eager zeroing. Stratis also uses dm-thin, and it, too, avoids using LVM.

>>>>> But IMHO creation and removal of thousands of devices in very short period of time rather suggest there is something sub-optimal in your original software design as I'm really having hard time imagining why would you need this ?
>>>>
>>>> There very well could be (suggestions for improvement welcome).
>>>>
>>>>> If you wish to operate lots of devices - keep them simply created and ready - and eventually blkdiscard them for next device reuse.
>>>>
>>>> That would work for volatile volumes, but those are only about 1/3 of the volumes in a Qubes OS system. The other 2/3 are writable snapshots. Also, Qubes OS has found blkdiscard on thins to be a performance problem. It used to lock up entire pools until Qubes OS moved to doing the blkdiscard in chunks.
>>>
>>> Always make sure you use recent Linux kernels.
>>
>> Should the 5.16 series be recent enough?
>>
>>> Blkdiscard should not differ from lvremove too much - also experiment how the 'lvchange --discards passdown|nopassdown poolLV' works.
>>
>> I believe this was with passdown on, which is the default in Qubes OS. The bug was tracked down by Jinoh Kang in https://github.com/QubesOS/qubes-issues/issues/5426#issuecomment-761595524 and found to be due to dm-thin deleting B-tree nodes one at a time, causing large amounts of time to be wasted on B-tree rebalancing and node locking.
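(For reference, the chunked-discard workaround amounts to roughly the following. This is a simplified sketch using the BLKDISCARD ioctl, not the actual Qubes OS code, and the 256 MiB chunk size is an arbitrary example value. Splitting the range bounds how much thin-pool metadata each call touches, at the cost of more ioctls.)

    import fcntl, os, struct

    BLKDISCARD = 0x1277  # _IO(0x12, 119) from <linux/fs.h>

    def discard_in_chunks(path, chunk=256 * 1024 * 1024):
        # Discard the device in fixed-size pieces rather than in one huge
        # range, so no single BLKDISCARD call has to delete an unbounded
        # number of thin-pool mappings.
        fd = os.open(path, os.O_RDWR)
        try:
            size = os.lseek(fd, 0, os.SEEK_END)   # block device size in bytes
            offset = 0
            while offset < size:
                length = min(chunk, size - offset)
                fcntl.ioctl(fd, BLKDISCARD, struct.pack("QQ", offset, length))
                offset += length
        finally:
            os.close(fd)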
>>>>> I'm also unsure from where would arise any special need to instantiate that many snapshots - if there is some valid & logical purpose - lvm2 can have extended user space API to create multiple snapshots at once maybe (so i.e. create 10 snapshots with name-%d of a single thinLV)
>>>>
>>>> This would be amazing, and Qubes OS should be able to use it. That said, Qubes OS would prefer to be able to choose the name of each volume separately. Could there be a more general batching operation? Just supporting ‘lvm lvcreate’ and ‘lvm lvs’ would be great, but support for ‘lvm lvremove’, ‘lvm lvrename’, ‘lvm lvextend’, and ‘lvm lvchange --activate=y’ as well would be even better.
>>>
>>> There is kind of 'hidden' plan inside command line processing to allow 'grouped' processing.
>>>
>>> lvcreate --snapshot --name lv1 --snapshot --name lv2 vg/origin
>>>
>>> However there is currently no man power to proceed further on this part as we have other parts of code needed enhancements.
>>>
>>> But we may put this on our TODO plans...
>>
>> That would be great, thanks!
>
> Although the main reason to support this kind of API was the request to support an atomic snapshot of multiple LVs at once - but so far not a high priority.

Still, it will be useful if it becomes available.

>>>>> Not to mentioning operating that many thin volumes from a single thin-pool is also nothing close to high performance goal you try to reach...
>>>>
>>>> Would you mind explaining? My understanding, and the basis of essentially all my feature requests in this area, was that virtually all of the cost of LVM is the userspace metadata operations, udev syncing, and device scanning. I have been assuming that the kernel does not have performance problems with large numbers of thin volumes.
>>>
>>> The main idea behind the comment is - when there is increased disk usage - the manipulation with thin-pool metadata and locking will soon start to be a considerable performance problem.
>>>
>>> So while it's easy to have active 1000 thinLVs from a single thin-pool that are UNUSED, situation is dramatically different when there LVs would be in some heavy use load. There you should keep the active thinLV at low number of tens LVs, especially if you are performance oriented. The lighter usage and less provisioning and especially bigger block size - improve
>>
>> I can try to modify the storage pool so that LVs are not activated by default. That said, Qubes OS will always be provisioning-heavy. With the notable exception of volatile volumes, qubesd always snapshots a
>
> You definitely should keep active ONLY LVs you need to have active - it's impacting many other kernel areas and consumes system resources to keep 'unused' LVs active.

Thank you. There has already been work done in that direction, and I will see if I can finish it now that R4.1 is out.

>> volume at startup and then provides the snapshot to the VM. After shutdown, the original volume is renamed to be a backup, and the snapshot gets the name of the original volume. Bigger block sizes would substantially increase write amplification, as turning off zeroing is not an option for security reasons.
>
> For 'snapshot' heavy loads - smaller chunks are usually better - but it comes with price.
>
>> Is this just a workload that dm-thin is ill-suited for? Qubes OS does support storing VM images on either BTRFS or XFS files, and it could be that this is a better plan going forward.
>
> Not knowing the details - but as mentioned 'zeroing' is not needed for 'filesystem' security - modern filesystem will never let you read unwritten data - as it keeps its own map of written data - but of course if the user has root access to device with 'dd' it could read some 'unwritten' data on that device...

Qubes OS aims to defend against an attacker with kernel privilege in a VM, so zeroing is indeed necessary. The only way to avoid the overhead would be for dm-thin’s chunk size to match the block size of the paravirtualized disks, so that there are only full-chunk writes. The reflink storage pool does not have this problem because it works on 4K blocks natively.
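(To put rough numbers on that: in the worst case a small guest write that lands in a not-yet-provisioned chunk costs about one whole chunk of I/O, because dm-thin zeroes the chunk before the data lands. A back-of-the-envelope sketch with illustrative sizes:)

    # Worst case: a single 4 KiB guest write provisions a fresh chunk, which
    # is zeroed first, so total I/O is roughly the chunk size.
    write = 4 * 1024
    for chunk in (4 * 1024, 64 * 1024, 256 * 1024):
        print(f"{chunk // 1024:>3} KiB chunk -> ~{chunk // write}x write amplification")
    #   4 KiB chunk -> ~1x   (chunk size matches the paravirtualized block size)
    #  64 KiB chunk -> ~16x
    # 256 KiB chunk -> ~64x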
>>>> How much of a performance win can I expect from only activating the subset of volumes I actually use?
>>>
>>> I can only advice benchmark with some good approximation of your expected workload.
>>
>> That’s already on the Qubes OS team’s (very long) to-do list.
>
> I'd prioritize this - to get the best balance for performance - i.e. slightly bigger chunks could give you much better numbers - if your 'snapshot' workload is focused on small 'areas' so you know exactly where the focus should go (too many cooks spoil the broth)...
>
> So even jump 64k -> 256K can be significant

I will try to do a proper benchmark at some point.

--
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab