On Fri, Jun 01, 2018 at 08:21:52AM -0400, John Ferlan wrote:
[...] First thanks for taking the time to elaborate - it is helpful. Much better than just stating no because I don't like it ;-).
And thanks for appreciating that =)
1. Add poll-max-ns property of each iothread: https://www.redhat.com/archives/libvir-list/2017-February/msg01047.htmlThis is about tunables. It might change the performance/latency of QEMU slightly, but that's about it.and there are those that would find it useful to have (bz 1545732)... IfMuch of what I read (underscore emphasis mine) suggests otherwise: - "There is a lot of sentiment *against* providing too many low level knobs like this _without proper guidance_ on how they should be set." - "To address this issue QEMU implements self-tuning algorithm that modifies the current polling time to _adapt to different workloads_ and it can also fallback to blocking syscalls." - "The QEMU commits say the tunables all default to sane parameters so I'm inclined to say we ignore them at the libvirt level entirely." - "I'm fine if libvirt doesn't add a dedicated API for setting <iothread> polling parameters. It's _unlikely_ that users will need to change the setting. In an emergency (e.g. disabling it due to a performance regression) _they can_ use <qemu:arg value='-newarg'/>." The only points for the polling to be enabled were along the lines of: It _may_ help in _some_ workloads when you want a bit more throughput for the price of more CPU cycles. With vague definitions of how much CPU, throughput and without description of how to find out if a particular workload fits this. Even when all of that is there, then you need yet another explanation on how to calculate the value to be set. And then it all goes down back to the fact that QEMU is already doing some automated balancing for this (because they can, because this is not part of the guest ABI). That way you can never actually say if it will help and how much. So for this one it is a clear "NO".Another opposing viewpoint is: https://bugzilla.redhat.com/show_bug.cgi?id=1545732#c8 If it were only a "documentation issue" - someone would have figured out much earlier how to get beyond that.
I didn't mean it's "just" that, it's one of the parts.
FWIW: Without guidance in the Contributor's Guide over what is/isn't acceptable I have a feeling we'll continue to see patches such as this and the one below.
It's a good point that we could have that written somewhere. It is very difficult to write it so that it is as precise as possible, but also generic enough so that we don't need many exceptions.
Still my point was less the actual feature or details of it, but rather my feeling is there are more examples where exposing low level knobs has been panned in the past. From the lack of details/knowledge I have/had about TSEG during my original review - I saw it as just another low level knob and while I saw value in the knob, I knew there has been a high degree of sentiment in previous patches regarding adding such knobs so I wanted to "be sure" it was desired/necessary adjustment by more than just my opinion (it is a community after all, right)?
Yes, you're absolutely right. We should do that and make sure we understand the code (part of what the review is for). It just twists my toes when the review refers to someone else in a CYA way. We need to express our own opinions as those are the ones forming the final consensus. Of course common practice, contributor guidelines and all other documentation should be taken into consideration as well. But if we just refer to other people opposing something we might end up having an opinion that's not our own (I mean the community's one). "5 monkeys and a ladder" kind of a situation. Of course we are all constantly adjusting our own opinions along the way, but the more we need to understand *why* we should (or should not) adjust them.
BTW: I'm not in disagreement that I found poll-max-ns as an odd tunable to add for many of the reasons supplied. Of course I'm the one stuck with the bz /-| and providing the "bad news" or just continually move the bz to a future release ;-)
If there is no movement going on, then the BZ needs to be either closed or what can be done needs to be found out.
you don't have enough memory and your VM is paging like crazy, you just add more memory. Requires a reboot. Likewise, if your VM doesn't boot you add/alter the magic TSEG value using some algorithm as described above. From a 90,000 foot customer view is there a difference? It's just a knob that the hypervisor has to allow something to be accomplished for which libvirt provides the attribute to fine tune.Yeah, you're right. That's why I think both of them should be exposed. Some small differences to other knobs, just for completeness: - by the time you realize that the VM doesn't have enough memory, it might be too late as reboot isn't that easy of a thing for some production workloads - on the other hand, you have a way to see that happening (compare it to the polling interval above which you have no idea without proper benchmarks) One more thing that's common to the memory size (and I hope TSEG in the future) is that in mgmt apps the TSEG setting already has a place where to live and it is exactly where the memory size lives currently. In templates. You have "small vm" teplate and "ginormous vm". For the latter one you can just add a setting of TSEG _once_ per file. How's TSEG better and easier than memory? You figure it once for the VM settings (and possibly firmware, but that's not going to change much) and then it doesn't depend on the workflow, not even a little bit. Anyway.True TSEG is much more bounded and in your face when it doesn't work. There still is this voice rumbling around in the back of my head that says QEMU should be the owner of deciding upon the algorithm for the value. Unlike a performance knob, it seems there's a solid way to calculate a 'correct value' to make the boot work. The problem is if that automatic calculation ended up being wrong at some point, then there'd be no way to change the value without adding a knob. So, in a
Yes. That's one of the problems. The other one is that this is part of the guest ABI. However this one is not that big of a deal, or at least it looks like it. I'm going to go with not setting the default based on what we know. And maybe later we can set it when the domain is starting by gathering that info from the QEMU process.
way the knob could be the exception rather than the rule. It's a mechanism to make sure the guest can boot given outside interference. What I see as the differences between tunables that make it in andtunables that don't is that: - the former are usually understandable and easy to see what they are in bare metal. Everyone knows what memory is in the hardware, how it looks like, how much is "not enough" and how much is "more that needed". We are used to those things back from the hardware times, even to changing them.Still the calculation of a proper TSEG value is based on multiple factors (memory/vcpus). Historically I've found these also need a fudge factor built in - it's the fudge factor that is the sticking point. On real hardware you'd be told - well you don't have enough memory, so buy some more - it's like printing money at that point for the sales guy. You'll be guided to buy a more expensive and larger piece than you may need to "ensure future expand-ability". You may not use the entire thing, but you have it. For software it's a much easier knob.- The latter is usually something we were not able control in HW or didn't even know it existed. For virtual workloads it might be completely different, but sometimes people are forgetting that.Sounds like a job for virtuned (or virtunefixd or virhighavaild). Years ago I worked on a project that would essentially show bottlenecks for the OS and provide the capability to "fix" those via various means (whether it was CPU, memory, or disk overutilization... even deadlocks and cluster quorum hangs).2. Add support for qcow2 cache (many times, but most recently): https://www.redhat.com/archives/libvir-list/2017-September/msg00553.htmlSimilarly here, it allows setting something that can be (at least slightly) abstracted and in the worst case the performance will be slightly hindered.This one I understand more why it would be rejected, but still providing the value allows certain things to work a whole lot better. I also know Berto has been "fine tuning" the algorithm in later QEMU releases - so that's like hitting a moving target.This is very similar, it's just that there is no automatic balancing done by QEMU. But it usually is also about how you write the docs. The option can make very much sense, but if someone writes "Setting asdf can allows fine-tuning of the asdf value in the underlying hypervisor", then no matter how much that value makes sense it is not reflected in the docs. That's why I tried to add all the relevant info into the docs so that it's clear what it is doing, how to set it, to what values and when. Apart from the fact that there is a "link" to some file in the QEMU repository that someone is supposed to read, plus the decision for the value determination are written there (but not why they are not automatically calculated, or maybe I missed it), it: - is not possible to try using the <qemu:arg value='-newarg'/> approach - the docs say: <b>In general you should leave this option alone, unless you are very certain you know what you are doing.</b> So in this particular case I wouldn't be totally against having it there. If you don't want to use it, then "just don't touch that" is an approach that shouldn't hurt anyone.Search the formatdomain page for 'unless' - there are examples where knobs have been added that aren't well described and the consumer better know what they're doing in order to use them. Perhaps another case of alibistic behaviors (a/k/a CYA).
Maybe it's because I'm used to all this "QEMU settings talk", but I understand all those that have the "unless you are very certain" disclaimer. Yeah, event_idx might not have been added before, but my guess is that happened before everyone wanted to add every existing option for QEMU and we didn't know what this is going to lead to. Anyway you are right, "the consumer better know what they're doing", that's why the disclaimer is there. And I, personally, would take such patch in.
[...]And there are those that could say if the underlying hypervisor knows that for certain memory sizes and/or vCPU counts that the TSEG will be too small for specific machine types that then the underlying hypervisor should be the one to "choose" a value that's programatically appropriate which to a degree IIUC is the argument being used against allowing a libvirt knob for the poll-max-ns and qcow2 cache sizes.And they would be wrong as for TSEG the hypervisor a) doesn't know that and b) cannot change that once it was started.I think you lost me here.... From the bz problem statement: "The necessary size is technically predictable (see bug 1468526 comment 8 point (2a) e.g.), but the formula is neither exact nor easy to describe, so as a first step, libvirt should please expose this value in an optional element or attribute." I read that as a proper size could be calculated by the hypervisor, but "just in case" let's make sure we have a fallback option. Perfectly reasonable to me and even more pointed, (so far) only for q35. Of course it's possible I read it wrong.
No, you read it right. You nailed it above with the fudge factor. Since visible from the guest as a HW-related information it is also part of guest ABI and we strive to keep that stable. But that one I already discussed above. We'll see how v2 goes ;)
John [...]
Attachment:
signature.asc
Description: Digital signature
-- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list