Re: [PATCH 3/5] conf, schema, docs: Add support for TSEG size setting

On Fri, Jun 01, 2018 at 08:21:52AM -0400, John Ferlan wrote:

[...]

First thanks for taking the time to elaborate - it is helpful. Much
better than just stating no because I don't like it ;-).


And thanks for appreciating that =)


1. Add poll-max-ns property of each iothread:
https://www.redhat.com/archives/libvir-list/2017-February/msg01047.html


This is about tunables.  It might change the performance/latency of QEMU
slightly, but that's about it.


and there are those that would find it useful to have (bz 1545732)...

Much of what I read (underscore emphasis mine) suggests otherwise:

- "There is a lot of sentiment *against* providing too many low level knobs
   like this _without proper guidance_ on how they should be set."

- "To address this issue QEMU implements self-tuning algorithm that
modifies
   the current polling time to _adapt to different workloads_ and it can
also
   fallback to blocking syscalls."

- "The QEMU commits say the tunables all default to sane parameters so I'm
   inclined to say we ignore them at the libvirt level entirely."

- "I'm fine if libvirt doesn't add a dedicated API for setting <iothread>
   polling parameters.  It's _unlikely_ that users will need to change the
   setting.  In an emergency (e.g. disabling it due to a performance
   regression) _they can_ use <qemu:arg value='-newarg'/>."

The only points in favour of exposing the polling setting were along the lines of:

It _may_ help in _some_ workloads when you want a bit more throughput
for the price of more CPU cycles.

That comes with only vague definitions of how much CPU and throughput, and no
description of how to find out whether a particular workload fits this.  Even
when all of that is there, you still need yet another explanation of how to
calculate the value to be set.  And then it all comes back to the fact that
QEMU is already doing some automated balancing for this (because they can,
because this is not part of the guest ABI).  That way you can never actually
say whether it will help and by how much.

So for this one it is a clear "NO".
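
For reference, the emergency escape hatch mentioned above is the QEMU
command-line passthrough namespace.  A minimal sketch (untested; whether
'-set' reaches the iothread objects libvirt itself creates is an assumption
on my side, and 32768 is just an illustrative value):

  <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
    ...
    <qemu:commandline>
      <qemu:arg value='-set'/>
      <qemu:arg value='object.iothread1.poll-max-ns=32768'/>
    </qemu:commandline>
  </domain>

Anything done this way is explicitly unsupported, which is kind of the point.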


Another opposing viewpoint is:

https://bugzilla.redhat.com/show_bug.cgi?id=1545732#c8

If it were only a "documentation issue" - someone would have figured out
much earlier how to get beyond that.


I didn't mean it's "just" that; it's one of the parts.

FWIW: Without guidance in the Contributor's Guide over what is/isn't
acceptable I have a feeling we'll continue to see patches such as this
and the one below.


It's a good point that we could have that written somewhere.  It is very
difficult to write it so that it is as precise as possible, but also
generic enough so that we don't need many exceptions.

Still, my point was less about the actual feature or its details, and more
that there are other examples where exposing low level knobs has been panned
in the past.  Given the lack of details/knowledge I have/had about TSEG
during my original review, I saw it as just another low level knob, and
while I saw value in the knob, I knew there had been a high degree of
sentiment in previous patches against adding such knobs, so I wanted to "be
sure" it was a desired/necessary adjustment by more than just my opinion (it
is a community after all, right?).


Yes, you're absolutely right.  We should do that and make sure we
understand the code (part of what the review is for).  It just twists my
toes when a review refers to someone else in a CYA way.  We need to
express our own opinions as those are the ones forming the final
consensus.  Of course common practice, contributor guidelines and all
other documentation should be taken into consideration as well.  But if
we just refer to other people opposing something, we might end up holding
an opinion that's not our own (I mean the community's one).  A "5 monkeys
and a ladder" kind of situation.  Of course we are all constantly
adjusting our own opinions along the way, but all the more we need to
understand *why* we should (or should not) adjust them.

BTW: I'm not in disagreement - I found poll-max-ns an odd tunable to add,
for many of the reasons supplied.  Of course I'm the one stuck with the bz
/-| and providing the "bad news", or just continually moving the bz to a
future release ;-)


If there is no movement going on, then the BZ either needs to be closed or
we need to find out what can be done.

If you don't have enough memory and your VM is paging like crazy, you just
add more memory.  Requires a reboot.  Likewise, if your VM doesn't boot you
add/alter the magic TSEG value using some algorithm as described above.
From a 90,000 foot customer view is there a difference?  It's just a knob
that the hypervisor has to provide so something can be accomplished, and
libvirt provides the attribute to fine-tune it.


Yeah, you're right.  That's why I think both of them should be exposed.
Some small differences from other knobs, just for completeness:

- by the time you realize that the VM doesn't have enough memory, it might
  be too late, as a reboot isn't that easy of a thing for some production
  workloads

- on the other hand, you have a way to see that happening (compare it to
  the polling interval above, about which you have no idea without proper
  benchmarks)

One more thing that's common to the memory size (and, I hope, TSEG in the
future) is that in mgmt apps the TSEG setting already has a place to live,
and it is exactly where the memory size lives currently.  In templates.
You have a "small vm" template and a "ginormous vm" template.  For the
latter you can just add a TSEG setting _once_ per file.  How is TSEG even
better and easier than memory?  You figure it out once for the VM settings
(and possibly firmware, but that's not going to change much) and then it
doesn't depend on the workload, not even a little bit.
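
To make the template example concrete, the shape proposed in this series is
roughly the following (taken from my reading of the patches; the exact
element placement and attribute names may still shift in v2, and 48 MiB is
just an illustrative value):

  <features>
    <smm state='on'>
      <tseg unit='MiB'>48</tseg>
    </smm>
  </features>

The "ginormous vm" template carries that once, next to its memory setting,
and it never needs to change with the workload.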

Anyway.


True, TSEG is much more bounded and in your face when it doesn't work.
There still is this voice rumbling around in the back of my head that
says QEMU should be the owner of deciding upon the algorithm for the
value. Unlike a performance knob, it seems there's a solid way to
calculate a 'correct value' to make the boot work. The problem is if
that automatic calculation ended up being wrong at some point, then
there'd be no way to change the value without adding a knob. So, in a

Yes.  That's one of the problems.  The other one is that this is part of
the guest ABI.  However this one is not that big of a deal, or at least
it looks like it.  I'm going to go with not setting the default based on
what we know.  And maybe later we can set it when the domain is starting
by gathering that info from the QEMU process.

way the knob could be the exception rather than the rule. It's a
mechanism to make sure the guest can boot given outside interference.
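
For concreteness, on the QEMU side this boils down (as far as I can tell) to
a single q35 machine property, so the libvirt element would map to something
like the following on the command line (48 again being purely illustrative):

  -global mch.extended-tseg-mbytes=48

which is also why it is baked into the guest ABI once the domain has started.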


What I see as the difference between tunables that make it in and tunables
that don't is that:

- the former are usually understandable and it is easy to see what they are
  on bare metal.  Everyone knows what memory is in the hardware, what it
  looks like, how much is "not enough" and how much is "more than needed".
  We are used to those things from back in the hardware times, even to
  changing them.


Still, the calculation of a proper TSEG value is based on multiple factors
(memory/vcpus).  Historically I've found these also need a fudge factor
built in - it's the fudge factor that is the sticking point.  On real
hardware you'd be told - well, you don't have enough memory, so buy some
more - it's like printing money at that point for the sales guy.  You'll be
guided to buy a more expensive and larger piece than you may need, to
"ensure future expandability".  You may not use the entire thing, but you
have it.  For software it's a much easier knob.

- The latter are usually something we were not able to control in HW, or
  didn't even know existed.  For virtual workloads it might be completely
  different, but sometimes people forget that.


Sounds like a job for virtuned (or virtunefixd or virhighavaild). Years
ago I worked on a project that would essentially show bottlenecks for
the OS and provide the capability to "fix" those via various means
(whether it was CPU, memory, or disk overutilization... even deadlocks
and cluster quorum hangs).

2. Add support for qcow2 cache (many times, but most recently):
https://www.redhat.com/archives/libvir-list/2017-September/msg00553.html



Similarly here, it allows setting something that can be (at least
slightly) abstracted and in the worst case the performance will be
slightly hindered.

This is one where I understand more why it would be rejected, but still,
providing the value allows certain things to work a whole lot better.  I
also know Berto has been "fine tuning" the algorithm in later QEMU releases
- so that's like hitting a moving target.
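
For context, the knob in question maps to QEMU's own qcow2 cache options;
on a bare command line it is something along these lines (the size is purely
illustrative, and this is not the libvirt XML shape proposed in that thread):

  -drive file=/var/lib/libvirt/images/big.qcow2,format=qcow2,if=virtio,l2-cache-size=4194304

with the recurring guidance being that the L2 cache should be large enough
to cover the image's working set.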


This is very similar, it's just that there is no automatic balancing done
by QEMU.  But it usually is also about how you write the docs.  The option
can make a lot of sense, but if someone writes "Setting asdf allows
fine-tuning of the asdf value in the underlying hypervisor", then no matter
how much sense that value makes, it is not reflected in the docs.  That's
why I tried to add all the relevant info into the docs so that it's clear
what it is doing, how to set it, to what values and when.

Apart from the fact that there is a "link" to some file in the QEMU
repository that someone is supposed to read, and that the reasoning for
determining the value is written there (but not why it is not calculated
automatically, or maybe I missed it):

- it is not possible to use the <qemu:arg value='-newarg'/> approach here

- the docs say:
  <b>In general you should leave this option alone, unless you
     are very certain you know what you are doing.</b>

So in this particular case I wouldn't be totally against having it
there.  If you don't want to use it, then "just don't touch that" is an
approach that shouldn't hurt anyone.


Search the formatdomain page for 'unless' - there are examples where
knobs have been added that aren't well described and the consumer better
know what they're doing in order to use them. Perhaps another case of
alibistic behaviors (a/k/a CYA).


Maybe it's because I'm used to all this "QEMU settings talk", but I
understand all those that have the "unless you are very certain"
disclaimer.  Yeah, event_idx might not have been added before, but my
guess is that happened before everyone wanted to add every existing QEMU
option and we didn't know what this was going to lead to.

Anyway you are right, "the consumer better know what they're doing",
that's why the disclaimer is there.  And I, personally, would take such a
patch in.

[...]

And there are those who could say that if the underlying hypervisor knows
that, for certain memory sizes and/or vCPU counts, the TSEG will be too
small for specific machine types, then the underlying hypervisor should be
the one to "choose" a value that's programmatically appropriate - which, to
a degree IIUC, is the argument being used against allowing a libvirt knob
for the poll-max-ns and qcow2 cache sizes.


And they would be wrong, as for TSEG the hypervisor a) doesn't know that
and b) cannot change it once the domain has started.


I think you lost me here.... From the bz problem statement:

"The necessary size is technically predictable (see bug 1468526 comment
8 point (2a) e.g.), but the formula is neither exact nor easy to
describe, so as a first step, libvirt should please expose this value in
an optional element or attribute."

I read that as a proper size could be calculated by the hypervisor, but
"just in case" let's make sure we have a fallback option. Perfectly
reasonable to me and even more pointed, (so far) only for q35. Of course
it's possible I read it wrong.


No, you read it right.  You nailed it above with the fudge factor.
Since it is visible from the guest as HW-related information, it is also
part of the guest ABI and we strive to keep that stable.  But that one I
already discussed above.

We'll see how v2 goes ;)


John

[...]


