On 12/20/2024 2:42 PM, Michael Kelley wrote:
From: Roman Kisel <romank@xxxxxxxxxxxxxxxxxxx> Sent: Friday, December 20, 2024 11:14 AM
[...]
I will not object to your preferred path of allocating the
hyperv_pcpu_output_arg when CONFIG_HYPERV_VTL_MODE is set. For
me, it has been a worthwhile discussion to explore the alternatives. I do,
however, want to get further clarification about the spec requirements
and whether our current Linux code is meeting those requirements. More
on that below. And allow me to continue the discussion about a couple
of points you raised.
Thank you, Michael, for helping me explore the options! From where I
sit, we both seem to agree that an update is needed; where our opinions
diverge is whether we need to spend another per-CPU page. If at any
point you come to believe that your "Nack" is the best option going
forward, I'll put this patch aside.
[...]
I see three uses in current upstream code. And then there are close to
a dozen more uses in Nuno's year-old patch set for /dev/mshv that hasn't
been accepted upstream yet. Does that roughly match what you are referring
to?
While compiling my arguments and making sure they were worth the
readers' time, I neglected to provide references to the code. Yes, that
patch set might be a good illustration of the direction, thank you very
much for pointing it out!
Another reference might have been the kernel used with OpenHCL (6.6.x):
https://github.com/microsoft/OHCL-Linux-Kernel. Azure Boost runs it, and
that kernel has been delighting our customers with its rock-solid
foundation.
It contains lots of mshv code, and it is the code run in Azure as the
VTL2 kernel; we're working towards upstreaming it, e.g. here :) It uses
the hypercall output page, and the performance folks beat the hell out
of the code and found it up to snuff. There were a few findings about
memory consumption, of which I remember the perf folks tuning the
SWIOTLB size and tweaking the kernel memory layout to save on padding
and alignment. As for the per-CPU data, nothing allowed saving that
much; perhaps it would be nice to be able to compile out more. Yet, as
I understand it, x86_64 systems might not always be envisioned as the
commonplace foundation for building embedded systems, so while we're
looking to save a few KiB in hyperv, there's likely much more to save
elsewhere.
* dom0 and the VTL mode can share code quite naturally,
Yep.
To me, this has been the ever-shining light in the dark maze of other
arguments. It provides a cornerstone for building generic, well-tested
code, so developers can rest assured when introducing changes that the
changes are robust even without testing the myriad ways the hyperv code
is used. That is, validation wouldn't need to cover all the
permutations of conditional statements and #ifdef's that special-casing
would require.
* Not using the output page cannot account for all cases permitted by
the TLFS clause "don't overlap input and output, and don't cross page
boundaries",
This is what I don’t understand.
I meant that when implementing the generic case, not using the output
page prohibits us from receiving a full page of output when supplying a
full page of input. I agree that we can special-case to save a bit of
memory, as it appears we don't need that ability (4KiB of output in
response to 4KiB of input) right now. What I question, though, is
whether that specialization is needed at all, since it will lead to
more code, more function parameters, more comments and if statements,
and therefore more validation work, more cognitive load, and higher
maintenance costs.
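To make the foreclosed case concrete, a minimal sketch (the function
name is hypothetical, and it assumes hyperv_pcpu_output_arg has been
allocated):

/*
 * Sketch only: the generic case a dedicated output page keeps open,
 * i.e. a full page of input producing a full page of output. Must be
 * called with interrupts disabled so the per-CPU pages aren't reused
 * underneath us.
 */
static u64 hv_full_page_in_out(u64 control)
{
	void *input = *this_cpu_ptr(hyperv_pcpu_input_arg);
	void *output = *this_cpu_ptr(hyperv_pcpu_output_arg);

	/* ... fill up to HV_HYP_PAGE_SIZE bytes of input here ... */

	return hv_do_hypercall(control, input, output);
}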
* Not using the output page everywhere for consistency requires updating
call sites of `hv_do_hypercall` and friends, i.e. every place where a
hypercall is made needs to incorporate the offset/pointer at which the
output should start, or perhaps do some shenanigans with macros,
In my musing about getting rid of hyperv_pcpu_output_arg entirely, the
call signature of hv_do_hypercall() and friends would remain unchanged.
There would still be a virtual address passed as the output argument (which
might be NULL, and is *not* required to be page aligned). The caller of
hv_do_hypercall() must only ensure that the virtual address points to an
output area that is large enough to contain the output without crossing a
page boundary. The area must also not overlap with the input area. Given
that we currently have only 3 uses of hyperv_pcpu_output_arg in upstream
code, only those would need to be tweaked to split the hyperv_pcpu_input_arg
page into an input and output area. All other callers of hv_do_hypercall()
would be unchanged.
Your approach appears solid to me! Again, I'm going to beat my drum :)
I believe we are deciding on more than just a small optimization here;
it is a clever way of saving a bit of memory, and it did bring a
satisfying "wow, nice" to my mind.
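If I understand your musing right, the tweak for those three call sites
might look roughly like what hv_pci_read_mmio() already does, carving
both areas out of the single per-CPU input page (the hv_dummy_*
structures below are placeholders for a real hypercall's layout):

/* Placeholder layouts for a hypothetical hypercall. */
struct hv_dummy_input  { u64 gpa; u32 size; u32 pad; };
struct hv_dummy_output { u64 value; };

/*
 * Sketch only: split the per-CPU input page into non-overlapping
 * input and output areas. Must be called with interrupts disabled.
 * Both areas stay within one page, and the output is kept 8-byte
 * aligned per the TLFS alignment rules.
 */
static u64 hv_do_split_page_hypercall(u64 control)
{
	struct hv_dummy_input *input;
	struct hv_dummy_output *output;

	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
	/* Output begins right after the input area, in the same page. */
	output = (void *)input + ALIGN(sizeof(*input), 8);

	memset(input, 0, sizeof(*input));
	/* ... fill in the input fields for the specific hypercall ... */

	return hv_do_hypercall(control, input, output);
}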
We are drafting the constraints for future code, as it appears; without
knowing better, perhaps it is wisest not to constrain. As the quote
goes:
"... We should forget about small efficiencies, say about 97% of the
time: premature optimization is the root of all evil. Yet we should not
pass up our opportunities in that critical 3%. A good programmer ...
will be wise to look carefully at the critical code; but only after that
code has been identified. ..."
* Not using the output page leads to negligible savings,
It is hard for me to see how to make not using the hypercall output
page look like a profitable engineering endeavor; it really comes off
as dancing to the perf-scare/feature-creep tune for peanuts.
There would be some simplification in hv_common_free(), hv_common_init(),
and hv_common_cpu_init() if there was no hyperv_pcpu_output_arg. :-)
Agreed!
In my opinion, we should use the hypercall output page for the VTL mode
just as dom0 does, to share code and reduce code churn. Had both used
some generic `hv_get_registers` instead of the specialized `get_vtl`
(which exists for no compelling practical reason, imo, as it just gets
a vCPU register, nothing more, nothing less), this code path would've
been better tested, and none of this patching would have been needed.
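Something along these lines, perhaps (`hv_get_registers` is a
hypothetical name; the sketch assumes the per-CPU output page is
allocated and reuses the structures get_vtl() already uses on x86):

/*
 * Sketch only: a generic single-register read via
 * HVCALL_GET_VP_REGISTERS, with the input and output areas on
 * separate per-CPU pages so they cannot overlap.
 */
static u64 hv_get_registers(u32 reg_name)
{
	struct hv_get_vp_registers_input *input;
	struct hv_get_vp_registers_output *output;
	unsigned long flags;
	u64 status, value = 0;

	local_irq_save(flags);
	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
	output = *this_cpu_ptr(hyperv_pcpu_output_arg);

	memset(input, 0, struct_size(input, element, 1));
	input->header.partitionid = HV_PARTITION_ID_SELF;
	input->header.vpindex = HV_VP_INDEX_SELF;
	input->header.inputvtl = 0;
	input->element[0].name0 = reg_name;

	status = hv_do_rep_hypercall(HVCALL_GET_VP_REGISTERS, 1, 0,
				     input, output);
	if (hv_result_success(status))
		value = output->as64.low;

	local_irq_restore(flags);
	return value;
}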
FWIW, the historical context is that at the time get_vtl() was implemented,
all VP registers that Linux needed to access were available as synthetic
MSRs on the x86 side. Going back to its origins 15 years ago, the Linux code
for Hyper-V always just used the synthetic MSRs. get_vtl() was the first time
that x86 code needed the HVCALL_GET_VP_REGISTERS hypercall because
HV_REGISTER_VSM_VP_STATUS has no synthetic MSR equivalent. There still
is no 'hv_get_registers' function in upstream code on the x86 side.
hv_get_vpreg() exists only on the arm64 side where there are no synthetic
MSRs or equivalent.
In hindsight, hv_get_vpreg() could have been added on the x86 side, and
then get_vtl() implemented on top of that. And I made the get_vtl() situation
worse by giving Tianyu bad advice about overlapping the output area with
the input area. :-(
I once had a manager at Microsoft who said he "reserved the right to
wake up smarter every day." I've had to claim that right as well from
time to time. :-)
Michael, thank you very much for your efforts! At that time, no one
knew otherwise, iiuc. You had the willpower to make things happen and
the vision to bring the solution over the finish line, and imo the
Hyper-V community just cannot possibly thank you enough for your
incredible work. It is my genuine hope that adding the Fixes tag didn't
overshadow that fact for anyone, as those five letters most assuredly
don't tell any of the story.
I'd wait a few days and then would likely prefer to run with Wei's
permission to send this patch as-is in v2, unless some damning evidence
presents itself.
Again, I won't object to your taking that path.
But regarding compliance with the TLFS, take a look at the code for
hv_pci_read_mmio() and hv_pci_write_mmio(). Do you think that code
violates the spec? If not, what would a violation look like in the context
of this discussion? If current code is in violation, then we should fix that.
Appreciate your advice! I don't think they do. The code doesn't overlap
input and output, and to my eye, the arguments are sized and aligned
appropriately. That does not violate the TLFS. I'd think
`hv_pci_read_mmio` might have used the output page, as that page is
meant for output, to stick to one pattern, much like the `get_vtl` in
question.
Now, it seems these functions run in VTL0 only (but I'd leave that for
someone else to judge), where we currently don't allocate the output
page, and `hv_pci_read_mmio` is the only function doing that input-page
split, so special-casing might be justified; perhaps adding a comment
would be appropriate. Unless there are prospects of merging that with
dom0, I'd leave the code be.
Michael
--
Thank you,
Roman