Re: [PATCH 2/2] hyperv: Do not overlap the input and output hypercall areas in get_vtl(void)

Roman Kisel <romank@xxxxxxxxxxxxxxxxxxx> · Thu, 19 Dec 2024 15:39:10 -0800

On 12/19/2024 1:37 PM, Michael Kelley wrote:
From: Roman Kisel <romank@xxxxxxxxxxxxxxxxxxx> Sent: Thursday, December 19, 2024 10:19 AM

[...]

There will surely be more hypercall usage in the VTL mode that return
data and require the output pages as we progress with upstreaming the
VTL patches. Enabling the hypercall output pages allows to fix the
function in question in a very natural way, making it possible to
replace with some future `hv_get_vp_register` that would work for both
dom0 and VTL mode just the same.

All told, if you believe that we should make this patch a one-liner,
I'll do as you suggested.

FWIW, Roman and I had this same discussion back in August. See [1].
I'll add one new thought that wasn't part of the August discussion.
Michael, thank you very much for helping out in finding a better
solution!

To my knowledge, the hypercalls that *may* use a full page for input
and/or a full page for output don't actually *require* a full page. The
size of the input and output depends on how many "entries" the
hypercall is specified to process, where "entries" could be registers,
memory PFNs, or whatever. I would expect the code to invoke these
hypercalls must already deal with the case where the requested number
of entries causes the input or output size to exceed one page, so the
code just iterates making multiple invocations of the hypercall with
a "batch size" that fits in one page.
That is what I see in the code, too. The TLFS requires to use a page
worth of data maximum ("cannot overlap or cross page boundaries"), hence
the hypercall code shall chunk up the input data appropriately.

It would be perfectly reasonable to limit the batch size so that a
"batch" of input or output fits in a half page instead of a full page,
avoiding the need to allocate hyperv_pcpu_output_arg. Or if the
input and output sizes are not equal, use whatever input vs. output
partitioning of a single page make sense for that hypercall. The
tradeoff, of course, is having to make the hypercall more times
with smaller batches. But if the hypercall is not in a hot path, this
might be a reasonable tradeoff to avoid the additional memory
allocation. Or if the hypercall processing time per "entry" is high,
the added overhead of more invocations with smaller batches is
probably negligible compared to the time processing the entries.
The hypervisor yields control back to the guest if the hypervisor
spends more than ~a dozen 1e-6 sec in the hypercall processing, and
the processing isn't done yet. When yielding the control back, the
hypervisor doesn't advance the instruction pointer so the guest can
process interrupts on the vCPU (if the guest didn't mask them), and
get back to processing the hypercall. That helps the guest stay
responsive if the guest chose to send larger batches. For smaller
batches, have to consider the cost of invoking the hypercall as you
are pointing out. On the topic of saving CPU time, there are also fast
hypercalls that pass data in the CPU registers.

This scheme could also be used in the existing root partition code
that is currently the only user of the hyperv_pcpu_output_arg.
I could see a valid argument being made to drop
hyperv_pcpu_output_arg entirely and just use smaller batches.
In my view, that is a reasonable idea to cut down on memory usage.
Here is what I get applying that general idea to this specific
situation (and sticking to 4KiB as the page size).

We are going to save 4KiB * N of memory where N is the number of cores
allocated to the root partition. Let us also introduce M that denotes
the amount of memory in KiB allocated to the root partition.

Given that the order of magnitude for N is 1 (log_10(N) ~= 1), and the
order of magnitude for M is 6..7, the savings (~=10^(N-M)=1e-5) look
rather meager to my eye. That might be a judgement call, I wouldn't
argue that.

What is worrisome is that the guest goes against the specification. The
specification decrees: the input and output areas for the hypercall
shall not cross page boundaries and shall not overlap.
Hence, the hypervisor is within its right to produce up to 4KiB of
output in response to up to 4KiBs of input, and we have:

```
sizeof(input) + sizeof(output) <= 2*sizeof(page)
```

But when the guest doesn't use the output page, we obviously have

```
sizeof(input) + sizeof(output) <= sizeof(page)
```
on the guest side so the contract is broken.

The hypervisor would need to know that the guest optimizes its memory
usage in this way, limiting what is allowed by the specification when
implementing any new hypercalls.

Or are there hypercalls where a smaller batch size doesn't work
at all or is a bad tradeoff for performance reasons? I know I'm not
familiar with *all* the hypercalls that might be used in root
partition or VTL code. If there are such hypercalls, I would be curious
to know about them.
Nothing that I could find in the specification. I wouldn't think that
justifies investing in creating specialized/special-cased functions
on the guest side without solid evidence that more code is needed. In
this particular case, one day, I'd love to replace `get_vtl` with
one generic function `hv_get_vp_registers` that works both for dom0 and
VTLs, plays by the book and does not require much/any explaining what is
going on inside it and why. I believe this will make maintenance easier.

Michael

[1] https://lore.kernel.org/linux-hyperv/SN6PR02MB415759676AEF931F030430FDD4BE2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

--
Thank you,
Roman