On 12/19/2024 6:01 PM, Michael Kelley wrote:
From: Roman Kisel <romank@xxxxxxxxxxxxxxxxxxx> Sent: Thursday, December 19, 2024 3:39 PM
On 12/19/2024 1:37 PM, Michael Kelley wrote:
[...]
I would agree that the percentage savings is small. VMs often have
several hundred MiB to a few GiB of memory per vCPU. Saving a
4K page out of that amount of memory is a small percentage. The
thing to guard against, though, is applying that logic in many different
places in Linux kernel code. :-) The Hyper-V support in Linux already
has multiple pre-allocated per-vCPU pages, and by being a little bit
clever we might be able to avoid another one.
[...]
We will also need the vCPU assist pages and the context pages for the
lower VTLs, and the data don't currently occupy the entire pages. Yet
imho it is prudent to leave some wiggle room instead of painting
ourselves into the corner. We're not writing the code for MCUs, are
working under different constraints, and, yes, reaching to use a page as
that's the hard currency of virtualization imho. The numbers show that
savings are negligible per-CPU but these savings come at a cost of
making assumptions what will not happen in the future thus placing a bet
against what the specification says.
It's not even the hyperv code that is the largest consumer of the per-
CPU data, not even close. Looking at the `vmlinux`'es `.data..percpu`
section, there are almost 200 entries, and roughly one quarter is
pointer-sized so who really knows how much is going to be allocated per-
CPU. The top ten statically allocated are
nm -rS --size-sort ./vmlinux | grep -vF ffffffff8 | sed 10q
000000000000c000 0000000000008000 d exception_stacks
0000000000006000 0000000000005000 D cpu_tss_rw
0000000000002000 0000000000004000 D irq_stack_backing_store
000000000001b5c0 0000000000003180 d timer_bases
0000000000017000 0000000000003000 d bts_ctx
0000000000015520 0000000000001450 D cpu_hw_events
000000000000b000 0000000000001000 D gdt_page
0000000000014000 0000000000001000 d entry_stack_storage
0000000000001000 0000000000001000 D cpu_debug_store
0000000000021d80 0000000000000c40 D runqueues
on a configuration that is the bare minimum, no fluff. We could invest
into looking what would be the cost of compiling out `bts_ctx` or
`cpu_debug_store` instead of adding more if statements and making the
code look tricky.
I agree that a hypercall could produce up to 4 KiB of output in
response to up to 4 KiB of input. But the guest controls how much
input it passes. Furthermore, for all the hypercalls I'm familiar with,
the specification of the hypercall tells the max amount of output it
will produce in response to the input. That allows the guest to
know how much output space it needs to allocate and provide to
the hypercall.
I will concede that it's possible to envision a hypercall with a
specification that says "May produce up to 4 KiB of output. A header
at the beginning of the output says how much output was produced."
In that case, the guest indeed must supply a full page of output space.
But I don't think we have any such hypercalls now, and adding such a
hypercall in the future seems unlikely. Of course, if such a hypercall
did get added and Linux used that hypercall, Linux would need to
supply a full page for the hypercall output. That page wouldn't
necessarily need to be a pre-allocated per-vCPU hypercall output
page. Depending on the usage circumstances, that full page might be
able to be allocated on-demand.
But assume things proceed as they are today where Linux can limit
the amount of hypercall output based on the input. Then I don't
see a violation of the contract if Linux limits the output and fits
it within a page that is also being shared in a non-overlapping
way with any hypercall input. I wouldn't allocate a per-vCPU
hypercall output page now for a theoretically possible
hypercall that doesn't exist yet.
Given that:
* Using the hypercall output page is plumbed throughout the dom0 code,
* dom0 and the VTL mode can share code quite naturally,
* Not using the output page cannot acccount for all cases permitted by
the TLFS clause "don't overlap input and output, and don't cross page
boundaries",
* Not using the output page everywhere for consistency requires updating
call sites of `hv_do_hypercall` and friends, i.e. every place where a
hypercall is made needs to incorporate the offset/pointer at which the
output should start, or perhaps do some shenanigans with macro's,
* Not using the output page leads to negligible savings,
it is hard to see for me how to make not using the hypercall output page
look as a profitable enginnering endeavor, really comes off as dancing
to the perf scare/feature creep tune for peanuts.
In my opinion, we should use the hypercall output page for the VTL mode
as dom0 does to share code and reduce code churn. Had we used some
`hv_get_registers` in both instead of the specialized for no compelling
imo practical reason `get_vtl` (as it just gets a vCPU register, nothing
more, nothing less), this code path would've been better tested, and any
of this patching would not have been needed.
I'd wait for few days and then would likely prefer to run with Wei's
permission to send this patch in v2 as-is unless some damning evidence
presents itself.
Michael
[...]
--
Thank you,
Roman