From: Roman Kisel <romank@xxxxxxxxxxxxxxxxxxx> Sent: Friday, December 20, 2024 11:14 AM
>
> On 12/19/2024 6:01 PM, Michael Kelley wrote:
> > From: Roman Kisel <romank@xxxxxxxxxxxxxxxxxxx> Sent: Thursday, December 19, 2024 3:39 PM
> >>
> >> On 12/19/2024 1:37 PM, Michael Kelley wrote:
>
> [...]
>
> > I would agree that the percentage savings is small. VMs often have
> > several hundred MiB to a few GiB of memory per vCPU. Saving a
> > 4K page out of that amount of memory is a small percentage. The
> > thing to guard against, though, is applying that logic in many different
> > places in Linux kernel code. :-) The Hyper-V support in Linux already
> > has multiple pre-allocated per-vCPU pages, and by being a little bit
> > clever we might be able to avoid another one.
> >
> > [...]
>
> We will also need the vCPU assist pages and the context pages for the
> lower VTLs, and the data don't currently occupy the entire pages. Yet
> imho it is prudent to leave some wiggle room instead of painting
> ourselves into the corner. We're not writing code for MCUs, we are
> working under different constraints, and, yes, reaching for a page, as
> that's the hard currency of virtualization imho. The numbers show that
> the savings are negligible per-CPU, but these savings come at the cost
> of making assumptions about what will not happen in the future, thus
> placing a bet against what the specification says.

I will not object to your preferred path of allocating the
hyperv_pcpu_output_arg when CONFIG_HYPERV_VTL_MODE is set. For me, it
has been a worthwhile discussion to explore the alternatives. I do,
however, want to get further clarification about the spec requirements
and whether our current Linux code is meeting those requirements. More
on that below. And allow me to continue the discussion about a couple
of points you raised.

>
> It's not even the hyperv code that is the largest consumer of the
> per-CPU data, not even close. Looking at the `vmlinux`'s `.data..percpu`
> section, there are almost 200 entries, and roughly one quarter is
> pointer-sized, so who really knows how much is going to be allocated
> per-CPU. The top ten statically allocated are
>
> nm -rS --size-sort ./vmlinux | grep -vF ffffffff8 | sed 10q
>
> 000000000000c000 0000000000008000 d exception_stacks
> 0000000000006000 0000000000005000 D cpu_tss_rw
> 0000000000002000 0000000000004000 D irq_stack_backing_store
> 000000000001b5c0 0000000000003180 d timer_bases
> 0000000000017000 0000000000003000 d bts_ctx
> 0000000000015520 0000000000001450 D cpu_hw_events
> 000000000000b000 0000000000001000 D gdt_page
> 0000000000014000 0000000000001000 d entry_stack_storage
> 0000000000001000 0000000000001000 D cpu_debug_store
> 0000000000021d80 0000000000000c40 D runqueues
>
> on a configuration that is the bare minimum, no fluff. We could invest
> in looking at what the cost would be of compiling out `bts_ctx` or
> `cpu_debug_store` instead of adding more if statements and making the
> code look tricky.
>
> > I agree that a hypercall could produce up to 4 KiB of output in
> > response to up to 4 KiB of input. But the guest controls how much
> > input it passes. Furthermore, for all the hypercalls I'm familiar with,
> > the specification of the hypercall tells the max amount of output it
> > will produce in response to the input. That allows the guest to
> > know how much output space it needs to allocate and provide to
> > the hypercall.
> >
> > I will concede that it's possible to envision a hypercall with a
> > specification that says "May produce up to 4 KiB of output. A header
> > at the beginning of the output says how much output was produced."
> > In that case, the guest indeed must supply a full page of output space.
> > But I don't think we have any such hypercalls now, and adding such a
> > hypercall in the future seems unlikely. Of course, if such a hypercall
> > did get added and Linux used that hypercall, Linux would need to
> > supply a full page for the hypercall output. That page wouldn't
> > necessarily need to be a pre-allocated per-vCPU hypercall output
> > page. Depending on the usage circumstances, that full page might be
> > able to be allocated on-demand.
> >
> > But assume things proceed as they are today where Linux can limit
> > the amount of hypercall output based on the input. Then I don't
> > see a violation of the contract if Linux limits the output and fits
> > it within a page that is also being shared in a non-overlapping
> > way with any hypercall input. I wouldn't allocate a per-vCPU
> > hypercall output page now for a theoretically possible
> > hypercall that doesn't exist yet.
>
> Given that:
>
> * Using the hypercall output page is plumbed throughout the dom0 code,

I see three uses in current upstream code. And then there are close to
a dozen more uses in Nuno's year-old patch set for /dev/mshv that hasn't
been accepted upstream yet. Does that roughly match what you are
referring to?

> * dom0 and the VTL mode can share code quite naturally,

Yep.

> * Not using the output page cannot account for all cases permitted by
>   the TLFS clause "don't overlap input and output, and don't cross page
>   boundaries",

This is what I don't understand.

> * Not using the output page everywhere for consistency requires updating
>   call sites of `hv_do_hypercall` and friends, i.e. every place where a
>   hypercall is made needs to incorporate the offset/pointer at which the
>   output should start, or perhaps do some shenanigans with macros,

In my musing about getting rid of hyperv_pcpu_output_arg entirely, the
call signature of hv_do_hypercall() and friends would remain unchanged.
There would still be a virtual address passed as the output argument
(which might be NULL, and is *not* required to be page aligned). The
caller of hv_do_hypercall() must only ensure that the virtual address
points to an output area that is large enough to contain the output
without crossing a page boundary. The area must also not overlap with
the input area. Given that we currently have only 3 uses of
hyperv_pcpu_output_arg in upstream code, only those would need to be
tweaked to split the hyperv_pcpu_input_arg page into an input and
output area. All other callers of hv_do_hypercall() would be unchanged.
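
To make that concrete, here is a rough sketch of what one of those call
sites could look like. This is not a patch; example_hv_call() is a
made-up name, and a real call site would size the input and output
areas based on the specific hypercall it is making:

	/*
	 * Sketch only: hv_do_hypercall()'s signature doesn't change. The
	 * caller just picks non-overlapping input and output areas within
	 * the single per-vCPU hyperv_pcpu_input_arg page.
	 */
	static u64 example_hv_call(u64 control, size_t input_size)
	{
		unsigned long flags;
		void *input, *output;
		u64 status;

		/* Disable interrupts so the per-vCPU page isn't reused under us */
		local_irq_save(flags);

		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
		/*
		 * The output area begins right after the input area in the
		 * same page. As long as input_size plus the maximum output
		 * size fits in HV_HYP_PAGE_SIZE, the two areas don't overlap
		 * and neither crosses a page boundary.
		 */
		output = input + input_size;

		/* ... fill in the hypercall input at 'input' ... */

		status = hv_do_hypercall(control, input, output);

		local_irq_restore(flags);
		return status;
	}

This is essentially the pattern that hv_pci_read_mmio() and
hv_pci_write_mmio() use today, which is why I bring them up again at
the end of this message.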

> * Not using the output page leads to negligible savings,
>
> it is hard for me to see how to make not using the hypercall output
> page look like a profitable engineering endeavor; it really comes off
> as dancing to the perf scare/feature creep tune for peanuts.

There would be some simplification in hv_common_free(), hv_common_init(),
and hv_common_cpu_init() if there were no hyperv_pcpu_output_arg. :-)

>
> In my opinion, we should use the hypercall output page for the VTL mode
> as dom0 does, to share code and reduce code churn. Had we used some
> `hv_get_registers` in both instead of `get_vtl`, which is specialized
> for no compelling practical reason imo (as it just gets a vCPU register,
> nothing more, nothing less), this code path would've been better tested,
> and none of this patching would have been needed.

FWIW, the historical context is that at the time get_vtl() was
implemented, all VP registers that Linux needed to access were available
as synthetic MSRs on the x86 side. Going back to its origins 15 years
ago, the Linux code for Hyper-V always just used the synthetic MSRs.
get_vtl() was the first time that x86 code needed the
HVCALL_GET_VP_REGISTERS hypercall, because HV_REGISTER_VSM_VP_STATUS has
no synthetic MSR equivalent. There still is no 'hv_get_registers'
function in upstream code on the x86 side. hv_get_vpreg() exists only on
the arm64 side, where there are no synthetic MSRs or equivalent. In
hindsight, hv_get_vpreg() could have been added on the x86 side, and
then get_vtl() implemented on top of that. And I made the get_vtl()
situation worse by giving Tianyu bad advice about overlapping the output
area with the input area. :-(

I once had a manager at Microsoft who said he reserved the right to wake
up smarter every day. I've had to claim that right as well from time to
time. :-)
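
Just to illustrate what I mean -- purely hypothetical, since there is no
hv_get_vpreg() on the x86 side today, and the struct and field names
below are from my recollection of the existing get_vtl() code, so they
may not match exactly -- such a helper might have looked roughly like
this, using the split input/output layout sketched earlier:

	/*
	 * Hypothetical x86 helper, sketch only. Reads a single VP register
	 * via the HVCALL_GET_VP_REGISTERS hypercall. The output area is
	 * placed after the input area in the same per-vCPU page.
	 */
	static u64 hv_get_vpreg(u32 reg_name)
	{
		u64 control = HV_HYPERCALL_REP_COMP_1 | HVCALL_GET_VP_REGISTERS;
		struct hv_get_vp_registers_input *input;
		struct hv_get_vp_registers_output *output;
		unsigned long flags;
		u64 status, result = 0;

		local_irq_save(flags);

		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
		/* Output follows the input area; no overlap, no page crossing */
		output = (void *)input + struct_size(input, element, 1);

		memset(input, 0, struct_size(input, element, 1));
		input->header.partitionid = HV_PARTITION_ID_SELF;
		input->header.vpindex = HV_VP_INDEX_SELF;
		input->header.inputvtl = 0;
		input->element[0].name0 = reg_name;

		status = hv_do_hypercall(control, input, output);
		if (hv_result_success(status))
			result = output->as64.low;

		local_irq_restore(flags);
		return result;
	}

get_vtl() would then be little more than
hv_get_vpreg(HV_REGISTER_VSM_VP_STATUS) masked with HV_X64_VTL_MASK, and
any other VP register that later needed the hypercall path would go
through the same, better-tested code. But that's water under the bridge
at this point.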

> I'd wait for a few days and then, with Wei's permission, would likely
> prefer to send this patch in v2 as-is unless some damning evidence
> presents itself.

Again, I won't object to your taking that path. But regarding compliance
with the TLFS, take a look at the code for hv_pci_read_mmio() and
hv_pci_write_mmio(). Do you think that code violates the spec? If not,
what would a violation look like in the context of this discussion? If
current code is in violation, then we should fix that.

Michael