From: Roman Kisel <romank@xxxxxxxxxxxxxxxxxxx> Sent: Thursday, December 19, 2024 3:39 PM
>
> On 12/19/2024 1:37 PM, Michael Kelley wrote:
> > From: Roman Kisel <romank@xxxxxxxxxxxxxxxxxxx> Sent: Thursday, December 19, 2024 10:19 AM
> > [...]
>
> >>
> >> There will surely be more hypercall usage in VTL mode that returns
> >> data and requires the output pages as we progress with upstreaming
> >> the VTL patches. Enabling the hypercall output pages lets us fix the
> >> function in question in a very natural way, making it possible to
> >> replace it with some future `hv_get_vp_register` that would work for
> >> both dom0 and VTL mode just the same.
> >>
> >> All told, if you believe that we should make this patch a one-liner,
> >> I'll do as you suggested.
> >>
> >
> > FWIW, Roman and I had this same discussion back in August. See [1].
> > I'll add one new thought that wasn't part of the August discussion.
>
> Michael, thank you very much for helping out in finding a better
> solution!
>
> >
> > To my knowledge, the hypercalls that *may* use a full page for input
> > and/or a full page for output don't actually *require* a full page.
> > The size of the input and output depends on how many "entries" the
> > hypercall is specified to process, where "entries" could be
> > registers, memory PFNs, or whatever. I would expect the code that
> > invokes these hypercalls must already deal with the case where the
> > requested number of entries causes the input or output size to
> > exceed one page, so the code just iterates, making multiple
> > invocations of the hypercall with a "batch size" that fits in one
> > page.
>
> That is what I see in the code, too. The TLFS requires using at most a
> page's worth of data ("cannot overlap or cross page boundaries"), so
> the hypercall code must chunk up the input data appropriately.
>
> >
> > It would be perfectly reasonable to limit the batch size so that a
> > "batch" of input or output fits in a half page instead of a full
> > page, avoiding the need to allocate hyperv_pcpu_output_arg. Or if
> > the input and output sizes are not equal, use whatever input vs.
> > output partitioning of a single page makes sense for that hypercall.
> > The tradeoff, of course, is having to make the hypercall more times
> > with smaller batches. But if the hypercall is not in a hot path,
> > this might be a reasonable tradeoff to avoid the additional memory
> > allocation. Or if the hypercall processing time per "entry" is high,
> > the added overhead of more invocations with smaller batches is
> > probably negligible compared to the time spent processing the
> > entries.
>
> The hypervisor yields control back to the guest if it has spent more
> than roughly a dozen microseconds processing a hypercall and the
> processing isn't done yet. When yielding control back, the hypervisor
> doesn't advance the instruction pointer, so the guest can process
> interrupts on the vCPU (if the guest didn't mask them) and then get
> back to processing the hypercall. That helps the guest stay
> responsive if it chose to send larger batches. For smaller batches,
> one has to consider the cost of invoking the hypercall, as you point
> out. On the topic of saving CPU time, there are also fast hypercalls
> that pass data in the CPU registers.
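For concreteness, here is a minimal sketch of such a batching loop.
It assumes a hypothetical single-page hv_invoke() helper; HDR_SIZE
and struct entry are likewise made-up stand-ins for illustration, not
the kernel's actual definitions:

```
#include <stddef.h>
#include <string.h>

#define HV_PAGE_SIZE 4096u
#define HDR_SIZE     64u               /* assumed fixed input header size */

struct entry { unsigned long pfn; };   /* illustrative repetition unit */

int hv_invoke(const void *input, size_t len);  /* assumed: one page max */

int hv_process_entries(const struct entry *entries, size_t count)
{
	/* Largest batch whose header plus entries still fits in one page. */
	const size_t batch_max = (HV_PAGE_SIZE - HDR_SIZE) / sizeof(*entries);
	unsigned char input[HV_PAGE_SIZE];
	size_t done = 0;

	while (done < count) {
		size_t batch = count - done;
		int ret;

		if (batch > batch_max)
			batch = batch_max;

		/* Header setup elided; copy this batch in after the header. */
		memcpy(input + HDR_SIZE, entries + done,
		       batch * sizeof(*entries));

		ret = hv_invoke(input, HDR_SIZE + batch * sizeof(*entries));
		if (ret)
			return ret;

		done += batch;
	}
	return 0;
}
```

Capping the batch at a half page instead, to leave room for output in
the same page, would just be a change to the batch_max computation.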
> > This scheme could also be used in the existing root partition code
> > that is currently the only user of the hyperv_pcpu_output_arg.
> > I could see a valid argument being made to drop
> > hyperv_pcpu_output_arg entirely and just use smaller batches.
>
> In my view, that is a reasonable idea to cut down on memory usage.
> Here is what I get applying that general idea to this specific
> situation (and sticking to 4 KiB as the page size).
>
> We would save 4 KiB * N of memory, where N is the number of cores
> allocated to the root partition. Let us also introduce M, the amount
> of memory in KiB allocated to the root partition.
>
> Given that the order of magnitude of N is 1 (log10(N) ~= 1) and the
> order of magnitude of M is 6..7, the savings are on the order of
> 10^(1-6) = 1e-5 of the total memory, which looks rather meager to my
> eye. That might be a judgement call; I wouldn't argue it.

I would agree that the percentage savings is small. VMs often have
several hundred MiB to a few GiB of memory per vCPU, and saving a 4K
page out of that amount of memory is a small percentage. The thing to
guard against, though, is applying that logic in many different
places in Linux kernel code. :-)  The Hyper-V support in Linux
already has multiple pre-allocated per-vCPU pages, and by being a
little bit clever we might be able to avoid another one.

>
> What is worrisome is that the guest goes against the specification.
> The specification decrees that the input and output areas for the
> hypercall shall not cross page boundaries and shall not overlap.
> Hence, the hypervisor is within its rights to produce up to 4 KiB of
> output in response to up to 4 KiB of input, and we have:
>
> ```
> sizeof(input) + sizeof(output) <= 2*sizeof(page)
> ```
>
> But when the guest doesn't use the output page, we obviously have
>
> ```
> sizeof(input) + sizeof(output) <= sizeof(page)
> ```
>
> on the guest side, so the contract is broken.

I agree that a hypercall could produce up to 4 KiB of output in
response to up to 4 KiB of input. But the guest controls how much
input it passes. Furthermore, for all the hypercalls I'm familiar
with, the specification of the hypercall tells the max amount of
output it will produce in response to the input. That allows the
guest to know how much output space it needs to allocate and provide
to the hypercall.

I will concede that it's possible to envision a hypercall whose
specification says "may produce up to 4 KiB of output; a header at
the beginning of the output says how much output was produced." In
that case, the guest indeed must supply a full page of output space.
But I don't think we have any such hypercalls now, and adding such a
hypercall in the future seems unlikely. Of course, if such a
hypercall did get added and Linux used it, Linux would need to supply
a full page for the hypercall output. That page wouldn't necessarily
need to be a pre-allocated per-vCPU hypercall output page; depending
on the usage circumstances, it could be allocated on demand.

But assume things proceed as they are today, where Linux can limit
the amount of hypercall output based on the input. Then I don't see a
violation of the contract if Linux limits the output and fits it
within a page that is also shared, in a non-overlapping way, with the
hypercall input. I wouldn't allocate a per-vCPU hypercall output page
now for a theoretically possible hypercall that doesn't exist yet.
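To illustrate the kind of non-overlapping single-page layout I have
in mind, here is a sketch. The type and names are illustrative, not
existing kernel definitions, and the 50/50 split is just one possible
partitioning:

```
#include <assert.h>

#define HV_PAGE_SIZE 4096u

/*
 * One page-aligned per-vCPU page carved into non-overlapping input
 * and output regions. Each region sits entirely within the page, so
 * neither crosses a page boundary and they never overlap -- the TLFS
 * layout rules still hold, only the maximum batch sizes shrink.
 */
struct hv_arg_page {
	unsigned char input[HV_PAGE_SIZE / 2];
	unsigned char output[HV_PAGE_SIZE / 2];
};

static_assert(sizeof(struct hv_arg_page) == HV_PAGE_SIZE,
	      "input + output must fill exactly one page");
```

An asymmetric split works the same way for a hypercall whose input
and output sizes differ; only the array bounds change.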
Michael

>
> The hypervisor would need to know that the guest optimizes its memory
> usage in this way, limiting what is allowed by the specification when
> implementing any new hypercalls.
>
> >
> > Or are there hypercalls where a smaller batch size doesn't work
> > at all or is a bad tradeoff for performance reasons? I know I'm not
> > familiar with *all* the hypercalls that might be used in root
> > partition or VTL code. If there are such hypercalls, I would be
> > curious to know about them.
>
> Nothing that I could find in the specification. I don't think that
> justifies investing in specialized/special-cased functions on the
> guest side without solid evidence that more code is needed. In this
> particular case, one day, I'd love to replace `get_vtl` with one
> generic function `hv_get_vp_registers` that works for both dom0 and
> VTLs, plays by the book, and does not require much, if any,
> explanation of what is going on inside it and why. I believe this
> will make maintenance easier.
>
> >
> > Michael
> >
> > [1] https://lore.kernel.org/linux-hyperv/SN6PR02MB415759676AEF931F030430FDD4BE2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>
> --
> Thank you,
> Roman