Re: How to reserve guest physical region for ACPI

Laszlo Ersek <lersek@xxxxxxxxxx> · Wed, 6 Jan 2016 15:43:40 +0100

On 01/06/16 14:39, Igor Mammedov wrote:
> On Tue, 5 Jan 2016 18:22:33 +0100
> Laszlo Ersek <lersek@xxxxxxxxxx> wrote:
> 
>> On 01/05/16 18:08, Igor Mammedov wrote:
>>> On Mon, 4 Jan 2016 21:17:31 +0100
>>> Laszlo Ersek <lersek@xxxxxxxxxx> wrote:
>>>   
>>>> Michael CC'd me on the grandparent of the email below. I'll try to add
>>>> my thoughts in a single go, with regard to OVMF.
>>>>
>>>> On 12/30/15 20:52, Michael S. Tsirkin wrote:  
>>>>> On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:    
>>>>>> On Mon, 28 Dec 2015 14:50:15 +0200
>>>>>> "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
>>>>>>    
>>>>>>> On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:    
>>>>>>>>
>>>>>>>> Hi Michael, Paolo,
>>>>>>>>
>>>>>>>> Now it is the time to return to the challenge that how to reserve guest
>>>>>>>> physical region internally used by ACPI.
>>>>>>>>
>>>>>>>> Igor suggested that:
>>>>>>>> | An alternative place to allocate reserve from could be high memory.
>>>>>>>> | For pc we have "reserved-memory-end" which currently makes sure
>>>>>>>> | that hotpluggable memory range isn't used by firmware
>>>>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)    
>>>>
>>>> OVMF has no support for the "reserved-memory-end" fw_cfg file. The
>>>> reason is that nobody wrote that patch, nor asked for the patch to be
>>>> written. (Not implying that just requesting the patch would be
>>>> sufficient for the patch to be written.)
>>>>  
>>>>>>> I don't want to tie things to reserved-memory-end because this
>>>>>>> does not scale: next time we need to reserve memory,
>>>>>>> we'll need to find yet another way to figure out what is where.    
>>>>>> Could you elaborate a bit more on a problem you're seeing?
>>>>>>
>>>>>> To me it looks like it scales rather well.
>>>>>> For example lets imagine that we adding a device
>>>>>> that has some on device memory that should be mapped into GPA
>>>>>> code to do so would look like:
>>>>>>
>>>>>>   pc_machine_device_plug_cb(dev)
>>>>>>   {
>>>>>>    ...
>>>>>>    if (dev == OUR_NEW_DEVICE_TYPE) {
>>>>>>        memory_region_add_subregion(as, current_reserved_end, &dev->mr);
>>>>>>        set_new_reserved_end(current_reserved_end + memory_region_size(&dev->mr));
>>>>>>    }
>>>>>>   }
>>>>>>
>>>>>> we can practically add any number of new devices that way.    
>>>>>
>>>>> Yes but we'll have to build a host side allocator for these, and that's
>>>>> nasty. We'll also have to maintain these addresses indefinitely (at
>>>>> least per machine version) as they are guest visible.
>>>>> Not only that, there's no way for guest to know if we move things
>>>>> around, so basically we'll never be able to change addresses.
>>>>>
>>>>>     
>>>>>>      
>>>>>>> I would like ./hw/acpi/bios-linker-loader.c interface to be extended to
>>>>>>> support 64 bit RAM instead    
>>>>
>>>> This looks quite doable in OVMF, as long as the blob to allocate from
>>>> high memory contains *zero* ACPI tables.
>>>>
>>>> (
>>>> Namely, each ACPI table is installed from the containing fw_cfg blob
>>>> with EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable(), and the latter has its
>>>> own allocation policy for the *copies* of ACPI tables it installs.
>>>>
>>>> This allocation policy is left unspecified in the section of the UEFI
>>>> spec that governs EFI_ACPI_TABLE_PROTOCOL.
>>>>
>>>> The current policy in edk2 (= the reference implementation) seems to be
>>>> "allocate from under 4GB". It is currently being changed to "try to
>>>> allocate from under 4GB, and if that fails, retry from high memory". (It
>>>> is motivated by Aarch64 machines that may have no DRAM at all under 4GB.)
>>>> )
>>>>  
>>>>>>> (and maybe a way to allocate and
>>>>>>> zero-initialize buffer without loading it through fwcfg),    
>>>>
>>>> Sounds reasonable.
>>>>  
>>>>>>> this way bios
>>>>>>> does the allocation, and addresses can be patched into acpi.    
>>>>>> and then guest side needs to parse/execute some AML that would
>>>>>> initialize QEMU side so it would know where to write data.    
>>>>>
>>>>> Well not really - we can put it in a data table, by itself
>>>>> so it's easy to find.    
>>>>
>>>> Do you mean acpi_tb_find_table(), acpi_get_table_by_index() /
>>>> acpi_get_table_with_size()?
>>>>  
>>>>>
>>>>> AML is only needed if access from ACPI is desired.
>>>>>
>>>>>     
>>>>>> bios-linker-loader is a great interface for initializing some
>>>>>> guest owned data and linking it together but I think it adds
>>>>>> unnecessary complexity and is misused if it's used to handle
>>>>>> device owned data/on device memory in this and VMGID cases.    
>>>>>
>>>>> I want a generic interface for guest to enumerate these things.  linker
>>>>> seems quite reasonable but if you see a reason why it won't do, or want
>>>>> to propose a better interface, fine.    
>>>>
>>>> * The guest could do the following:
>>>> - while processing the ALLOCATE commands, it would make a note where in
>>>> GPA space each fw_cfg blob gets allocated
>>>> - at the end the guest would prepare a temporary array with a predefined
>>>> record format, that associates each fw_cfg blob's name with the concrete
>>>> allocation address
>>>> - it would create an FWCfgDmaAccess stucture pointing at this array,
>>>> with a new "control" bit set (or something similar)
>>>> - the guest could write the address of the FWCfgDmaAccess struct to the
>>>> appropriate register, as always.
>>>>
>>>> * Another idea would be a GET_ALLOCATION_ADDRESS linker/loader command,
>>>> specifying:
>>>> - the fw_cfg blob's name, for which to retrieve the guest-allocated
>>>>   address (this command could only follow the matching ALLOCATE
>>>>   command, never precede it)
>>>> - a flag whether the address should be written to IO or MMIO space
>>>>   (would be likely IO on x86, MMIO on ARM)
>>>> - a unique uint64_t key (could be the 16-bit fw_cfg selector value that
>>>>   identifies the blob, actually!)
>>>> - a uint64_t (IO or MMIO) address to write the unique key and then the
>>>>   allocation address to.
>>>>
>>>> Either way, QEMU could learn about all the relevant guest-side
>>>> allocation addresses in a low number of traps. In addition, AML code
>>>> wouldn't have to reflect any allocation addresses to QEMU, ever.  
>>
>>> That would be nice trick. I see 2 issues here:
>>>  1. ACPI tables blob is build atomically when one guest tries to read it
>>>     from fw_cfg so patched addresses have to be communicated
>>>     to QEMU before that.  
>>
>> I don't understand issue #1. I think it is okay if the allocation
>> happens strictly after QEMU refreshes / regenerates the ACPI payload.
>> Namely, the guest-allocated addresses have two uses:
>> - references from within the ACPI payload
> If references are from AML, then AML should be patched by linker,
> which is tricky and forces us to invent duplicate AML API that
> would be able to tell linker where AML object should be patched
> (Michael's patch in this thread as example)

Yes, such minimal AML patching is necessary.

> It would be better if linker would communicate addresses to QEMU
> before AML is built, so that AML would use already present
> in QEMU addresses and doesn't have to be patched at all.

I dislike this.

First, this would duplicate part of the linker's functionality in the host.

Second, it would lead to an ugly ping-pong between host and guest. First
QEMU has to create the full ACPI payload, with placeholder constants in
the AML. Then the guest could retrieve the *size* of the ACPI payload
(the fw_cf gblobs), and perform the allocations. Then QEMU would fix up
the AML. Then the guest would download the fw_cfg blobs. Then the guest
linker would fix up the data tables. Ugly ugly ugly.

I think Michael's and Xiao Guangrong's solutions to the minimal AML
patching (= patch named dword / qword object, or the constant return
value in a minimal method) is quite feasible.

How about this:

+------------------+            +-----------------------+
|Single DWORD      |            | 4KB system memory     |
|object or         |            | operation region      | ---------+
|DWORD-returning   |            | hosting a single      |          |
|method in AML,    | ---------> | "QEMU parameter"      | -----+   |
|to be patched with|            | structure, with       |      |   |
|Michael's or      |            | pointers, small       |      |   |
|Xiao Guangrong's  |            | scalars, and padding. |      |   |
|trick             |            | Call this QPRM ("QEMU |      |   |
+------------------+            | parameters").         |      |   |
                                +-----------------------+      |   |
                                                               |   |
                                +-----------------------+ <----+   |
                                | "NRAM" operation      |          |
                                | region for NVDIMM     |          |
                                +-----------------------+          |
                                                                   |
                                +--------------------------+       |
                                | Another operation region | <-----+
                                | for another device       |
                                +--------------------------+

                                ...

Here's the idea formulated in a sequence of steps:

(1) In QEMU, create a single DWORD object, or DWORD-returning simple
method, that we *do* patch, with Michael's or Xiao Guangrong's trick,
using the ACPI linker. This would be the *only* such trick.

(2) This object or method would provide the GPA of a 4KB fw_cfg blob.
This fw_cfg blob would start with 36 zero bytes (for reasons specific to
OVMF; let me skip those for now). The rest of the blob would carry a
structure that we would actually define in the QEMU source code, as a type.

Fields of this structure would be:
- pointers (4-byte or 8-byte)
- small scalars (like a 128-bit GUID)
- padding

This structure would be named QPRM ("QEMU parameters").

(3) We add an *unconditional*, do-nothing device to the DSDT whose
initialization function evaluates the DWORD object (or DWORD-returning
method), and writes the result (= the guest-allocated address of QPRM)
to a hard-coded IO (or MMIO) port range.

(4) This port range would be backed by a *single* MemoryRegion in QEMU,
and the address written by the guest would be stored in a global
variable (or a singleton object anyway).

(5) In the guest AML, an unconditional QPRM operation region would
overlay the blob, with fields matching the QEMU structure type.

(6) Whenever a new device is introduced in QEMU that needs a dedicated
system memory operation region in the guest (nvdimm, vmgenid), we add a
new field to QPRM.

If the required region is very small (just a few scalars, like with
vmgenid), then the field is placed directly in QPRM (with the necessary
padding).

Otherwise (see the NRAM operation region for nvdimm) we add a new fw_cfg
blob, and an ADD_POINTER command for relocating the referencing field in
QPRM.

(7) The device models in QEMU can follow the pointers in guest memory,
from the initially stashed address of QPRM, through the necessary
pointer fields in QPRM, to the final operation regions.

(8) The device model-specific AML in the guest can do the same
traversal. It can fetch the right pointer field from QPRM, and define a
new operation region (like NRAM) based on that value.

All in all this is just another layer of indirection, same as the
DataTableRegion idea, except that the parameter table would be located
by a central patched DWORD object or method, not by ACPI SDT signature /
OEM ID / OEM table ID.

If we can agree on this, I could work on the device model-independent
steps (1-5), and perhaps do (6) and (8) for vmgenid on top.

Thanks
Laszlo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html