On 01/05/16 18:08, Igor Mammedov wrote:
> On Mon, 4 Jan 2016 21:17:31 +0100
> Laszlo Ersek <lersek@xxxxxxxxxx> wrote:
>
>> Michael CC'd me on the grandparent of the email below. I'll try to add my thoughts in a single go, with regard to OVMF.
>>
>> On 12/30/15 20:52, Michael S. Tsirkin wrote:
>>> On Wed, Dec 30, 2015 at 04:55:54PM +0100, Igor Mammedov wrote:
>>>> On Mon, 28 Dec 2015 14:50:15 +0200
>>>> "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
>>>>
>>>>> On Mon, Dec 28, 2015 at 10:39:04AM +0800, Xiao Guangrong wrote:
>>>>>>
>>>>>> Hi Michael, Paolo,
>>>>>>
>>>>>> Now it is time to return to the challenge of how to reserve a guest physical region for internal use by ACPI.
>>>>>>
>>>>>> Igor suggested:
>>>>>> | An alternative place to allocate reserve from could be high memory.
>>>>>> | For pc we have "reserved-memory-end" which currently makes sure
>>>>>> | that hotpluggable memory range isn't used by firmware
>>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg00926.html)
>>
>> OVMF has no support for the "reserved-memory-end" fw_cfg file. The reason is that nobody wrote that patch, nor asked for the patch to be written. (Not implying that just requesting the patch would be sufficient for it to be written.)
>>
>>>>> I don't want to tie things to reserved-memory-end because this does not scale: the next time we need to reserve memory, we'll need to find yet another way to figure out what is where.
>>>> Could you elaborate a bit more on the problem you're seeing?
>>>>
>>>> To me it looks like it scales rather well. For example, let's imagine that we are adding a device that has some on-device memory that should be mapped into GPA space. The code to do so would look like:
>>>>
>>>> pc_machine_device_plug_cb(dev)
>>>> {
>>>>     ...
>>>>     if (dev == OUR_NEW_DEVICE_TYPE) {
>>>>         memory_region_add_subregion(as, current_reserved_end, &dev->mr);
>>>>         set_new_reserved_end(current_reserved_end + memory_region_size(&dev->mr));
>>>>     }
>>>> }
>>>>
>>>> We can practically add any number of new devices that way.
>>>
>>> Yes, but we'll have to build a host-side allocator for these, and that's nasty. We'll also have to maintain these addresses indefinitely (at least per machine version), as they are guest visible. Not only that, there's no way for the guest to know if we move things around, so basically we'll never be able to change the addresses.
>>>
>>>>> I would like the ./hw/acpi/bios-linker-loader.c interface to be extended to support 64-bit RAM instead
>>
>> This looks quite doable in OVMF, as long as the blob to allocate from high memory contains *zero* ACPI tables.
>>
>> (
>> Namely, each ACPI table is installed from the containing fw_cfg blob with EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable(), and the latter has its own allocation policy for the *copies* of ACPI tables it installs.
>>
>> This allocation policy is left unspecified in the section of the UEFI spec that governs EFI_ACPI_TABLE_PROTOCOL.
>>
>> The current policy in edk2 (= the reference implementation) seems to be "allocate from under 4GB". It is currently being changed to "try to allocate from under 4GB, and if that fails, retry from high memory". (This is motivated by AArch64 machines that may have no DRAM at all under 4GB.)
>> )
>>
>>>>> (and maybe a way to allocate and zero-initialize a buffer without loading it through fw_cfg),
>>
>> Sounds reasonable.
>>
>>>>> this way the BIOS does the allocation, and the addresses can be patched into the ACPI tables.
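
To make the allocate-and-patch flow just described concrete, here is a small, self-contained C sketch of what guest-side processing of an ADD_POINTER-style command amounts to. The structure layouts, helper names, and the "etc/some-buffer" blob name are simplified illustrations for this discussion, not QEMU's actual bios-linker-loader definitions:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified command layout; QEMU's real one lives in
 * hw/acpi/bios-linker-loader.c and differs in detail. */
struct add_pointer_cmd {
    const char *dest_file;  /* blob containing the pointer to patch   */
    const char *src_file;   /* blob whose address gets written        */
    uint32_t    offset;     /* offset of the pointer inside dest_file */
    uint8_t     size;       /* pointer width: 1, 2, 4 or 8 bytes      */
};

/* Toy allocation registry: blob name -> address the guest placed it at. */
struct allocation {
    const char *file;
    uint8_t    *addr;
};

static uint8_t *find_allocated(const struct allocation *allocs, int n,
                               const char *file)
{
    for (int i = 0; i < n; i++) {
        if (strcmp(allocs[i].file, file) == 0) {
            return allocs[i].addr;
        }
    }
    return NULL;
}

static void do_add_pointer(const struct allocation *allocs, int n,
                           const struct add_pointer_cmd *cmd)
{
    uint8_t *dest = find_allocated(allocs, n, cmd->dest_file);
    uint8_t *src  = find_allocated(allocs, n, cmd->src_file);
    uint64_t val  = 0;

    /* Read the pointer field, add the allocation address, write it
     * back: this is the patching step the thread refers to. */
    memcpy(&val, dest + cmd->offset, cmd->size);
    val += (uint64_t)(uintptr_t)src;
    memcpy(dest + cmd->offset, &val, cmd->size);
}

int main(void)
{
    /* Pretend the guest has downloaded and placed two fw_cfg blobs. */
    static uint8_t table_blob[16]; /* holds a 64-bit pointer at offset 8 */
    static uint8_t data_blob[64];  /* buffer the pointer should reference */

    struct allocation allocs[] = {
        { "etc/acpi/tables", table_blob },
        { "etc/some-buffer", data_blob  },
    };
    struct add_pointer_cmd cmd = { "etc/acpi/tables", "etc/some-buffer", 8, 8 };

    do_add_pointer(allocs, 2, &cmd);

    uint64_t patched;
    memcpy(&patched, table_blob + 8, sizeof(patched));
    printf("patched pointer: 0x%llx\n", (unsigned long long)patched);
    return 0;
}

The point of the scheme is visible in do_add_pointer(): the guest chooses the addresses, and the blobs are patched afterwards, so QEMU never has to hard-code guest physical addresses per machine version.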
>>>> and then the guest side needs to parse/execute some AML that would initialize the QEMU side, so it would know where to write data.
>>>
>>> Well, not really - we can put it in a data table by itself, so it's easy to find.
>>
>> Do you mean acpi_tb_find_table(), acpi_get_table_by_index() / acpi_get_table_with_size()?
>>
>>> AML is only needed if access from ACPI is desired.
>>>
>>>> bios-linker-loader is a great interface for initializing some guest-owned data and linking it together, but I think it adds unnecessary complexity and is misused if it's used to handle device-owned data / on-device memory, as in this case and the VMGID case.
>>>
>>> I want a generic interface for the guest to enumerate these things. The linker seems quite reasonable, but if you see a reason why it won't do, or want to propose a better interface, fine.
>>
>> * The guest could do the following:
>>   - while processing the ALLOCATE commands, it would make a note where in GPA space each fw_cfg blob gets allocated;
>>   - at the end, the guest would prepare a temporary array with a predefined record format that associates each fw_cfg blob's name with the concrete allocation address;
>>   - it would create an FWCfgDmaAccess structure pointing at this array, with a new "control" bit set (or something similar);
>>   - the guest could write the address of the FWCfgDmaAccess struct to the appropriate register, as always.
>>
>> * Another idea would be a GET_ALLOCATION_ADDRESS linker/loader command, specifying:
>>   - the fw_cfg blob's name for which to retrieve the guest-allocated address (this command could only follow the matching ALLOCATE command, never precede it);
>>   - a flag whether the address should be written to IO or MMIO space (likely IO on x86, MMIO on ARM);
>>   - a unique uint64_t key (this could be the 16-bit fw_cfg selector value that identifies the blob, actually!);
>>   - a uint64_t (IO or MMIO) address to write the unique key and then the allocation address to.
>>
>> Either way, QEMU could learn about all the relevant guest-side allocation addresses in a low number of traps. In addition, AML code wouldn't have to reflect any allocation addresses to QEMU, ever.
> That would be a nice trick. I see two issues here:
> 1. The ACPI tables blob is built atomically when a guest tries to read it from fw_cfg, so the patched addresses have to be communicated to QEMU before that.

I don't understand issue #1. I think it is okay if the allocation happens strictly after QEMU refreshes / regenerates the ACPI payload.

Namely, the guest-allocated addresses have two uses:
- references from within the ACPI payload;
- references from the QEMU side, for device operation.

The first purpose is covered by the linker/loader itself (that is, GET_ALLOCATION_ADDRESS would be used *in addition* to ADD_POINTER). The second purpose would be covered by GET_ALLOCATION_ADDRESS.

> 2. More importantly, I think that we are misusing the linker/loader interface here, trying to allocate a buffer from guest RAM and thereby consuming it, while all we need is a window into device memory, mapped somewhere outside of the RAM-occupied address space.

But, more importantly, I definitely see your point with issue #2. I'm neutral on the question whether this should be solved with the ACPI linker/loader or with something else. I'm perfectly fine with "something else", as long as it is generic enough.

The above GET_ALLOCATION_ADDRESS idea is relevant *only if* the ACPI linker/loader is deemed the best solution here. (A rough sketch of such a command record follows below.)
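
For illustration only, the GET_ALLOCATION_ADDRESS record proposed above might be laid out as follows, loosely following the fixed-size record style of the existing linker/loader commands. Every name here, and the command value itself, is hypothetical, invented for this sketch:

#include <stdint.h>

/* Hypothetical GET_ALLOCATION_ADDRESS linker/loader command record;
 * not part of QEMU. The command value assumes the next free slot
 * after the existing ALLOCATE / ADD_POINTER / ADD_CHECKSUM commands. */
#define LOADER_CMD_GET_ALLOCATION_ADDRESS 0x4

#define GET_ALLOC_WRITE_IO   0 /* write key + address to an IO port */
#define GET_ALLOC_WRITE_MMIO 1 /* write key + address to MMIO       */

struct loader_get_allocation_address {
    uint32_t command;     /* LOADER_CMD_GET_ALLOCATION_ADDRESS */
    char     file[56];    /* fw_cfg blob name; must follow the matching
                           * ALLOCATE command, never precede it */
    uint8_t  write_mode;  /* GET_ALLOC_WRITE_IO or GET_ALLOC_WRITE_MMIO */
    uint64_t key;         /* unique key, e.g. the blob's 16-bit fw_cfg
                           * selector widened to 64 bits */
    uint64_t write_addr;  /* IO/MMIO address to write the key and then
                           * the allocation address to */
} __attribute__((packed));

On encountering such a record, the guest would look up the blob's allocation address (noted while processing the earlier ALLOCATE command) and write the key, followed by that address, to write_addr - giving QEMU the guest-chosen GPA in a small number of traps, exactly as described above.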
(Heck, if the linker/loader avenue is rejected here, that's the least work for me! :))

Thanks
Laszlo

>>> PCI would do, too - though the Windows guys had concerns about returning PCI BARs from ACPI.
>>>
>>>> There was an RFC on the list to make the BIOS boot from NVDIMM, already doing some ACPI table lookup/parsing. Now if those patches were forced to also parse and execute AML to initialize QEMU with a guest-allocated address, that would complicate them quite a bit.
>>>
>>> If they just need to find a table by name, it won't be too bad, will it?
>>>
>>>> While with the NVDIMM control memory region mapped directly by QEMU, the respective patches don't need to initialize QEMU in any way; all they need to do is read the necessary data from the control region.
>>>>
>>>> Also, using bios-linker-loader takes away some usable RAM from the guest, and in the end that doesn't scale: the more devices I add, the less usable RAM is left for the guest OS, while all the device needs is a piece of GPA space that would belong to it.
>>>
>>> I don't get this comment. I don't think it's MMIO that is wanted. If it's backed by QEMU virtual memory, then it's RAM.
>>>
>>>>> See the patch at the bottom; it might be handy.
>>
>> I've given up on Microsoft implementing DataTableRegion. (It's sad, really.)
>>
>> From last year I have a WIP version of "docs/vmgenid.txt" that is based on Michael's build_append_named_dword() function. If GET_ALLOCATION_ADDRESS above looks good, then I could simplify the ACPI stuff in that text file (and hopefully post it soon after for comments?)
>>
>>>>>> He also innovated a way to use 64-bit addresses in DSDT/SSDT.rev = 1:
>>>>>> | when writing ASL one shall make sure that only XP-supported
>>>>>> | features are in global scope, which is evaluated when tables
>>>>>> | are loaded, and features of rev2 and higher are inside methods.
>>>>>> | That way XP doesn't crash, as long as it doesn't evaluate unsupported
>>>>>> | opcodes, and one can guard those opcodes by checking the _REV object if necessary.
>>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)
>>>>>
>>>>> Yes, this technique works.
>>
>> Agreed.
>>
>>>>> An alternative is to add an XSDT; XP ignores that. The XSDT at the moment breaks OVMF (because it loads both the RSDT and the XSDT, which is wrong), but I think Laszlo was working on a fix for that.
>>
>> We have to distinguish two use cases here.
>>
>> * The first is the case when QEMU prepares both an XSDT and an RSDT, and links at least one common ACPI table from both. This would cause OVMF to pass the same source (= to-be-copied) table to EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable() twice, with one of the following outcomes:
>>
>>   - there would be two instances of the same table (think e.g. SSDT), or
>>   - the second attempt would be rejected (e.g. FADT), and that error would terminate the linker/loader procedure.
>>
>> This issue would not be too hard to overcome, with a simple "memoization" technique. After the initial loading & linking of the tables, OVMF could remember the addresses of the "source" ACPI tables, and could avoid passing already installed source tables to EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable() a second time.
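
A compilable sketch of that memoization idea (illustrative only, not actual OVMF code) could be as simple as remembering each source-table pointer the first time it is seen:

#include <stdbool.h>
#include <stdio.h>

#define MAX_TABLES 128

/* Source tables already passed to
 * EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable(). */
static const void *installed[MAX_TABLES];
static int installed_count;

/* Return true if this source table was seen before; otherwise record
 * it and return false, so the caller installs it exactly once. */
static bool already_installed(const void *source_table)
{
    for (int i = 0; i < installed_count; i++) {
        if (installed[i] == source_table) {
            return true;
        }
    }
    if (installed_count < MAX_TABLES) {
        installed[installed_count++] = source_table;
    }
    return false;
}

int main(void)
{
    char ssdt[16]; /* stand-in for a source ACPI table inside a blob */

    /* First sighting (say, linked from the RSDT): install it. */
    if (!already_installed(ssdt)) {
        printf("installing table\n");
    }
    /* Second sighting (linked again from the XSDT): skipped. */
    if (!already_installed(ssdt)) {
        printf("this line is never reached\n");
    }
    return 0;
}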
>> * The second use case is when an ACPI table is linked *only* from QEMU's XSDT. This is much harder to fix, because EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable() in edk2 links the copy of the passed-in table into *both* the RSDT and the XSDT, automatically. And, again, the UEFI spec doesn't provide a way to control this from the caller (i.e. from within OVMF).
>>
>> I have tried earlier to effect a change in the specification of EFI_ACPI_TABLE_PROTOCOL.InstallAcpiTable(), on the ASWG and USWG mailing lists. (At that time I was trying to expose to the caller the UEFI memory *type* from which the copy of the ACPI table being installed should be allocated.) Alas, I received no answers at all.
>>
>> All in all, I strongly recommend the "place rev2+ objects in method scope" trick over the "link it from the XSDT only" trick.
>>
>>>> Using an XSDT would increase the RAM occupied by the ACPI tables, as it would duplicate the DSDT + the non-XP-supported AML at global namespace.
>>>
>>> Not at all - I posted patches linking to the same tables from both the RSDT and the XSDT at some point.
>>
>> Yes, at <http://thread.gmane.org/gmane.comp.emulators.qemu/342559>. This could be made to work in OVMF with the above-mentioned memoization stuff.
>>
>>> Only the list of pointers would be different.
>>
>> I don't recommend that; see the second case above.
>>
>> Thanks
>> Laszlo
>>
>>>> So far we've managed to keep the DSDT compatible with XP while introducing features from v2 and higher ACPI revisions as AML that is only evaluated on demand. We can continue doing so unless we have to unconditionally add incompatible AML at global scope.
>>>
>>> Yes.
>>>
>>>>>> Michael, Paolo, what do you think about these ideas?
>>>>>>
>>>>>> Thanks!
>>>>>
>>>>> So using the patch below, we can add Name(PQRS, 0x0) at the top of the SSDT (or at the bottom, or in a separate SSDT just for that). It returns the current offset, so we can add that to the linker.
>>>>>
>>>>> It won't work if you append the Name to the Aml structure (these can be nested to arbitrary depth using aml_append), so using a plain GArray for this API makes sense to me.
>>>>>
>>>>> --->
>>>>>
>>>>> acpi: add build_append_named_dword, returning an offset in buffer
>>>>>
>>>>> This is a very limited form of support for runtime patching - similar in functionality to what we can do with the ACPI_EXTRACT macros in Python, but implemented in C.
>>>>>
>>>>> This is to allow ACPI code direct access to data tables - which is exactly what DataTableRegion is there for, except that no known Windows release so far implements DataTableRegion.
>>>> "Unsupported" means Windows will BSOD, so it's practically unusable unless MS patches currently existing Windows versions.
>>>
>>> Yes. That's why my patch allows patching the SSDT without using DataTableRegion.
>>>
>>>> Another thing about DataTableRegion is that ACPI tables are supposed to have static content which matches the checksum in the table header, while you are trying to use it for dynamic data. It would be cleaner/more compatible to teach bios-linker-loader to just allocate memory and patch the AML with the allocated address.
>>>
>>> Yes - if the address is static, you need to put it outside the table. It can come right before or right after it.
>>>
>>>> Also, if OperationRegion() is used, then one has to patch DefOpRegion directly, as RegionOffset must be an Integer; using variable names is not permitted there.
>>>
>>> I am not sure the comment was understood correctly.
>>> The comment really says "we can't use DataTableRegion, so here is an alternative".
>>>
>>>>> Signed-off-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
>>>>>
>>>>> ---
>>>>>
>>>>> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
>>>>> index 1b632dc..f8998ea 100644
>>>>> --- a/include/hw/acpi/aml-build.h
>>>>> +++ b/include/hw/acpi/aml-build.h
>>>>> @@ -286,4 +286,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre);
>>>>>  void
>>>>>  build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets);
>>>>>
>>>>> +int
>>>>> +build_append_named_dword(GArray *array, const char *name_format, ...);
>>>>> +
>>>>>  #endif
>>>>> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
>>>>> index 0d4b324..7f9fa65 100644
>>>>> --- a/hw/acpi/aml-build.c
>>>>> +++ b/hw/acpi/aml-build.c
>>>>> @@ -262,6 +262,33 @@ static void build_append_int(GArray *table, uint64_t value)
>>>>>      }
>>>>>  }
>>>>>
>>>>> +/* Build NAME(XXXX, 0x0) where 0x0 is encoded as a dword,
>>>>> + * and return the offset to 0x0 for runtime patching.
>>>>> + *
>>>>> + * Warning: runtime patching is best avoided. Only use this as
>>>>> + * a replacement for DataTableRegion (for guests that don't
>>>>> + * support it).
>>>>> + */
>>>>> +int
>>>>> +build_append_named_dword(GArray *array, const char *name_format, ...)
>>>>> +{
>>>>> +    int offset;
>>>>> +    va_list ap;
>>>>> +
>>>>> +    build_append_byte(array, 0x08); /* NameOp */
>>>>> +    va_start(ap, name_format);
>>>>> +    build_append_namestringv(array, name_format, ap);
>>>>> +    va_end(ap);
>>>>> +
>>>>> +    build_append_byte(array, 0x0C); /* DWordPrefix */
>>>>> +
>>>>> +    offset = array->len;
>>>>> +    build_append_int_noprefix(array, 0x0, 4);
>>>>> +    assert(array->len == offset + 4);
>>>>> +
>>>>> +    return offset;
>>>>> +}
>>>>> +
>>>>>  static GPtrArray *alloc_list;
>>>>>
>>>>>  static Aml *aml_alloc(void)
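
For readers unfamiliar with the AML encoding involved in the patch above: NAME(PQRS, 0x0) with a dword-encoded integer is just a short, fixed byte sequence, which is what makes the returned patch offset reliable. The standalone sketch below (independent of QEMU) prints those bytes and the offset of the patchable dword:

#include <stdio.h>

/* The AML bytes for NAME(PQRS, 0x0) with the value encoded as a
 * dword, per the ACPI spec's NameOp / DWordPrefix encodings. */
int main(void)
{
    unsigned char aml[] = {
        0x08,                   /* NameOp                           */
        'P', 'Q', 'R', 'S',     /* NameSeg: the 4-character name    */
        0x0C,                   /* DWordPrefix                      */
        0x00, 0x00, 0x00, 0x00, /* 32-bit zero, patched at runtime  */
    };
    /* The patchable dword starts right after the prefix byte; this is
     * the offset a helper like build_append_named_dword() returns. */
    unsigned patch_offset = 6;

    for (unsigned i = 0; i < sizeof(aml); i++) {
        printf("%02x ", aml[i]);
    }
    printf("\n(patchable dword at offset %u)\n", patch_offset);
    return 0;
}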