Re: [Qemu-devel] How to reserve guest physical region for ACPI

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 01/07/16 14:42, Igor Mammedov wrote:
> On Thu, 7 Jan 2016 12:54:30 +0200
> "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
> 
>> On Thu, Jan 07, 2016 at 11:30:25AM +0100, Igor Mammedov wrote:
>>> On Tue, 5 Jan 2016 18:43:02 +0200
>>> "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
>>>   
>>>> On Tue, Jan 05, 2016 at 05:30:25PM +0100, Igor Mammedov wrote:  
>>>>>>> bios-linker-loader is a great interface for initializing some
>>>>>>> guest owned data and linking it together but I think it adds
>>>>>>> unnecessary complexity and is misused if it's used to handle
>>>>>>> device owned data/on device memory in this and VMGID cases.      
>>>>>>
>>>>>> I want a generic interface for guest to enumerate these things.  linker
>>>>>> seems quite reasonable but if you see a reason why it won't do, or want
>>>>>> to propose a better interface, fine.
>>>>>>
>>>>>> PCI would do, too - though windows guys had concerns about
>>>>>> returning PCI BARs from ACPI.    
>>>>> There were potential issues with pSeries bootloader that treated
>>>>> PCI_CLASS_MEMORY_RAM as conventional RAM but it was fixed.
>>>>> Could you point out to discussion about windows issues?
>>>>>
>>>>> What VMGEN patches that used PCI for mapping purposes were
>>>>> stuck at, was that it was suggested to use PCI_CLASS_MEMORY_RAM
>>>>> class id but we couldn't agree on it.
>>>>>
>>>>> VMGEN v13 with full discussion is here
>>>>> https://patchwork.ozlabs.org/patch/443554/
>>>>> So to continue with this route we would need to pick some other
>>>>> driver less class id so windows won't prompt for driver or
>>>>> maybe supply our own driver stub to guarantee that no one
>>>>> would touch it. Any suggestions?    
>>>>
>>>> Pick any device/vendor id pair for which windows specifies no driver.
>>>> There's a small risk that this will conflict with some
>>>> guest but I think it's minimal.  
>>> device/vendor id pair was QEMU specific so doesn't conflicts with anything
>>> issue we were trying to solve was to prevent windows asking for driver
>>> even though it does so only once if told not to ask again.
>>>
>>> That's why PCI_CLASS_MEMORY_RAM was selected as it's generic driver-less
>>> device descriptor in INF file which matches as the last resort if
>>> there isn't any other diver that's matched device by device/vendor id pair.  
>>
>> I think this is the only class in this inf.
>> If you can't use it, you must use an existing device/vendor id pair,
>> there's some risk involved but probably not much.
> I can't wrap my head around this answer, could you rephrase it?
> 
> As far as I see we can use PCI_CLASS_MEMORY_RAM with qemu's device/vendor ids.
> In that case Windows associates it with dummy "Generic RAM controller".
> 
> The same happens with some NVIDIA cards if NVIDIA drivers are not installed,
> if we install drivers then Windows binds NVIDIA's PCI_CLASS_MEMORY_RAM with
> concrete driver that manages VRAM the way NVIDIA wants it.
> 
> So I think we can use it with low risk.
> 
> If we use existing device/vendor id pair with some driver then driver
> will fail to initialize and as minimum we would get device marked as
> not working in Device-Manager. Any way if you have in mind a concrete
> existing device/vendor id pair feel free to suggest it.
> 
>>
>>>>
>>>>   
>>>>>>
>>>>>>     
>>>>>>> There was RFC on list to make BIOS boot from NVDIMM already
>>>>>>> doing some ACPI table lookup/parsing. Now if they were forced
>>>>>>> to also parse and execute AML to initialize QEMU with guest
>>>>>>> allocated address that would complicate them quite a bit.      
>>>>>>
>>>>>> If they just need to find a table by name, it won't be
>>>>>> too bad, would it?    
>>>>> that's what they were doing scanning memory for static NVDIMM table.
>>>>> However if it were DataTable, BIOS side would have to execute
>>>>> AML so that the table address could be told to QEMU.    
>>>>
>>>> Not at all. You can find any table by its signature without
>>>> parsing AML.  
>>> yep, and then BIOS would need to tell its address to QEMU
>>> writing to IO port which is allocated statically in QEMU
>>> for this purpose and is described in AML only on guest side.  
>>
>> io ports are an ABI too but they are way easier to
>> maintain.
> It's pretty much the same as GPA addresses only it's much more limited resource.
> Otherwise one has to do the same tricks to maintain ABI.
> 
>>
>>>>
>>>>   
>>>>> In case of direct mapping or PCI BAR there is no need to initialize
>>>>> QEMU side from AML.
>>>>> That also saves us IO port where this address should be written
>>>>> if bios-linker-loader approach is used.
>>>>>     
>>>>>>     
>>>>>>> While with NVDIMM control memory region mapped directly by QEMU,
>>>>>>> respective patches don't need in any way to initialize QEMU,
>>>>>>> all they would need just read necessary data from control region.
>>>>>>>
>>>>>>> Also using bios-linker-loader takes away some usable RAM
>>>>>>> from guest and in the end that doesn't scale,
>>>>>>> the more devices I add the less usable RAM is left for guest OS
>>>>>>> while all the device needs is a piece of GPA address space
>>>>>>> that would belong to it.      
>>>>>>
>>>>>> I don't get this comment. I don't think it's MMIO that is wanted.
>>>>>> If it's backed by qemu virtual memory then it's RAM.    
>>>>> Then why don't allocate video card VRAM the same way and try to explain
>>>>> user that a guest started with '-m 128 -device cirrus-vga,vgamem_mb=64Mb'
>>>>> only has 64Mb of available RAM because of we think that on device VRAM
>>>>> is also RAM.
>>>>>
>>>>> Maybe I've used MMIO term wrongly here but it roughly reflects the idea
>>>>> that on device memory (whether it's VRAM, NVDIMM control block or VMGEN
>>>>> area) is not allocated from guest's usable RAM (as described in E820)
>>>>> but rather directly mapped in guest's GPA and doesn't consume available
>>>>> RAM as guest sees it. That's also the way it's done on real hardware.
>>>>>
>>>>> What we need in case of VMGEN ID and NVDIMM is on device memory
>>>>> that could be directly accessed by guest.
>>>>> Both direct mapping or PCI BAR do that job and we could use simple
>>>>> static AML without any patching.    
>>>>
>>>> At least with VMGEN the issue is that there's an AML method
>>>> that returns the physical address.
>>>> Then if guest OS moves the BAR (which is legal), it will break
>>>> since caller has no way to know it's related to the BAR.  
>>> I've found a following MS doc "Firmware Allocation of PCI Device Resources in Windows". It looks like when MS implemented resource rebalancing in
>>> Vista they pushed a compat change to PCI specs.
>>> That ECN is called "Ignore PCI Boot Configuration_DSM Function"
>>> and can be found here:
>>> https://pcisig.com/sites/default/files/specification_documents/ECR-Ignorebootconfig-final.pdf
>>>
>>> It looks like it's possible to forbid rebalancing per
>>> device/bridge if it has _DMS method that returns "do not
>>> ignore the boot configuration of PCI resources".  
>>
>> I'll have to study this but we don't want that
>> globally, do we?
> no need to do it globally, adding _DSM to a device, we don't wish
> to be rebalanced, should be sufficient to lock down specific resources.
> 
> actually existence of spec implies that if there is a boot configured
> device with resources described in ACPI table and there isn't _DSM
> method enabling rebalancing for it, then rebalancing is not permitted.
> It should be easy to make an experiment to verify what Windows would do.
> 
> So if this approach would work and we agree on going with it, I could work
> on redoing VMGENv13 series using _DSM as described.
> That would simplify implementing this kind of devices vs bios-linker approach i.e.:
>  - free RAM occupied by linker blob
>  - free IO port
>  - avoid 2 or 3 layers of indirection - which makes understanding of code much easier
>  - avoid runtime AML patching and simplify AML and its composing parts
>  - there won't be need for BIOS to get IO port from fw_cfg and write
>    there GPA as well no need for table lookup.
>  - much easier to write unit tests, i.e. use the same qtest device testing
>    technique without necessity of running actual guest code.
>    i.e. no binary code blobs like we have for running bios-tables test in TCG mode.

No objections on my part. If it works, it works for me!

Laszlo


> 
> 
>> This restricts hotplug functionality significantly.
>>
>>>    
>>>>>>>>
>>>>>>>> See patch at the bottom that might be handy.
>>>>>>>>       
>>>>>>>>> he also innovated a way to use 64-bit address in DSDT/SSDT.rev = 1:
>>>>>>>>> | when writing ASL one shall make sure that only XP supported
>>>>>>>>> | features are in global scope, which is evaluated when tables
>>>>>>>>> | are loaded and features of rev2 and higher are inside methods.
>>>>>>>>> | That way XP doesn't crash as far as it doesn't evaluate unsupported
>>>>>>>>> | opcode and one can guard those opcodes checking _REV object if neccesary.
>>>>>>>>> (https://lists.nongnu.org/archive/html/qemu-devel/2015-11/msg01010.html)      
>>>>>>>>
>>>>>>>> Yes, this technique works.
>>>>>>>>
>>>>>>>> An alternative is to add an XSDT, XP ignores that.
>>>>>>>> XSDT at the moment breaks OVMF (because it loads both
>>>>>>>> the RSDT and the XSDT, which is wrong), but I think
>>>>>>>> Laszlo was working on a fix for that.      
>>>>>>> Using XSDT would increase ACPI tables occupied RAM
>>>>>>> as it would duplicate DSDT + non XP supported AML
>>>>>>> at global namespace.      
>>>>>>
>>>>>> Not at all - I posted patches linking to same
>>>>>> tables from both RSDT and XSDT at some point.
>>>>>> Only the list of pointers would be different.    
>>>>> if you put XP incompatible AML in separate SSDT and link it
>>>>> only from XSDT than that would work but if incompatibility
>>>>> is in DSDT, one would have to provide compat DSDT for RSDT
>>>>> an incompat DSDT for XSDT.    
>>>>
>>>> So don't do this.  
>>> well spec says "An ACPI-compatible OS must use the XSDT if present",
>>> which I read as tables pointed by RSDT MUST be pointed by XSDT
>>> as well and RSDT MUST NOT not be used.
>>>
>>> so if we put incompatible changes in a separate SSDT and put
>>> it only in XSDT that might work. Showstopper here is OVMF which
>>> has issues with it as Laszlo pointed out.  
>>
>> But that's just a bug.
>>
>>> Also since Windows implements only subset of spec XSDT trick
>>> would cover only XP based versions while the rest will see and
>>> use XSDT pointed tables which still could have incompatible
>>> AML with some of the later windows versions.  
>>
>> We'll have to see what these are exactly.
>> If it's methods in SSDT we can check the version supported
>> by the ASPM.
> I see only VAR_PACKAGE as such object so far.
> 
> 64-bit PCI0._CRS probably won't crash 32-bit Vista and later as
> it should be able to parse 64-bit Integers as defined by ACPI 2.0.
> 
>>
>>>   
>>>>   
>>>>> So far policy was don't try to run guest OS on QEMU
>>>>> configuration that isn't supported by it.    
>>>>
>>>> It's better if guests don't see some features but
>>>> don't crash. It's not always possible of course but
>>>> we should try to avoid this.
>>>>   
>>>>> For example we use VAR_PACKAGE when running with more
>>>>> than 255 VCPUs (commit b4f4d5481) which BSODs XP.    
>>>>
>>>> Yes. And it's because we violate the spec, DSDT
>>>> should not have this stuff.
>>>>   
>>>>> So we can continue with that policy with out resorting to
>>>>> using both RSDT and XSDT,
>>>>> It would be even easier as all AML would be dynamically
>>>>> generated and DSDT would only contain AML elements for
>>>>> a concrete QEMU configuration.    
>>>>
>>>> I'd prefer XSDT but I won't nack it if you do it in DSDT.
>>>> I think it's not spec compliant but guests do not
>>>> seem to care.
>>>>   
>>>>>>> So far we've managed keep DSDT compatible with XP while
>>>>>>> introducing features from v2 and higher ACPI revisions as
>>>>>>> AML that is only evaluated on demand.
>>>>>>> We can continue doing so unless we have to unconditionally
>>>>>>> add incompatible AML at global scope.
>>>>>>>       
>>>>>>
>>>>>> Yes.
>>>>>>     
>>>>>>>>       
>>>>>>>>> Michael, Paolo, what do you think about these ideas?
>>>>>>>>>
>>>>>>>>> Thanks!      
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> So using a patch below, we can add Name(PQRS, 0x0) at the top of the
>>>>>>>> SSDT (or bottom, or add a separate SSDT just for that).  It returns the
>>>>>>>> current offset so we can add that to the linker.
>>>>>>>>
>>>>>>>> Won't work if you append the Name to the Aml structure (these can be
>>>>>>>> nested to arbitrary depth using aml_append), so using plain GArray for
>>>>>>>> this API makes sense to me.
>>>>>>>>       
>>>>>>>> --->      
>>>>>>>>
>>>>>>>> acpi: add build_append_named_dword, returning an offset in buffer
>>>>>>>>
>>>>>>>> This is a very limited form of support for runtime patching -
>>>>>>>> similar in functionality to what we can do with ACPI_EXTRACT
>>>>>>>> macros in python, but implemented in C.
>>>>>>>>
>>>>>>>> This is to allow ACPI code direct access to data tables -
>>>>>>>> which is exactly what DataTableRegion is there for, except
>>>>>>>> no known windows release so far implements DataTableRegion.      
>>>>>>> unsupported means Windows will BSOD, so it's practically
>>>>>>> unusable unless MS will patch currently existing Windows
>>>>>>> versions.      
>>>>>>
>>>>>> Yes. That's why my patch allows patching SSDT without using
>>>>>> DataTableRegion.
>>>>>>     
>>>>>>> Another thing about DataTableRegion is that ACPI tables are
>>>>>>> supposed to have static content which matches checksum in
>>>>>>> table the header while you are trying to use it for dynamic
>>>>>>> data. It would be cleaner/more compatible to teach
>>>>>>> bios-linker-loader to just allocate memory and patch AML
>>>>>>> with the allocated address.      
>>>>>>
>>>>>> Yes - if address is static, you need to put it outside
>>>>>> the table. Can come right before or right after this.
>>>>>>     
>>>>>>> Also if OperationRegion() is used, then one has to patch
>>>>>>> DefOpRegion directly as RegionOffset must be Integer,
>>>>>>> using variable names is not permitted there.      
>>>>>>
>>>>>> I am not sure the comment was understood correctly.
>>>>>> The comment says really "we can't use DataTableRegion
>>>>>> so here is an alternative".    
>>>>> so how are you going to access data at which patched
>>>>> NameString point to?
>>>>> for that you'd need a normal patched OperationRegion
>>>>> as well since DataTableRegion isn't usable.    
>>>>
>>>> For VMGENID you would patch the method that
>>>> returns the address - you do not need an op region
>>>> as you never access it.
>>>>
>>>> I don't know about NVDIMM. Maybe OperationRegion can
>>>> use the patched NameString? Will need some thought.
>>>>   
>>>>>>     
>>>>>>>       
>>>>>>>>
>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
>>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
>>>>>>>> index 1b632dc..f8998ea 100644
>>>>>>>> --- a/include/hw/acpi/aml-build.h
>>>>>>>> +++ b/include/hw/acpi/aml-build.h
>>>>>>>> @@ -286,4 +286,7 @@ void acpi_build_tables_cleanup(AcpiBuildTables *tables, bool mfre);
>>>>>>>>  void
>>>>>>>>  build_rsdt(GArray *table_data, GArray *linker, GArray *table_offsets);
>>>>>>>>  
>>>>>>>> +int
>>>>>>>> +build_append_named_dword(GArray *array, const char *name_format, ...);
>>>>>>>> +
>>>>>>>>  #endif
>>>>>>>> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
>>>>>>>> index 0d4b324..7f9fa65 100644
>>>>>>>> --- a/hw/acpi/aml-build.c
>>>>>>>> +++ b/hw/acpi/aml-build.c
>>>>>>>> @@ -262,6 +262,32 @@ static void build_append_int(GArray *table, uint64_t value)
>>>>>>>>      }
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> +/* Build NAME(XXXX, 0x0) where 0x0 is encoded as a qword,
>>>>>>>> + * and return the offset to 0x0 for runtime patching.
>>>>>>>> + *
>>>>>>>> + * Warning: runtime patching is best avoided. Only use this as
>>>>>>>> + * a replacement for DataTableRegion (for guests that don't
>>>>>>>> + * support it).
>>>>>>>> + */
>>>>>>>> +int
>>>>>>>> +build_append_named_qword(GArray *array, const char *name_format, ...)
>>>>>>>> +{
>>>>>>>> +    int offset;
>>>>>>>> +    va_list ap;
>>>>>>>> +
>>>>>>>> +    va_start(ap, name_format);
>>>>>>>> +    build_append_namestringv(array, name_format, ap);
>>>>>>>> +    va_end(ap);
>>>>>>>> +
>>>>>>>> +    build_append_byte(array, 0x0E); /* QWordPrefix */
>>>>>>>> +
>>>>>>>> +    offset = array->len;
>>>>>>>> +    build_append_int_noprefix(array, 0x0, 8);
>>>>>>>> +    assert(array->len == offset + 8);
>>>>>>>> +
>>>>>>>> +    return offset;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  static GPtrArray *alloc_list;
>>>>>>>>  
>>>>>>>>  static Aml *aml_alloc(void)
>>>>>>>>
>>>>>>>>       
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html    
>>>>   
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux