Hi Laszlo, thanks. On 2017/4/7 2:55, Laszlo Ersek wrote: > On 04/06/17 14:35, gengdongjiu wrote: >> Dear, Laszlo >> Thanks for your detailed explanation. >> >> On 2017/3/29 19:58, Laszlo Ersek wrote: >>> (This ought to be one of the longest address lists I've ever seen :) >>> Thanks for the CC. I'm glad Shannon is already on the CC list. For good >>> measure, I'm adding MST and Igor.) >>> >>> On 03/29/17 12:36, Achin Gupta wrote: >>>> Hi gengdongjiu, >>>> >>>> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote: >>>>> >>>>> Hi Laszlo/Biesheuvel/Qemu developer, >>>>> >>>>> Now I encounter a issue and want to consult with you in ARM64 platform, as described below: >>>>> >>>>> when guest OS happen synchronous or asynchronous abort, kvm needs >>>>> to send the error address to Qemu or UEFI through sigbus to >>>>> dynamically generate APEI table. from my investigation, there are >>>>> two ways: >>>>> >>>>> (1) Qemu get the error address, and generate the APEI table, then >>>>> notify UEFI to know this generation, then inject abort error to >>>>> guest OS, guest OS read the APEI table. >>>>> (2) Qemu get the error address, and let UEFI to generate the APEI >>>>> table, then inject abort error to guest OS, guest OS read the APEI >>>>> table. >>>> >>>> Just being pedantic! I don't think we are talking about creating the APEI table >>>> dynamically here. The issue is: Once KVM has received an error that is destined >>>> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the error >>>> into the guest OS, a CPER (Common Platform Error Record) has to be generated >>>> corresponding to the error source (GHES corresponding to memory subsystem, >>>> processor etc) to allow the guest OS to do anything meaningful with the >>>> error. So who should create the CPER is the question. >>>> >>>> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error arrives >>>> at EL3 and secure firmware (at EL3 or a lower secure exception level) is >>>> responsible for creating the CPER. ARM is experimenting with using a Standalone >>>> MM EDK2 image in the secure world to do the CPER creation. This will avoid >>>> adding the same code in ARM TF in EL3 (better for security). The error will then >>>> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM Trusted >>>> Firmware. >>>> >>>> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1 >>>> interface (as discussed with Christoffer below). So it should generate the CPER >>>> before injecting the error. >>>> >>>> This is corresponds to (1) above apart from notifying UEFI (I am assuming you >>>> mean guest UEFI). At this time, the guest OS already knows where to pick up the >>>> CPER from through the HEST. Qemu has to create the CPER and populate its address >>>> at the address exported in the HEST. Guest UEFI should not be involved in this >>>> flow. Its job was to create the HEST at boot and that has been done by this >>>> stage. >>>> >>>> Qemu folk will be able to add but it looks like support for CPER generation will >>>> need to be added to Qemu. We need to resolve this. >>>> >>>> Do shout if I am missing anything above. >>> >>> After reading this email, the use case looks *very* similar to what >>> we've just done with VMGENID for QEMU 2.9. >>> >>> We have a facility between QEMU and the guest firmware, called "ACPI >>> linker/loader", with which QEMU instructs the firmware to >>> >>> - allocate and download blobs into guest RAM (AcpiNVS type memory) -- >>> ALLOCATE command, >>> >>> - relocate pointers in those blobs, to fields in other (or the same) >>> blobs -- ADD_POINTER command, >>> >>> - set ACPI table checksums -- ADD_CHECKSUM command, >>> >>> - and send GPAs of fields within such blobs back to QEMU -- >>> WRITE_POINTER command. >>> >>> This is how I imagine we can map the facility to the current use case >>> (note that this is the first time I read about HEST / GHES / CPER): >>> >>> etc/acpi/tables etc/hardware_errors >>> ================ ========================================== >>> +-----------+ >>> +--------------+ | address | +-> +--------------+ >>> | HEST + | registers | | | Error Status | >>> + +------------+ | +---------+ | | Data Block 1 | >>> | | GHES | --> | | address | --------+ | +------------+ >>> | | GHES | --> | | address | ------+ | | CPER | >>> | | GHES | --> | | address | ----+ | | | CPER | >>> | | GHES | --> | | address | -+ | | | | CPER | >>> +-+------------+ +-+---------+ | | | +-+------------+ >>> | | | >>> | | +---> +--------------+ >>> | | | Error Status | >>> | | | Data Block 2 | >>> | | | +------------+ >>> | | | | CPER | >>> | | | | CPER | >>> | | +-+------------+ >>> | | >>> | +-----> +--------------+ >>> | | Error Status | >>> | | Data Block 3 | >>> | | +------------+ >>> | | | CPER | >>> | +-+------------+ >>> | >>> +--------> +--------------+ >>> | Error Status | >>> | Data Block 4 | >>> | +------------+ >>> | | CPER | >>> | | CPER | >>> | | CPER | >>> +-+------------+ >>> >>> (1) QEMU generates the HEST ACPI table. This table goes in the current >>> "etc/acpi/tables" fw_cfg blob. Given N error sources, there will be N >>> GHES objects in the HEST. >>> >>> (2) We introduce a new fw_cfg blob called "etc/hardware_errors". QEMU >>> also populates this blob. >>> >>> (2a) Given N error sources, the (unnamed) table of address registers >>> will contain N address registers. >>> >>> (2b) Given N error sources, the "etc/hardwre_errors" fw_cfg blob will >>> also contain N Error Status Data Blocks. >>> >>> I don't know about the sizing (number of CPERs) each Error Status Data >>> Block has to contain, but I understand it is all pre-allocated as far as >>> the OS is concerned, which matches our capabilities well. >> here I have a question. as you comment: " 'etc/hardwre_errors' fw_cfg blob will also contain N Error Status Data Blocks", >> Because the CPER numbers is not fixed, how to assign each "Error Status Data Block" size using one "etc/hardwre_errors" fw_cfg blob. >> when use one etc/hardwre_errors, will the N Error Status Data Block use one continuous buffer? as shown below. if so, maybe it not convenient for each data block size extension. >> I see the bios_linker_loader_alloc will allocate one continuous buffer for a blob(such as VMGENID_GUID_FW_CFG_FILE) >> >> /* Allocate guest memory for the Data fw_cfg blob */ >> bios_linker_loader_alloc(linker, VMGENID_GUID_FW_CFG_FILE, guid, 4096, >> false /* page boundary, high memory */); >> >> >> >> -> +--------------+ >> | HEST + | registers | | Error Status | >> + +------------+ | +---------+ | Data Block | >> | | GHES | --> | | address | --------+-->| +------------+ >> | | GHES | --> | | address | ------+ | | CPER | >> | | GHES | --> | | address | ----+ | | | CPER | >> | | GHES | --> | | address | -+ | | | | CPER | >> +-+------------+ +-+---------+ | | +---> +--------------+ >> | | | | CPER | >> | | | | CPER | >> | +-----> +--------------+ >> | | | CPER | >> +--------> +--------------+ >> | | CPER | >> | | CPER | >> | | CPER | >> +-+------------+ >> >> >> >> so how about we use separate etc/hardwre_errorsN for each Error Status status Block? then >> >> etc/hardwre_errors0 >> etc/hardwre_errors1 >> ................... >> etc/hardwre_errors10 >> (the max N is 10) >> >> >> the N can be one of below values, according to ACPI spec "Table 18-345 Hardware Error Notification Structure" >> 0 – Polled >> 1 – External Interrupt >> 2 – Local Interrupt >> 3 – SCI >> 4 – NMI >> 5 - CMCI >> 6 - MCE >> 7 - GPIO-Signal >> 8 - ARMv8 SEA >> 9 - ARMv8 SEI >> 10 - External Interrupt - GSIV > > I'm unsure if, by "not fixed", you are saying > > the number of CPER entries that fits in Error Status Data Block N is > not *uniform* across 0 <= N <= 10 [1] > > or > > the number of CPER entries that fits in Error Status Data Block N is > not *known* in advance, for all of 0 <= N <= 10 [2] > > Which one is your point? > > If [1], that's no problem; you can simply sum the individual error > status data block sizes in advance, and allocate "etc/hardware_errors" > accordingly, using the total size. > > (Allocating one shared fw_cfg blob for all status data blocks is more > memory efficient, as each ALLOCATE command will allocate whole pages > (rounded up from the actual blob size).) > > If your point is [2], then splitting the error status data blocks to > separate fw_cfg blobs makes no difference: regardless of whether we try > to place all the error status data blocks in a single fw_cfg blob, or in > separate fw_cfg blobs, the individual data block cannot be resized at OS > runtime, so there's no way to make it work. > My Point is [2]. The HEST(Hardware Error Source Table) table format is here: https://wiki.linaro.org/LEG/Engineering/Kernel/RAS/APEITables#Hardware_Error_Source_Table_.28HEST.29 Now I understand your thought. > Thanks, > Laszlo > >> >> >> >> >>> >>> (3) QEMU generates the ACPI linker/loader script for the firmware, as >>> always. >>> >>> (3a) The HEST table is part of "etc/acpi/tables", which the firmware >>> already allocates memory for, and downloads (because QEMU already >>> generates an ALLOCATE linker/loader command for it already). >>> >>> (3b) QEMU will have to create another ALLOCATE command for the >>> "etc/hardware_errors" blob. The firmware allocates memory for this blob, >>> and downloads it. >>> >>> (4) QEMU generates, in the ACPI linker/loader script for the firwmare, N >>> ADD_POINTER commands, which point the GHES."Error Status >>> Address" fields in the HEST table, to the corresponding address >>> registers in the downloaded "etc/hardware_errors" blob. >>> >>> (5) QEMU generates an ADD_CHECKSUM command for the firmware, so that the >>> HEST table is correctly checksummed after executing the N ADD_POINTER >>> commands from (4). >>> >>> (6) QEMU generates N ADD_POINTER commands for the firmware, pointing the >>> address registers (located in guest memory, in the downloaded >>> "etc/hardware_errors" blob) to the respective Error Status Data Blocks. >>> >>> (7) (This is the trick.) For this step, we need a third, write-only >>> fw_cfg blob, called "etc/hardware_errors_addr". Through that blob, the >>> firmware can send back the guest-side allocation addresses to QEMU. >>> >>> Namely, the "etc/hardware_errors_addr" blob contains N 8-byte entries. >>> QEMU generates N WRITE_POINTER commands for the firmware. >>> >>> For error source K (0 <= K < N), QEMU instructs the firmware to >>> calculate the guest address of Error Status Data Block K, from the >>> QEMU-dictated offset within "etc/hardware_errors", and from the >>> guest-determined allocation base address for "etc/hardware_errors". The >>> firmware then writes the calculated address back to fw_cfg file >>> "etc/hardware_errors_addr", at offset K*8, according to the >>> WRITE_POINTER command. >>> >>> This way QEMU will know the GPA of each Error Status Data Block. >>> >>> (In fact this can be simplified to a single WRITE_POINTER command: the >>> address of the "address register table" can be sent back to QEMU as >>> well, which already contains all Error Status Data Block addresses.) >>> >>> (8) When QEMU gets SIGBUS from the kernel -- I hope that's going to come >>> through a signalfd -- QEMU can format the CPER right into guest memory, >>> and then inject whatever interrupt (or assert whatever GPIO line) is >>> necessary for notifying the guest. >>> >>> (9) This notification (in virtual hardware) can either be handled by the >>> guest kernel stand-alone, or else the guest kernel can invoke an ACPI >>> event handler method with it (which would be in the DSDT or one of the >>> SSDTs, also generated by QEMU). The ACPI event handler method could >>> invoke the specific guest kernel driver for errror handling via a >>> Notify() operation. >>> >>> I'm attracted to the above design because: >>> - it would leave the firmware alone after OS boot, and >>> - it would leave the firmware blissfully ignorant about HEST, GHES, >>> CPER, and the like. (That's why QEMU's ACPI linker/loader was invented >>> in the first place.) >>> >>> Thanks >>> Laszlo >>> >>>>> Do you think which modules generates the APEI table is better? UEFI or Qemu? >>>>> >>>>> >>>>> >>>>> >>>>> On 2017/3/28 21:40, James Morse wrote: >>>>>> Hi gengdongjiu, >>>>>> >>>>>> On 28/03/17 13:16, gengdongjiu wrote: >>>>>>> On 2017/3/28 19:54, Achin Gupta wrote: >>>>>>>> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote: >>>>>>>>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote: >>>>>>>>>> On the host, part of UEFI is involved to generate the CPER records. >>>>>>>>>> In a guest?, I don't know. >>>>>>>>>> Qemu could generate the records, or drive some other component to do it. >>>>>>>>> >>>>>>>>> I think I am beginning to understand this a bit. Since the guet UEFI >>>>>>>>> instance is specifically built for the machine it runs on, QEMU's virt >>>>>>>>> machine in this case, they could simply agree (by some contract) to >>>>>>>>> place the records at some specific location in memory, and if the guest >>>>>>>>> kernel asks its guest UEFI for that location, things should just work by >>>>>>>>> having logic in QEMU to process error reports and populate guest memory. >>>>>>>>> >>>>>>>>> Is this how others see the world too? >>>>>>>> >>>>>>>> I think so! >>>>>>>> >>>>>>>> AFAIU, the memory where CPERs will reside should be specified in a GHES entry in >>>>>>>> the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a >>>>>>>> HEST for the guest Kernel? >>>>>>>> >>>>>>>> If so, then the question is how the guest UEFI finds out where QEMU (acting as >>>>>>>> EL3 firmware) will populate the CPERs. This could either be a contract between >>>>>>>> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU >>>>>>>> where the memory is. >>>>>>> >>>>>>> whether invoke the guest UEFI will be complex? not see the advantage. it seems x86 Qemu >>>>>>> directly generate the ACPI table, but I am not sure, we are checking the qemu >>>>>> logical. >>>>>>> let Qemu generate CPER record may be clear. >>>>>> >>>>>> At boot UEFI in the guest will need to make sure the areas of memory that may be >>>>>> used for CPER records are reserved. Whether UEFI or Qemu decides where these are >>>>>> needs deciding, (but probably not here)... >>>>>> >>>>>> At runtime, when an error has occurred, I agree it would be simpler (fewer >>>>>> components involved) if Qemu generates the CPER records. But if UEFI made the >>>>>> memory choice above they need to interact and it gets complicated again. The >>>>>> CPER records are defined in the UEFI spec, so I would expect UEFI to contain >>>>>> code to generate/parse them. >>>>>> >>>>>> >>>>>> Thanks, >>>>>> >>>>>> James >>>>>> >>>>>> >>>>>> . >>>>>> >>>>> >>> >>> >>> . >>> >> > > > . >