Re: [PATCH v14 2/9] ACPI: Add APEI GHES table generation and CPER record support

Peter Maydell <peter.maydell@xxxxxxxxxx> · Tue, 9 Jan 2018 16:51:31 +0000

On 3 January 2018 at 02:21, gengdongjiu <gengdongjiu@xxxxxxxxxx> wrote:
> On 2017/12/28 22:18, Igor Mammedov wrote:
>> On Thu, 28 Dec 2017 13:54:11 +0800
>> Dongjiu Geng <gengdongjiu@xxxxxxxxxx> wrote:
>>> For simulation purposes, we hard-code the error
>>> type to Multi-bit ECC.
>> Not sure what this is about, care to elaborate?
>
> Please see the Memory Error Record in [1], in which the "Memory Error Type" field describes the
> error type, such as Multi-bit ECC or Parity Error. Because neither KVM nor the host passes the memory
> error type to QEMU, QEMU does not know the error type for the memory section. Hence we let QEMU
> hard-code the error type as Multi-bit ECC.
>
> [1]:
> UEFI Spec 2.6 Errata A:
>
> "N.2.5 Memory Error Section"
> -----------------+---------------+--------------+-------------------------------------------+
>         Mnemonic |   Byte Offset |  Byte Length |        Description                        |
> -----------------+---------------+--------------+-------------------------------------------+
>         ........ |  ............ |  .........   |        ...........                        |
> -----------------+---------------+--------------+-------------------------------------------+
> Memory Error Type|     72        |       1      |Identifies the type of error that occurred:|
>                  |               |              | 0 – Unknown                               |
>                  |               |              | 1 – No error                              |
>                  |               |              | 2 – Single-bit ECC                        |
>                  |               |              | 3 – Multi-bit ECC                         |
>                  |               |              | 4 – Single-symbol ChipKill ECC            |
>                  |               |              | 5 – Multi-symbol ChipKill ECC             |
>                  |               |              | 6 – Master abort                          |
>                  |               |              | 7 – Target abort                          |
>                  |               |              | 8 – Parity Error                          |
>                  |               |              | 9 – Watchdog timeout                      |
>                  |               |              | 10 – Invalid address                      |
>                  |               |              | 11 – Mirror Broken                        |
>                  |               |              | 12 – Memory Sparing                       |
>                  |               |              | 13 – Scrub corrected error                |
>                  |               |              | 14 – Scrub uncorrected error              |
>                  |               |              | 15 – Physical Memory Map-out event        |
>                  |               |              | All other values reserved.                |
> -----------------+---------------+--------------+-------------------------------------------+
>         ........ |  ............ |  .........   |        ...........                        |
> -----------------+---------------+--------------+-------------------------------------------+

There's a value specified for "we don't know what the error type is",
which is "0 - Unknown". I think we should use that rather than claiming
that we have a particular type of error when we don't actually know that.
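
As a rough sketch of what that could look like (this is not the code
from the patch; the enum, macro and function names are made up for
illustration, and only the byte offset and the values come from the
table quoted above):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Two of the "Memory Error Type" values from the quoted table.
 * The identifier names are hypothetical, not taken from QEMU. */
enum cper_mem_err_type {
    CPER_MEM_ERR_UNKNOWN       = 0,
    CPER_MEM_ERR_MULTI_BIT_ECC = 3,
};

/* Byte offset of "Memory Error Type" within the Memory Error
 * Section, per the quoted table. */
#define CPER_MEM_ERR_TYPE_OFFSET 72

/* Fill in the error type of an already-allocated section buffer.
 * We do not know the real cause, so report "Unknown" instead of
 * claiming Multi-bit ECC. */
static void cper_mem_set_unknown_type(uint8_t *section, size_t len)
{
    assert(len > CPER_MEM_ERR_TYPE_OFFSET);
    section[CPER_MEM_ERR_TYPE_OFFSET] = CPER_MEM_ERR_UNKNOWN;
}
```

The only point of the sketch is that the field carries 0 rather than 3;
how the rest of the section gets built is a separate question.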

I agree with James that we don't want to report a particular type of
error to the guest anyway -- the VM is a virtual environment, and
the exact reason why the hardware/firmware/host kernel have decided
that a bit of RAM isn't usable any more doesn't matter to the guest.
We just want to report "this RAM has gone away, sorry" to it.

thanks
-- PMM