Hi Mark,

On 18/06/18 23:18, Mark Salter wrote:
> On Mon, 2018-06-18 at 11:04 -0700, Geoff Levand wrote:
>> Thanks for all the comments, but my lack of access to an m400 platform, and
>> my lack of knowledge about the m400 limits what I can comment on and what I
>> can do.
>
> I can take another look at this on an m400 here.

Thanks!

> I don't believe it is a
> memory access to physical space with nothing attached to it.

That is what the CPER records are describing though.

> I seem to recall
> an errata with xgene-1 where such accesses cause the cpu to halt. But I could
> be misremembering that. I have no trouble believing the firmware ras code was
> untested. It is probably some boilerplate code built in before ras was supported
> in kernel.

It would be interesting to know which GHES this error is being found in, and
whether the Error Status Block points anywhere (or at an empty block) when Linux
is started from UEFI.

If there is something in the Error Status Block out of UEFI, then this must be
something triggered by UEFI, or a bug that can be fixed by UEFI clearing out the
CPER records.
https://bugzilla.redhat.com/show_bug.cgi?id=1285107 suggests Red Hat can rebuild
the UEFI firmware for this box.

If there is nothing in the Error Status Block when Linux is started, surely Linux
is doing something to cause this to happen. I'd like to find out what, as it's
probably a software bug.

(The case where disabling HEST would be the right thing to do is if there is a
bogus GHES->GAS entry in GHES.0, the access to which causes GHES.1 to be populated
with 'Access to an address not mapped to any component', which we find next.
If this is the case it would be better to check GHES entries against the UEFI
memory map to check this is memory, and it was reserved.)

> But the problem occurs early enough in boot where there can't be
> that many things that would cause a problem on m400 and not mustang so I'll
> look again.

Playing spot the difference in the dmesg, I'd check for smoke coming out of:
| acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
| xgene-gpio APMC0D14:00: X-Gene GPIO driver registered.
| pcie_pme: probe of 0000:00:00.0:pcie001 failed with error -22

If the firmware description of the GIC is wrong in some way, disabling KVM may be
worth testing too.

Thanks,

James