On Tue, Jan 29, 2019 at 06:48:33PM +0000, James Morse wrote: > If firmware has never generated CPER records, so it has never written to void > *error_status_address, yes. I guess this is the bit of information I was missing. > There seem to be two ways of doing this. This zero check implies an example > system could be: > | g->error_status_address == 0xf00d > | *(u64 *)0xf00d == 0 > Firmware populates CPER records, then updates 0xf00d. > (0xf00d would have been pre-mapped by apei_map_generic_address() in ghes_new()) > Reads of 0xf00d before CPER records are generated get 0. Ok, this sounds like the polled case. FW better have a record ready before raising the NMI. > Once an error occurs, this system now looks like this: > | g->error_status_address == 0xf00d > | *(u64 *)0xf00d == 0xbeef > | *(u64 *)0xbeef == 0 > > For new errors, firmware populates CPER records, then updates 0xf00d. > Alternatively firmware could re-use the memory at 0xbeef, generating the CPER > records backwards, so that once 0xbeef is updated, the rest of the record is > visible. (firmware knows not to race with another CPU right?) Thanks for the comic relief. :-P > Firmware could equally point 0xf00d at 0xbeef at startup, so it has one fewer > values to write when an error occurs. I have an arm64 system with a HEST that > does this. (I'm pretty sure its ACPI support is a copy-and-paste from x86, it > even describes NOTIFY_NMI, who knows what that means on arm!) Oh the fun. > When linux processes an error, ghes_clear_estatus() NULLs the > estatus->block_status, (which in this example is at 0xbeef). This is the > documented sequence for GHESv2. > Elsewhere the spec talks of checking the block status which is part of the > records, (not the error_status_address, which is the pointer to the records). > > Linux can't NULL 0xf00d, because it doesn't know if firmware will write it again > next time it updates the records. > I can't find where in the spec it says the error status address is written to. > Linux works with both 'at boot' and 'on each error'. > If it were know to have a static value, ghes_copy_tofrom_phys() would not have > been necessary, but its been there since d334a49113a4. > > In the worst case, if there is a value at the error_status_address, we have to > map/unmap it every time we poll in case firmware wrote new records at that same > location. > > I don't think we can change Linux's behaviour here, without interpreting zero as > CPER records or missing new errors. Nah, I was simply trying to figure out why we do that buf_paddr check. Thanks for the extensive clarification. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply.