Jonathan Cameron wrote:
> On Mon, 8 Jan 2024 18:59:16 -0800
> Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> > Ira Weiny wrote:
> > > Dan Williams wrote:
> > > > Smita Koralahalli wrote:
> > > > > On 1/8/2024 8:58 AM, Jonathan Cameron wrote:
> > > > > > On Wed, 20 Dec 2023 16:17:27 -0800
> > > > > > Ira Weiny <ira.weiny@xxxxxxxxx> wrote:
> > > > > >
> > > > > >> Series status/background
> > > > > >> ========================
> > > > > >>
> > > > > >> Smita has been a great help with this series. Thank you again!
> > > > > >>
> > > > > >> Smita's testing found that the GHES code ended up printing the events
> > > > > >> twice. This version avoids the duplicate print by calling the callback
> > > > > >> from the GHES code instead of the EFI code as suggested by Dan.
> > > > > >
> > > > > > I'm not sure this is working as intended.
> > > > > >
> > > > > > There is nothing gating the call in ghes_proc() of ghes_print_estatus()
> > > > > > and now the EFI code handling that pretty printed things is missing we get
> > > > > > the horrible kernel logging for an unknown block instead.
> > > > > >
> > > > > > So I think we need some minimal code in cper.c to match the guids then not
> > > > > > log them (on basis we are arguing there is no need for new cper records).
> > > > > > Otherwise we are in for some messy kernel logs
> > > > > >
> > > > > > Something like:
> > > > > >
> > > > > > {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > > > > > {1}[Hardware Error]: event severity: recoverable
> > > > > > {1}[Hardware Error]: Error 0, type: recoverable
> > > > > > {1}[Hardware Error]: section type: unknown, fbcd0a77-c260-417f-85a9-088b1621eba6
> > > > > > {1}[Hardware Error]: section length: 0x90
> > > > > > {1}[Hardware Error]: 00000000: 00000090 00000007 00000000 0d938086 ................
> > > > > > {1}[Hardware Error]: 00000010: 00100000 00000000 00040000 00000000 ................
> > > > > > {1}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000 ................
> > > > > > {1}[Hardware Error]: 00000030: 00000000 00000000 00000000 00000000 ................
> > > > > > {1}[Hardware Error]: 00000040: 00000000 00000000 00000000 00000000 ................
> > > > > > {1}[Hardware Error]: 00000050: 00000000 00000000 00000000 00000000 ................
> > > > > > {1}[Hardware Error]: 00000060: 00000000 00000000 00000000 00000000 ................
> > > > > > {1}[Hardware Error]: 00000070: 00000000 00000000 00000000 00000000 ................
> > > > > > {1}[Hardware Error]: 00000080: 00000000 00000000 00000000 00000000 ................
> > > > > > cxl_general_media: memdev=mem1 host=0000:10:00.0 serial=4 log=Informational : time=0 uuid=fbcd0a77-c260-417f-85a9-088b1621eba6 len=0 flags='' handle=0 related_handle=0 maint_op_class=0 : dpa=0 dpa_flags='' descriptor='' type='ECC Error' transaction_type='Unknown' channel=0 rank=0 device=0 comp_id=00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 validity_flags=''
> > > > > >
> > > > > > (I'm filling the record with 0s currently)
> > > > >
> > > > > Yeah, when I tested this, I thought it's okay for the hexdump to be there
> > > > > in dmesg from EFI as the handling is done in trace events from GHES.
> > > > >
> > > > > If we need to handle from EFI, then it would be a good reason to move
> > > > > the GUIDs out from GHES and place them in a common location for EFI/cper
> > > > > to share similar to protocol errors.
> > > >
> > > > Ah, yes, my expectation was more aligned with Jonathan's observation to
> > > > do the processing in GHES code *and* skip the processing in the CPER
> > > > code, something like:
> > > >
> > >
> > > Agreed, this was intended. I did not realize the above.
> > > >
> > > > diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> > > > index 35c37f667781..0a4eed470750 100644
> > > > --- a/drivers/firmware/efi/cper.c
> > > > +++ b/drivers/firmware/efi/cper.c
> > > > @@ -24,6 +24,7 @@
> > > >  #include <linux/bcd.h>
> > > >  #include <acpi/ghes.h>
> > > >  #include <ras/ras_event.h>
> > > > +#include <linux/cxl-event.h>
> > > >  #include "cper_cxl.h"
> > > >
> > > >  /*
> > > > @@ -607,6 +608,15 @@ cper_estatus_print_section(const char *pfx, struct acpi_hest_generic_data *gdata
> > > >  			cper_print_prot_err(newpfx, prot_err);
> > > >  		else
> > > >  			goto err_section_too_small;
> > > > +	} else if (guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID)) {
> > > > +		printk("%ssection_type: CXL General Media Error\n", newpfx);
> > >
> > > Do we want the printk's here? I did not realize that a generic event
> > > would be printed. So intention was nothing would be done on this path.
> >
> > I think we do, otherwise the kernel will say
> >
> > {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > {1}[Hardware Error]: event severity: recoverable
> > {1}[Hardware Error]: Error 0, type: recoverable
> > ...
> >
> > ...leaving the user hanging vs:
> >
> > {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > {1}[Hardware Error]: event severity: recoverable
> > {1}[Hardware Error]: Error 0, type: recoverable
> > {1}[Hardware Error]: section type: General Media Error
> >
> > ...as an indicator to go follow up with rasdaemon or whatever else is
> > doing the detailed monitoring of CXL events.
>
> Agreed. Maybe push it out to a static const table though.
> As the argument was that we shouldn't be spitting out big logs in this
> modern world, let's make it easy for people to add more entries.
>
> struct skip_me {
> 	guid_t guid;
> 	const char *name;
> };
> static const struct skip_me skip_me[] = {
> 	{ CPER_SEC_CXL_GEN_MEDIA_GUID, "CXL General Media Error" },
> 	/* etc. */
> };
>
> for (i = 0; i < ARRAY_SIZE(skip_me); i++) {
> 	if (guid_equal(sec_type, &skip_me[i].guid)) {
> 		printk("%ssection_type: %s\n", newpfx, skip_me[i].name);
> 		break;
> 	}
> }
>
> or something like that in the final else.

I like it. Any concerns with that being an -rc fixup, and moving ahead
with the base enabling for v6.8? I don't see that follow-on as a reason
to push the whole thing to v6.9.
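
For reference, here is a rough user-space sketch of the table-driven lookup being discussed, so the shape of the follow-on is concrete. It is not the kernel patch: struct demo_guid, skip_table, skip_lookup() and format_section() are hypothetical names made up for this illustration, memcmp() stands in for guid_equal(), and snprintf() stands in for printk(). The only value taken from the thread is the General Media UUID from the log output, stored here in plain textual byte order rather than the kernel's guid_t layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's guid_t: 16 raw bytes. */
struct demo_guid {
	unsigned char b[16];
};

/* One table entry: a section GUID and the short name printed for it. */
struct skip_entry {
	struct demo_guid guid;
	const char *name;
};

/*
 * fbcd0a77-c260-417f-85a9-088b1621eba6, the UUID from the quoted log.
 * Assumption: bytes are kept in textual order for simplicity; this is
 * not the in-memory layout the kernel's GUID_INIT() would produce.
 */
static const struct skip_entry skip_table[] = {
	{ { { 0xfb, 0xcd, 0x0a, 0x77, 0xc2, 0x60, 0x41, 0x7f,
	      0x85, 0xa9, 0x08, 0x8b, 0x16, 0x21, 0xeb, 0xa6 } },
	  "CXL General Media Error" },
};

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Return the name for a known section type, or NULL if it is not listed. */
static const char *skip_lookup(const struct demo_guid *sec_type)
{
	size_t i;

	for (i = 0; i < DEMO_ARRAY_SIZE(skip_table); i++) {
		if (memcmp(sec_type, &skip_table[i].guid, sizeof(*sec_type)) == 0)
			return skip_table[i].name;
	}
	return NULL;
}

/*
 * Format the one-line summary into buf instead of dumping the section.
 * Returns 1 if the section type was in the table, 0 otherwise (the
 * caller would then fall back to the generic "unknown" hexdump path).
 */
static int format_section(char *buf, size_t len, const char *pfx,
			  const struct demo_guid *sec_type)
{
	const char *name;

	assert(buf && pfx && sec_type);

	name = skip_lookup(sec_type);
	if (!name)
		return 0;
	snprintf(buf, len, "%ssection_type: %s\n", pfx, name);
	return 1;
}
```

Calling format_section() with a prefix such as "{1}[Hardware Error]: " and the table's GUID fills the buffer with the single summary line; any other GUID returns 0. Adding another event type would then be a one-line table entry.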