Hi James, >-----Original Message----- >From: linux-pci-owner@xxxxxxxxxxxxxxx [mailto:linux-pci- >owner@xxxxxxxxxxxxxxx] On Behalf Of James Morse >Sent: 08 April 2020 11:03 >To: Borislav Petkov <bp@xxxxxxxxx>; Shiju Jose <shiju.jose@xxxxxxxxxx> >Cc: linux-acpi@xxxxxxxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx; linux- >kernel@xxxxxxxxxxxxxxx; rjw@xxxxxxxxxxxxx; helgaas@xxxxxxxxxx; >lenb@xxxxxxxxxx; tony.luck@xxxxxxxxx; gregkh@xxxxxxxxxxxxxxxxxxx; >zhangliguang@xxxxxxxxxxxxxxxxx; tglx@xxxxxxxxxxxxx; Linuxarm ><linuxarm@xxxxxxxxxx>; Jonathan Cameron ><jonathan.cameron@xxxxxxxxxx>; tanxiaofei <tanxiaofei@xxxxxxxxxx>; >yangyicong <yangyicong@xxxxxxxxxx> >Subject: Re: [PATCH v6 1/2] ACPI / APEI: Add support to notify the vendor >specific HW errors > >Hi Boris, Shiju, > >Sorry for not spotting this reply earlier: Its in-reply to v1, so gets buried. I will resend the v7 patch solving this issue. I guess the remaining questions here are for Boris. May be can we discuss your comments with V7 patch, which I will send? > >On 27/03/2020 18:22, Borislav Petkov wrote: >> On Wed, Mar 25, 2020 at 04:42:22PM +0000, Shiju Jose wrote: >>> Presently APEI does not support reporting the vendor specific HW >>> errors, received in the vendor defined table entries, to the vendor >>> drivers for any recovery. >>> >>> This patch adds the support to register and unregister the >> >> Avoid having "This patch" or "This commit" in the commit message. It >> is tautologically useless. >> >> Also, do >> >> $ git grep 'This patch' Documentation/process >> >> for more details. >> >>> error handling function for the vendor specific HW errors and notify >>> the registered kernel driver. > >>> @@ -526,10 +552,17 @@ static void ghes_do_proc(struct ghes *ghes, >>> log_arm_hw_error(err); >>> } else { >>> void *err = acpi_hest_get_payload(gdata); >>> + u8 error_handled = false; >>> + int ret; >>> + >>> + ret = >atomic_notifier_call_chain(&ghes_event_notify_list, 0, >>> +gdata); >> >> Well, this is a notifier with standard name for a non-standard event. >> Not optimal. >> >> Why does only this event need a notifier? Because your driver is >> interested in only those events? > >Its the 'else' catch-all for stuff drivers/acpi/apei doesn't know to handle. > >In this case its because its a vendor specific GUID that only the vendor driver >knows how to parse. > > >>> + if (ret & NOTIFY_OK) >>> + error_handled = true; >>> >>> log_non_standard_event(sec_type, fru_id, fru_text, >>> sec_sev, err, >>> - gdata->error_data_length); >>> + gdata->error_data_length, >>> + error_handled); >> >> What's that error_handled thing for? That's just silly. >> >> Your notifier returns NOTIFY_STOP when it has queued the error. If you >> don't want to log it, just test == NOTIFY_STOP and do not log it then. > >My thinking for this being needed was so user-space consumers of those >tracepoints keep working. Otherwise you upgrade, get this feature, and your >user-space counters stop working. > >You'd need to know this error source was now managed by an in-kernel >driver, which may report the errors somewhere else... > > >> Then your notifier callback is queuing the error into a kfifo for >> whatever reason and then scheduling a workqueue to handle it in user >> context... >> >> So I'm thinking that it would be better if you: >> >> * make that kfifo generic and part of ghes.c and queue all types of >> error records into it in ghes_do_proc() - not just the non-standard >> ones. > >Move the drop to process context into ghes.c? This should result in less code. > >I asked for this hooking to only be for the 'catch all' don't-know case so that >we don't get drivers trying to hook and handle memory errors. (if we ever >wanted that, it should be from part of memory_failure() so it catches all the >ways of reporting memory-failure) 32bit arm has prior in this area. > > >> * then, when you're done queuing, you kick a workqueue. >> >> * that workqueue runs a normal, blocking notifier to which drivers >> register. >> >> Your driver can register to that notifier too and do the normal >> handling then and not have this ad-hoc, semi-generic, semi-vendor-specific >thing. > >As long as we don't walk a list of things that might handle a memory-error, >and have some random driver try and NOTIFY_STOP it.... > >aer_recover_queue() would be replaced by this. memory_failure_queue() has >one additional caller in drivers/ras/cec.c. > > >Thanks, > >James Thanks, Shiju