Re: [PATCH V8 06/10] acpi: apei: panic OS with fatal error status block

"Baicar, Tyler" <tbaicar@xxxxxxxxxxxxxx> · Mon, 13 Feb 2017 15:45:44 -0700

Hello James,


On 2/9/2017 3:48 AM, James Morse wrote:
Hi Jonathan, Tyler,

On 01/02/17 17:16, Tyler Baicar wrote:
From: "Jonathan (Zhixiong) Zhang" <zjzhang@xxxxxxxxxxxxxx>

Even if an error status block's severity is fatal, the kernel does not
honor the severity level and panic.

With the firmware first model, the platform could inform the OS about a
fatal hardware error through the non-NMI GHES notification type. The OS
should panic when a hardware error record is received with this
severity.

Call panic() after CPER data in error status block is printed if
severity is fatal, before each error section is handled.

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 8756172..86c1f15 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -687,6 +689,13 @@ static int ghes_ack_error(struct acpi_hest_generic_v2 *generic_v2)
  	return rc;
  }
  
+static void __ghes_call_panic(void)
+{
+	if (panic_timeout == 0)
+		panic_timeout = ghes_panic_timeout;
+	panic("Fatal hardware error!");
+}
+
__ghes_panic() also has:
	__ghes_print_estatus(KERN_EMERG, ghes->generic, ghes->estatus);
Which prints this estatus regardless of rate limiting and cache-ing.

[ ... ]

@@ -698,6 +707,10 @@ static int ghes_proc(struct ghes *ghes)
  		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
ghes_print_estatus() uses some custom rate limiting '2 messages every 5
seconds', GHES_SEV_PANIC shares the same limit as GHES_SEV_RECOVERABLE.

I think its possible to get 2 recoverable messages, then a panic in a 5 second
window. The rate limit will kick in to stop the panic estatus block being
printed, but we still go on to call panic() without the real reason being printed...

(the caching thing only seems to consider identical messages, given we would
never see two panic messages, I don't think that will cause any problems.)

  			ghes_estatus_cache_add(ghes->generic, ghes->estatus);
  	}
+	if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
+		__ghes_call_panic();
+	}
+
I think this ghes_severity() then panic() should go above the:
	if (!ghes_estatus_cached(ghes->estatus)) {
and we should call __ghes_print_estatus() here too, to make sure the message
definitely got out!
Okay, that makes sense. If we move this up, is there a problem with 
calling __ghes_panic() instead of making the __ghes_print_estatus() and 
__ghes_call_panic() calls here? It looks like that will just add a call 
to oops_begin() and ghes_print_queued_estatus() as well, but this is 
what ghes_notify_nmi() does if the severity is panic.

Thanks,
Tyler
With that,
Reviewed-by: James Morse <james.morse@xxxxxxx>


Thanks,

James

--
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html