Re: [PATCH v3 1/2] cper, apei, mce: Pass x86 CPER through the MCA handling chain

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Sep 03, 2020 at 06:45:30PM -0500, Smita Koralahalli wrote:
> Linux Kernel uses ACPI Boot Error Record Table (BERT) to report fatal
> errors that occurred in a previous boot. The MCA errors in the BERT are
> reported using the x86 Processor Error Common Platform Error Record (CPER)
> format. Currently, the record prints out the raw MSR values and AMD relies
> on the raw record to provide MCA information.
> 
> Extract the raw MSR values of MCA registers from the BERT and feed it into
> the standard mce_log() function through the existing x86/MCA RAS
> infrastructure. This will result in better decoding from the EDAC MCE
> decoder or the default notifier.
> 
> The implementation is SMCA specific as the raw MCA register values are
> given in the register offset order of the MCAX address space.
> 
> Reported-by: kernel test robot <lkp@xxxxxxxxx>

What's that Reported-by for?

Pls put in [] brackets over it what the 0day robot has reported.

> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@xxxxxxx>
> ---
> Link:
> https://lkml.kernel.org/r/20200828203332.11129-2-Smita.KoralahalliChannabasappa@xxxxxxx
> 
> v3:
> 	Moved arch specific declarations from generic header file to arch
> 	specific header file.
> 	Cleaned additional declarations which are unnecessary.
> 	Included the check for context type.
> 	Added a check to verify for the first MSR address in the register
> 	layout.
> v2:
> 	Fixed build error reported by kernel test robot.
> 	Passed struct variable as function argument instead of entire struct
> ---
>  arch/x86/include/asm/acpi.h     | 11 +++++++++
>  arch/x86/include/asm/mce.h      |  3 +++
>  arch/x86/kernel/acpi/apei.c     |  9 +++++++
>  arch/x86/kernel/cpu/mce/apei.c  | 42 +++++++++++++++++++++++++++++++++
>  drivers/firmware/efi/cper-x86.c | 10 +++++---
>  5 files changed, 72 insertions(+), 3 deletions(-)

...

> diff --git a/arch/x86/kernel/acpi/apei.c b/arch/x86/kernel/acpi/apei.c
> index c22fb55abcfd..13d60a91eaa0 100644
> --- a/arch/x86/kernel/acpi/apei.c
> +++ b/arch/x86/kernel/acpi/apei.c
> @@ -43,3 +43,12 @@ void arch_apei_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
>  	apei_mce_report_mem_error(sev, mem_err);
>  #endif
>  }
> +
> +int arch_apei_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
> +{
> +	int err = -EINVAL;
> +#ifdef CONFIG_X86_MCE
> +	err = apei_mce_report_x86_error(ctx_info, lapic_id);
> +#endif
> +	return err;
> +}

Add a stub for apei_mce_report_x86_error() in
arch/x86/include/asm/mce.h, in one of the !CONFIG_X86_MCE ifdeffery
which returns -EINVAL and get rid of this ifdeffery and simply do:

	return apei_mce_report_x86_error(ctx_info, lapic_id);

here.

If you wanna fix the above apei_mce_report_mem_error() too, you can do
that in a separate patch.

> diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
> index af8d37962586..65001d342302 100644
> --- a/arch/x86/kernel/cpu/mce/apei.c
> +++ b/arch/x86/kernel/cpu/mce/apei.c
> @@ -26,6 +26,8 @@
>  
>  #include "internal.h"
>  
> +#define MASK_MCA_STATUS 0xC0002001

What does that mask mean? Either here or where it is used needs a
comment.

>  void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
>  {
>  	struct mce m;
> @@ -51,6 +53,46 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
>  }
>  EXPORT_SYMBOL_GPL(apei_mce_report_mem_error);
>  
> +int apei_mce_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
> +{
> +	const u64 *i_mce = ((const void *) (ctx_info + 1));
> +	unsigned int cpu;
> +	struct mce m;
> +
> +	if (!boot_cpu_has(X86_FEATURE_SMCA))

If this function you're adding is SMCA-specific, then its name cannot be
as generic as it is now.

> +		return -EINVAL;
> +
> +	if ((ctx_info->msr_addr & MASK_MCA_STATUS) != MASK_MCA_STATUS)
> +		return -EINVAL;
> +
> +	mce_setup(&m);
> +
> +	m.extcpu = -1;
> +	m.socketid = -1;
> +
> +	for_each_possible_cpu(cpu) {
> +		if (cpu_data(cpu).initial_apicid == lapic_id) {

I don't like that but I don't think we have a reverse mapping from LAPIC
ID to logical CPU numbers in the kernel...

> +			m.extcpu = cpu;
> +			m.socketid = cpu_data(m.extcpu).phys_proc_id;

			m.socketid = cpu_data(cpu).phys_proc_id;

> +			break;
> +		}
> +	}
> +
> +	m.apicid = lapic_id;
> +	m.bank = (ctx_info->msr_addr >> 4) & 0xFF;
> +	m.status = *i_mce;
> +	m.addr = *(i_mce + 1);
> +	m.misc = *(i_mce + 2);
> +	/* Skipping MCA_CONFIG */
> +	m.ipid = *(i_mce + 4);
> +	m.synd = *(i_mce + 5);

Is that structure after cper_ia_proc_ctx defined somewhere in the UEFI
spec so that you can cast to it directly instead of doing this ugly
pointer arithmetic?

> +
> +	mce_log(&m);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(apei_mce_report_x86_error);

Why is this function exported?

If "no reason", you can fix the above one too, with a separate commit.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette



[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux