ARC compact700 NPS platform - EZ_MachineCheck exception handler

There are two cases to consider for this exception:

> but others can't so continuing despite it is recipe for disaster. Perhaps your chip
> has some spurious Machine check exceptions ?

1. Except for core 0, which runs the Linux OS, all other cores run
packet-processing code in ZOL isolation mode. If any of these cores hits the compact700
0x20 exception, it is logical to assume all other cores will hit it too.
In any case, we will eventually have to reset the HW and reboot the system.
Still, it might be beneficial for the user to try to collect more info for debugging the issue, even if it's
a disaster for the system.

> Hmm, but you have to explain why those machine checks are fine !

2. The ARC compact700 instruction set was extended to support fast DMA
operations to various added HW accelerators, with new asm ops to support network
packet processing.
On error, some of these instructions are wired to the 0x20 exception.
There is a HW mechanism that partitions the DDR between the Linux OS and the various accelerators.
This mechanism is unaware of the MMU or virtual memory handling.
When an accelerator accesses memory outside its bounds, this exception is hit,
but there is no risk to system stability. A user signal handler can catch it, allowing easier
debugging.
This is one example.



> >   1:
> >   	FAKE_RET_FROM_EXCPN
> 
> You don't need this.

When FAKE_RET_FROM_EXCPN is removed, the first EV_MachineCheck exception
causes the core running that thread to stall.
If it is kept, multiple exceptions are handled and the system seems healthy.

Please note that the exception is generated by accessing an NPS accelerator
address which is outside its memory space, so no harm to the system is expected.


> Next time please send a real patch so I know right away what was changed.
My apologies, here is the patch based on linux-4.16.10

diff -uprN linux-4.16.10/arch/arc/kernel/entry.S linux/arch/arc/kernel/entry.S
--- linux-4.16.10/arch/arc/kernel/entry.S	2018-05-19 11:19:37.000000000 +0300
+++ linux/arch/arc/kernel/entry.S	2018-05-22 14:12:18.065103918 +0300
@@ -106,13 +106,9 @@ ENTRY(EV_MachineCheck)
 	b       ret_from_exception
 
 1:
-	; DEAD END: can't do much, display Regs and HALT
-	SAVE_CALLEE_SAVED_USER
-
-	GET_CURR_TASK_FIELD_PTR   TASK_THREAD, r10
-	st  sp, [r10, THREAD_CALLEE_REG]
-
-	j  do_machine_check_fault
+	FAKE_RET_FROM_EXCPN
+	bl		do_machine_check
+	b       ret_from_exception
 
 END(EV_MachineCheck)
 
diff -uprN linux-4.16.10/arch/arc/kernel/traps.c linux/arch/arc/kernel/traps.c
--- linux-4.16.10/arch/arc/kernel/traps.c	2018-05-19 11:19:37.000000000 +0300
+++ linux/arch/arc/kernel/traps.c	2018-05-22 14:13:25.162748373 +0300
@@ -86,6 +86,7 @@ DO_ERROR_INFO(SIGBUS, "Invalid Mem Acces
 DO_ERROR_INFO(SIGTRAP, "Breakpoint Set", trap_is_brkpt, TRAP_BRKPT)
 DO_ERROR_INFO(SIGBUS, "Misaligned Access", do_misaligned_error, BUS_ADRALN)
 DO_ERROR_INFO(SIGSEGV, "gcc generated __builtin_trap", do_trap5_error, 0)
+DO_ERROR_INFO(SIGBUS, "Machine Check", do_machine_check, BUS_MCEERR_AR )
 
 /*
  * Entry Point for Misaligned Data access Exception, for emulating in software






> -----Original Message-----
> From: Vineet Gupta [mailto:Vineet.Gupta1 at synopsys.com]
> Sent: Monday, May 21, 2018 19:59
> To: Ofer Levi(SW) <oferle at mellanox.com>
> Cc: linux-kernel at vger.kernel.org; Meir Lichtinger <meirl at mellanox.com>;
> arcml <linux-snps-arc at lists.infradead.org>
> Subject: Re: ARC compact700 NPS platform - EZ_MachineCheck exception
> handler
> 
> On 05/21/2018 07:14 AM, Ofer Levi(SW) wrote:
> > Resending, due to typo in LKML mail  address.
> 
> Also please CC linux-snps-arc at lists.infradead.org for any ARC Linux related
> posts.
> 
> >
> >   The EV_MachineCheck exception handler is halting the core for
> exceptions
> >   which are not tlb_overlap_fault.
> >   Since for the NPS platform each core is running a single thread in ZOL (Zero
> >   Overhead Linux) isolation mode, we feel that most of the time it is safe to
> >   resume execution instead of halting the core.
> 
> Most of the time is not good enough when dealing with OS code :-( A
> Machine check excepting implies something went terribly wrong. Some of
> those cases can be handled gracefully (such as duplicate TLB entry), but
> others can't so continuing despite it is recipe for disaster. Perhaps your chip
> has some spurious Machine check exceptions ?
> 
> >   I would appreciate it if you could review the change  below
> 
> Next time please send a real patch so I know right away what was changed.
> 
> > and let me know
> >   what you think, if this change is valid or if we missed or overlooked
> >   something.
> >   We are not looking to push this change upstream, but will be used on
> some
> >   systems.
> 
> Hmm, but you have to explain why those machine checks are fine !
> 
> >
> >   Please see below our implementation after label 1.
> >
> >   Thanks
> >   Ofer
> >
> >   ENTRY(EV_MachineCheck)
> >
> >   	EXCEPTION_PROLOGUE
> >
> > ...
> >   	brne    r3, ECR_C_MCHK_DUP_TLB, 1f
> >
> >   	bl      do_tlb_overlap_fault
> >   	b       ret_from_exception
> >
> >   1:
> >   	FAKE_RET_FROM_EXCPN
> 
> You don't need this.
> 
> >   	bl		do_machine_check  ; using DO_ERROR_INFO macro
> 
> We don't have above function in code. There's do_machine_check_fault()
> which calls
> die() -> flag 1 - so it would halt the kernel and would never return here.
> So your patch is broken in implementation as well.
> 
> >   	b       ret_from_exception
> >
> >   END(EV_MachineCheck)
> >
> >


