Re: [parisc] A500 boot crash with 44786880df196a4200c178945c4d41675faf9fb7

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2018-10-25 2:28 AM, Helge Deller wrote:
* John David Anglin <dave.anglin@xxxxxxxx>:
On 2018-10-24 3:34 PM, Helge Deller wrote:
On 24.10.2018 19:29, John David Anglin wrote:
On 2018-10-24 11:45 AM, Helge Deller wrote:
On 24.10.2018 17:03, John David Anglin wrote:
On 2018-10-24 7:59 AM, John David Anglin wrote:
The fault occured executing this instruction "stw r31,0(r25)". Register r31 contains the following
instruction "pdtlb,l r0(sr1,r3)".  This indicates the fault occurred during alternative patching.

I suspect all kernel TLB entries need to be flushed prior to alternative patching to ensure that kernel
pages are writeable.
Looks like this is a problem with set_kernel_text_rw().  Maybe this causes problems:

int __flush_tlb_range(unsigned long sid, unsigned long start,
                         unsigned long end)
{
           unsigned long flags;

           if ((!IS_ENABLED(CONFIG_SMP) || !arch_irqs_disabled()) &&
               end - start >= parisc_tlb_flush_threshold) {
                   flush_tlb_all();
                   return 1;
           }

I believe that we need to disable this optimization until the parisc_tlb_flush_threshold is
calculated.  I think this crash is related to the occasional crash in parisc_setup_cache_timing().

Maybe change in cache.c the initial define for parisc_tlb_flush_threshold:
static unsigned long parisc_tlb_flush_threshold __read_mostly = ~0UL;
If it would run into flush_tlb_all(), then I'd expect that all TLBs have been flushed and
we wouldn't see an issue.
Maybe the info in the cache_info struct, which is used in the assembly of flush_tlb_all_local(),
hasn't been initialized yet and such the whole cache hasn't been flushed?
Since the fault occurred before the write bit is removed, it seems to
me that the only way this can happen is that the TLB entry is left
over from a previous instantiation of the OS.
Agreed.

parisc_kernel_start() doesn't seem to whack TLB.  This suggests that
__flush_tlb_range() call in set_kernel_text_rw() didn't work as
expected.
Yes, seems so.
This system has only one CPU, so one flush_tlb_all_local() should have been sufficient.
Yes, I believe only one CPU is up at this point.
On a SMP system, more may be up, but in a waiting loop...
I tested using flush_tlb_all_local() on my c8000 (SMP with four CPUs) and it booted okay a couple of times.

   Maybe this should be tried instead of the range flush.  Should be
faster.
Here are some ideas:

I really think we need to include the __init text section too.  Built-in
device divers may use asm_io_fdc() or asm_io_sync() in their init code.

Should we flush the d-cache before mapping r/o again?
I think so.  We want to ensure the new instructions are flushed to memory.

As discussed abobe, maybe we should simply use flush_tlb_all_local()?
It's on the current CPU which will do the patching and needs those new
tlb entries.

All of the above doesn't explain why Meelis had a crash at boot though...
Space register sr1 is racy - that is its value could change if an interruption occurs between when it is loaded and when it is used.  I think the only way the routine fails is for sr1 to get corrupted.

The PA 1.1 TLB handlers use sr1 but the value is saved and restored.  The PA 2.0 handlers don't
use sr1.

I looked at the code to make sure this wasn't caused by gcc reordering code.  However, sr1 is loaded immediately before the pdtlb and pitlb instructions.  External interrupts are turned off.

Do we need more instructions between turning interrupts off and the instruction to load sr1 due to instruction reordering on PA 2.0?  If the rsm in the spin lock and the mtsp use different general registers, I'm not sure there's an ordering dependency.  So, the mtsp might execute before interrupts are disabled
by the rsm instruction.  The IRB has 26 entries on PA 2.0.

Using flush_tlb_all_local() likely will fix the fault during the alternative patching, but the same issue
is likely to occur running user code and cause random faults in user code.

For kernel range flushes, we don't need to use sr1.  The code should be rewritten to use implicit
space register selection.

Signed-off-by: Helge Deller <deller@xxxxxx>

diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
index e7e626bcd0be..ddf5c6231fb0 100644
--- a/arch/parisc/mm/init.c
+++ b/arch/parisc/mm/init.c
@@ -513,14 +513,18 @@ static void __init map_pages(unsigned long start_vaddr,
void __init set_kernel_text_rw(int enable_read_write)
  {
-	unsigned long start = (unsigned long)_stext;
+	unsigned long start = (unsigned long)__init_begin;
  	unsigned long end   = (unsigned long)_etext;
+ /* flush modified instructions before mapping ro */
+	if (!enable_read_write)
+		flush_kernel_dcache_range_asm(start, end);
+
  	map_pages(start, __pa(start), end-start,
  		PAGE_KERNEL_RWX, enable_read_write ? 1:0);
/* force the kernel to see the new TLB entries */
-	__flush_tlb_range(0, start, end);
+	flush_tlb_all_local(NULL);
/* dump old cached instructions */
  	flush_icache_range(start, end);


Dave

--
John David Anglin  dave.anglin@xxxxxxxx




[Index of Archives]     [Linux SoC]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux