Re: Longstanding bug in our IRQ code (irqbalance HPMCs parisc SMP machines)

> > Sounds like it. Though the HPMCs are clearly different than the PTE issues
> > that jda/carlos are seeing.
> 
> True, but I remember Debian staff complaining about random unexplained  
> hangs, I wouldn't be too surprised if this came into play...

I have been looking at the PTE/TLB code for the last couple of weeks.
I now believe that the majority of the random hangs and segvs at program
startup are related to problems with the PTE/TLB code.

1) minifail bug

As previously identified, there is a cache flush problem in ptep_set_wrprotect.
I did a kernel build with a
"WARN_ON(pte_present(old_pte) && pte_dirty(old_pte))" in ptep_set_wrprotect
and it triggered immediately.  So, we have to deal with a dirty cache in
ptep_set_wrprotect.  As far as I can tell, this problem is fixed by putting
the flush inside preempt_disable()/preempt_enable().
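
To make the ordering concrete, here is a rough userspace model of what I
mean.  It is not the actual patch: the PTE bit values, flush_page_model()
and the counters are made up for illustration, and preempt_disable()/
preempt_enable() are stubs that only count nesting.  The point is just
that the dirty check, the flush and the PTE store all sit inside one
preempt-disabled region.

```c
/* Userspace model only: these stubs stand in for kernel primitives. */
#include <assert.h>

#define PTE_PRESENT 0x1u   /* bit values are made up for the model */
#define PTE_DIRTY   0x2u
#define PTE_WRITE   0x4u

typedef unsigned int pte_t;

static int preempt_count;            /* models the preempt nesting count */
static int flushes_with_preempt_off; /* flushes issued while count > 0 */

static void preempt_disable(void) { preempt_count++; }

static void preempt_enable(void)
{
    assert(preempt_count > 0);       /* catch an unbalanced enable */
    preempt_count--;
}

/* Hypothetical stand-in for the cache flush of the mapped page. */
static void flush_page_model(void)
{
    if (preempt_count > 0)
        flushes_with_preempt_off++;
}

static void ptep_set_wrprotect_model(pte_t *ptep)
{
    pte_t old;

    preempt_disable();
    old = *ptep;
    if ((old & PTE_PRESENT) && (old & PTE_DIRTY))
        flush_page_model();          /* flush before dropping write access */
    *ptep = old & ~PTE_WRITE;
    preempt_enable();
}
```

In the model a present, dirty PTE gets exactly one flush, issued while
the preempt count is raised, and the count is back to zero on return.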

Helge is triggering other PTE issues when he runs multiple copies of
minifail.  The minifail program itself fails on a UP system; with the above
fix, it no longer fails.

2) SMP page table entry corruption

There needs to be a lock around pte updates to ensure that pte modifications
are consistent on SMP machines.  I have done this and it helps stability
on rp3440.
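
As a sketch of the pattern (again a userspace model, not the kernel
change): every read-modify-write of a PTE goes through one lock, so two
CPUs changing different bits cannot overwrite each other's update.  The
single global pte_lock, the helper names and the bit values are
assumptions for the example; the kernel would use its own spinlock
primitives rather than a C11 atomic_flag.

```c
#include <assert.h>
#include <stdatomic.h>

typedef unsigned long pte_t;

#define PTE_WRITE    0x4ul   /* bit values are made up for the model */
#define PTE_ACCESSED 0x8ul

/* Hypothetical single global lock guarding all PTE updates. */
static atomic_flag pte_lock = ATOMIC_FLAG_INIT;

static void pte_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&pte_lock, memory_order_acquire))
        ;                    /* spin until the holder releases the lock */
}

static void pte_lock_release(void)
{
    atomic_flag_clear_explicit(&pte_lock, memory_order_release);
}

/*
 * Clear and set bits in *ptep as one locked read-modify-write.
 * Returns the PTE value seen before the update.
 */
static pte_t pte_update_locked(pte_t *ptep, pte_t clear, pte_t set)
{
    pte_t old;

    assert(ptep);
    pte_lock_acquire();
    old = *ptep;
    *ptep = (old & ~clear) | set;
    pte_lock_release();
    return old;
}
```

Without the lock, two CPUs could both load the same old value and the
second store would silently drop the first CPU's change.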

3) SMP PTE/TLB consistency

We are not purging existing translations when an update to a PTE value
is done such as in ptep_set_wrprotect.  As a result, the parent of a fork
doesn't immediately trigger a COW break on a page when a write occurs
if there is an existing translation.

Adding a TLB purge helps, but even this may be lost.  For example,

page fault (cpu1)
  -> load old pte (cpu1)
    -> store new pte (cpu2)
      -> global tlb purge (cpu2)
        -> insert old pte into tlb (cpu1)

It may be possible to use locking to ensure that purges aren't lost.  However,
this is a bit tricky...
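
The interleaving above can be replayed deterministically in a small
userspace model.  This is only an illustration under simplifying
assumptions: one PTE, one TLB slot, and made-up names (struct model,
replay_lost_purge, replay_serialized).  The second function shows what
we would get if the fault path's load+insert and the update+purge were
made mutually exclusive; it does not say how to achieve that in the
kernel.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int pte_t;

struct model {
    pte_t pte;        /* page table entry in memory */
    pte_t tlb;        /* the one cached translation */
    bool  tlb_valid;
};

/* cpu1 fault path: load the PTE and insert it into the TLB. */
static void fault_insert(struct model *m)
{
    m->tlb = m->pte;
    m->tlb_valid = true;
}

/* cpu2 update path: store the new PTE, then purge the TLB. */
static void update_and_purge(struct model *m, pte_t new_pte)
{
    m->pte = new_pte;
    m->tlb_valid = false;
}

static bool tlb_is_stale(const struct model *m)
{
    return m->tlb_valid && m->tlb != m->pte;
}

/* Replays the five steps in the order shown; true if the TLB is stale. */
static bool replay_lost_purge(pte_t old_pte, pte_t new_pte)
{
    struct model m = { .pte = old_pte, .tlb_valid = false };
    pte_t loaded;

    assert(old_pte != new_pte);   /* the update must change the PTE */
    loaded = m.pte;               /* cpu1: page fault, load old pte */
    m.pte = new_pte;              /* cpu2: store new pte */
    m.tlb_valid = false;          /* cpu2: global tlb purge */
    m.tlb = loaded;               /* cpu1: insert old pte into tlb */
    m.tlb_valid = true;
    return tlb_is_stale(&m);
}

/* Same two paths, but each runs as an indivisible unit. */
static bool replay_serialized(pte_t old_pte, pte_t new_pte, bool fault_first)
{
    struct model m = { .pte = old_pte, .tlb_valid = false };

    if (fault_first) {
        fault_insert(&m);
        update_and_purge(&m, new_pte);
    } else {
        update_and_purge(&m, new_pte);
        fault_insert(&m);
    }
    return tlb_is_stale(&m);
}
```

In the model the interleaved order leaves a stale translation, while
either serialized order does not: the purge removes the old entry, or
the fault path loads the new PTE.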

Dave
-- 
J. David Anglin                                  dave.anglin@xxxxxxxxxxxxxx
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)