Re: [PATCH] parisc: fix race conditions flushing user cache pages

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 5/30/2013 11:11 AM, James Bottomley wrote:
On Thu, 2013-05-30 at 17:01 +0200, Helge Deller wrote:
On 05/30/2013 04:55 PM, James Bottomley wrote:
On Tue, 2013-05-28 at 08:09 -0400, John David Anglin wrote:
There are two issues addressed by this patch:

1) When we flush user instruction and data pages using the kernel
tmpalias region, we need to
ensure that the local TLB entry for the tmpalias page is never
replaced with a different mapping.
Previously, we purged the entry before and after the cache flush
loop.  Although preemption was
disabled, it seemed possible that the value might change during
interrupt processing.  The patch
removes the purge and disables interrupts during the initial TLB entry
purge and cache flush.

2) In a number of places, we flush the TLB for the page and then flush
the page.  We disabled
preemption around the flush.  This change disables preemption around
both the TLB and cache
flushes as it seemed the effect of the purge might be lost.

Without this change, I saw four random segmentation faults in about
1.5 days of intensive package
building last weekend.  With the change, I haven't seen a single
random segmentation fault in about
one week of building Debian packages on 4-way rp3440.  So, there is a
significant improvement
in system stability.
an rp3440 is PA2.0, so you weren't really testing any of the tlb purge
locking changes.
Which kind of system do we need to test those "tlb purge locking changes" (PAx.y)?
the #ifdef is CONFIG_PA20, so any 1.x system should work.  The actual
locks are only needed for the Merced systems, which are N class because
they have an architectural bug where there can only be one purge active
on the CPU bus at any one time.
With respect to the functions doing copies and flushes via the tmpalias region, it's my theory that the locks aren't needed for local purges on PA 2.0. This was implemented
previously and isn't relevant to current change.

For global TLB purges, I think the problem is also present on rp3440 (i.e., disabling
the locks causes problems).  I'm fairly sure that I tried this.

At least I can confirm, that Dave's patches have made all my systems
absolutely stable.
Can we try the proper fix before things like disabling interrupts.
copy_page_asm and clear_page_asm don't have the necessary purges at the
end.  I don't think this ever mattered for clear_page_asm since it's
only used in the memory backend for pages going to userspace (so not at
interrupt), but for some reason we've suddenly started using
copy_page_asm, which could be used at interrupt.

I honestly don't think we should be using copy_page_asm at all.  The
speed up we get from it isn't realised because we have to flush the
cache after we've finished to try and make I/D coherence.  This makes
the call the same cost as the kmapped memcpy but requires twice the
tmpalias space to be allocated because of the from and to mappings.

James

Also, I don't know what happened, but the actual tmpalias theory
requires a TLB purge before and after and I though we used to have them.
The reason is twofold:

      1. You don't want the caches to speculate in the tmpalias region
      2. A flush after makes the routines interrupt safe (because you can
         interrupt in a tmpalias operation, do another tmpalias
         operation, purge the cache and restart within the non interrupt
         tmpalias and expect everything to work).
I had questioned whether this is true. In any case, if an interrupt occurs, the TLB entry needs a reload. So, if latency isn't an issue, it's likely better to disable interrupts. One also doesn't
need after TLB purge.

The copy and clear routines never had TLB purges after the loops. That needs fixing
if we follow your suggestion.  At the moment, these routines are unused.


Trying to disable interrupts sounds like problem 2.  Can we return to
the proper tmpalias operations rather than trying to hack around them?
I'm willing to test to see if adding the purges to clear_user_page and copy_user_page helps. They just need to call flush_kernel_dcache_page_addr instead of flush_kernel_dcache_page_asm.

Dave

--
John David Anglin    dave.anglin@xxxxxxxx

--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux SoC]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux