On 2018-08-24 9:27 AM, Rolf Eike Beer wrote:
> On 2018-08-24 14:09, Helge Deller wrote:
>> On 2018-08-21 2:31 PM, Rolf Eike Beer wrote:
>>> With this patch I get this timing:
>>> gcc7 - Time: 2018-08-24T01:42:07
>>> gcc6 - Time: 2018-08-24T06:01:27
>>> Somewhere in between 4.17.3 and 4.18.0.
>> Rolf, I think plain 4.18 kernel misses Dave's speed-up patches:
>> * http://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7797167ffde1f00446301cb22b37b7c03194cfaf
>> * http://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3b885ac1dc35b87a39ee176a6c7e2af9c789d8b8
>> Both patches have been scheduled to be added to 4.18-stable kernel.
> I guess I better revert the previous patch from Dave, no?
I would say leave the patch. The big part of the slowdown is the sync barrier in the TLB handler. The above patches don't address this issue; they should speed up spin locks in general.
On PA 2.0 SMP machines, we need either a sync or an ordered store to release a spin lock. Otherwise, the lock may be released before the other accesses in the lock region are complete. As a result, the operation isn't atomic from the perspective of other CPUs. There's no getting around this issue on PA 2.0 systems.
I plan to look more at using ordered loads and stores in the spin lock code, as they clearly don't impact performance as much as sync.
Regarding the TLB code, it turned out we were always setting the page accessed bit for user pages. So, the code to set it when a user page is accessed is redundant. We need to lock to update the accessed and dirty bits atomically. We can keep the current TLB locking code and not set the page accessed bit in our user page defines. This should improve swap, but the TLB handler is more complex. Another alternative is to remove the locking and accessed-update code from the TLB handler. This provides the best performance for TLB inserts, but swap performance will be worse since we don't track the accessed bit.
Dave
--
John David Anglin dave.anglin@xxxxxxxx