Re: [RFC][PATCH v2] parisc: Add alternative coding when running UP

On 2018-10-16 4:51 PM, Helge Deller wrote:
On 16.10.2018 14:08, John David Anglin wrote:
On 2018-10-16 1:34 AM, Helge Deller wrote:
On 15.10.2018 23:11, James Bottomley wrote:
On Sun, 2018-10-14 at 20:34 +0200, Helge Deller wrote:
This patch adds the necessary code to patch a running SMP kernel
at runtime to improve performance when running on a single CPU.

The current implementation offers two patching variants:
- Unwanted assembler statements like locking functions are overwritten
    with NOPs. When multiple instructions shall be skipped, one branch
    instruction is used instead of multiple nop instructions.
This seems like a good idea because our spinlocks are particularly
heavyweight.
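
As a rough illustration of the first variant described above, the following C sketch models the patching decision on a toy instruction array: a single unwanted instruction becomes a NOP, while a longer run is replaced by one forward branch that skips the rest. The toy_* names and the instruction model are assumptions made for illustration only; this is not the patch's actual code and it does not use real PA-RISC encodings.

```c
#include <stddef.h>

/* Toy instruction model (hypothetical; NOT real PA-RISC encodings). */
enum toy_op { TOY_OTHER, TOY_LOCK, TOY_NOP, TOY_BRANCH_FWD };

struct toy_insn {
    enum toy_op op;
    size_t skip;    /* TOY_BRANCH_FWD only: following insns to jump over */
};

/*
 * Make 'len' instructions starting at 'start' ineffective.
 * One instruction: overwrite it with a NOP.
 * Several instructions: overwrite only the first with a forward
 * branch over the remaining ones, so they are never executed.
 */
static void toy_patch_out(struct toy_insn *text, size_t start, size_t len)
{
    if (len == 0)
        return;

    if (len == 1) {
        text[start].op = TOY_NOP;
        text[start].skip = 0;
        return;
    }

    text[start].op = TOY_BRANCH_FWD;
    text[start].skip = len - 1;
}

/* Count how many instructions a straight-line run actually executes. */
static size_t toy_count_executed(const struct toy_insn *text, size_t n)
{
    size_t executed = 0;
    size_t pc = 0;

    while (pc < n) {
        executed++;
        if (text[pc].op == TOY_BRANCH_FWD)
            pc += 1 + text[pc].skip;    /* jump over the patched-out run */
        else
            pc++;
    }
    return executed;
}
```

In this model, patching out a run of four locking instructions leaves a single branch to execute in their place, whereas filling the run with NOPs would still execute four instructions.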

- Some pdtlb and pitlb instructions are patched to become pdtlb,l and
    pitlb,l which only flushes the CPU-local tlb entries instead of
    broadcasting the flush to other CPUs in the system and thus may
    improve performance.
I really don't think this matters: on a UP system, pdtlb,l and pdtlb
are the same instruction because the CPU already knows it has no
internal CPU bus to broadcast the purge over so it in effect executes a
pdtlb,l regardless.
I'd be happy to drop this part again.
But is that also true on an SMP system that was booted with maxcpus=1?
I would like to see what happens on panama. Panama is an rp3410. Currently, it takes
approximately 4042 cycles to flush one page (4096 bytes). This is way more than the number
of cycles that I see on my rp3440. My c3750 takes 450 cycles per page with the patch. It could
be that pdtlb,l and pdtlb are equivalent on c3750.
Depends on what you flush.
On c3750 we may get fooled because the kernel area could have been mapped via huge pages,
while on rp34x0 the PA8900 CPU prevents huge pages for kernel.
That may explain the performance difference between c3750 and rp3410, but not
the difference to rp3440.
Regardless of whether the kernel area is mapped via huge pages, the loop uses PAGE_SIZE, which is set to 4KB. I think there are 240 TLB entries on the above machines. Does the size of the mapping matter?

I could see huge pages slowing the test, as one would get a page fault after every purge. The Debian kernel
is built with CONFIG_HUGETLB_PAGE.
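
For readers following the cycles-per-page figures, here is a minimal C sketch of how such a number can be derived: read a counter, apply one operation per PAGE_SIZE step, read the counter again, and divide by the page count. This is not the test actually run in this thread. avg_cycles_per_page, the callback types, and the fake_* stand-ins are hypothetical names; on real hardware the `now` callback would read a cycle counter and `op` would issue the purge for that address.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback types: a cycle-counter read and a per-page operation. */
typedef uint64_t (*cycle_fn)(void);
typedef void (*page_op_fn)(uintptr_t addr);

/*
 * Apply 'op' once per page over 'npages' pages starting at 'start',
 * stepping by 'page_size', and return the average cost per page in
 * the units reported by 'now'.
 */
static uint64_t avg_cycles_per_page(uintptr_t start, size_t npages,
                                    size_t page_size,
                                    page_op_fn op, cycle_fn now)
{
    uint64_t t0, t1;
    size_t i;

    if (npages == 0)
        return 0;

    t0 = now();
    for (i = 0; i < npages; i++)
        op(start + i * page_size);
    t1 = now();

    return (t1 - t0) / npages;
}

/* Stand-ins for demonstration only: a fake counter and a fake purge
 * that always costs 450 counter ticks. */
static uint64_t fake_cycles;

static uint64_t fake_now(void)
{
    return fake_cycles;
}

static void fake_purge_450(uintptr_t addr)
{
    (void)addr;             /* the address is unused in this model */
    fake_cycles += 450;
}
```

Because the stride is always page_size, the loop issues one operation per 4KB step no matter how the underlying region is mapped, which is the point raised above about PAGE_SIZE.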

Is there something wrong with SMP on panama?
Oct  4 02:27:56 panama kernel: [    0.061736] smp: Bringing up secondary CPUs ...
Oct  4 02:27:56 panama kernel: [    0.061897] smp: Brought up 3 nodes, 1 CPU
Will check tomorrow.
I know replacing "sync and normal store" with ordered store in spin lock release makes a
significant difference in the above timing.  Plan to send patch tonight.
What exactly do you want me to test on panama?
pdtlb versus pdtlb,l.  It seems pdtlb is very slow on panama.
Is the git head of my latest for-next tree [1] OK?
It's a moving target ;-)

Helge

[1] https://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git/log/?h=for-next


Dave

--
John David Anglin  dave.anglin@xxxxxxxx



