Re: [RFC][PATCH v2] parisc: Add alternative coding when running UP

On 16.10.2018 23:45, John David Anglin wrote:
> On 2018-10-16 4:51 PM, Helge Deller wrote:
>> On 16.10.2018 14:08, John David Anglin wrote:
>>> On 2018-10-16 1:34 AM, Helge Deller wrote:
>>>> On 15.10.2018 23:11, James Bottomley wrote:
>>>>> On Sun, 2018-10-14 at 20:34 +0200, Helge Deller wrote:
>>>>>> This patch adds the necessary code to patch a running SMP kernel
>>>>>> at runtime to improve performance when running on a single CPU.
>>>>>>
>>>>>> The current implementation offers two patching variants:
>>>>>> - Unwanted assembler statements like locking functions are overwritten
>>>>>>   with NOPs. When multiple instructions need to be skipped, a single
>>>>>>   branch instruction is used instead of multiple NOPs.
>>>>> This seems like a good idea because our spinlocks are particularly
>>>>> heavyweight.
>>>>>
>>>>>> - Some pdtlb and pitlb instructions are patched to become pdtlb,l and
>>>>>>   pitlb,l, which only flush the CPU-local TLB entries instead of
>>>>>>   broadcasting the flush to other CPUs in the system, and thus may
>>>>>>   improve performance.
>>>>> I really don't think this matters: on a UP system, pdtlb,l and pdtlb
>>>>> are the same instruction, because the CPU already knows it has no
>>>>> internal CPU bus to broadcast the purge over, so in effect it executes
>>>>> a pdtlb,l regardless.
>>>> I'd be happy to drop this part again.
>>>> But is that also true on an SMP system that has been booted with maxcpus=1?
>>> I would like to see what happens on panama.  Panama is an rp3410.  Currently it takes
>>> approximately 4042 cycles to flush one page (4096 bytes).  This is way more than the number
>>> of cycles that I see on my rp3440.  My c3750 takes 450 cycles per page with the patch.
>>> It could be that pdtlb,l and pdtlb are equivalent on the c3750.
>> It depends on what you flush.
>> On the c3750 we may get fooled because the kernel area could have been mapped via huge pages,
>> while on the rp34x0 machines the PA8900 CPU prevents huge pages for the kernel.
>> That may explain the performance difference between the c3750 and the rp3410, but not
>> the difference to the rp3440.
> Regardless of whether the kernel area is mapped via huge pages, the loop uses PAGE_SIZE, which is set to 4 KB.
> I think there are 240 TLB entries on the above machines.  Does the size of the mapping matter?
> 
> I could see huge pages slowing the test, as one would get a page fault after every purge.  The Debian
> kernel is built with CONFIG_HUGETLB_PAGE.
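Before the numbers, here is a rough idea of what the patching pass described in the quoted
patch text boils down to.  This is only a self-contained sketch, NOT the code from the patch;
the table layout, the names and the branch "encoding" below are made up for illustration:

/*
 * Sketch only: walk a table of SMP-only code sites and, when running on a
 * single CPU, overwrite them with NOPs -- or with one branch when more than
 * one instruction has to be skipped.  The real kernel code would also have
 * to flush the I-cache for each patched range.
 */
#include <stdint.h>
#include <stdio.h>

#define INSN_NOP 0x08000240u	/* "or %r0,%r0,%r0" (assumed NOP encoding) */

struct alt_entry {
	uint32_t *addr;		/* first instruction of an SMP-only sequence */
	unsigned int len;	/* number of instructions in that sequence   */
};

/* Placeholder: a real version would emit a proper "b,n .+disp" word. */
static uint32_t branch_over(unsigned int insns)
{
	return 0xe8000000u + insns;	/* illustrative value only */
}

static void apply_alternatives(struct alt_entry *tbl, unsigned int n,
			       unsigned int online_cpus)
{
	if (online_cpus != 1)
		return;				/* leave the SMP code untouched */

	for (unsigned int i = 0; i < n; i++) {
		if (tbl[i].len == 1)
			tbl[i].addr[0] = INSN_NOP;	/* single insn: just NOP it */
		else
			/* skip the whole sequence with a single branch */
			tbl[i].addr[0] = branch_over(tbl[i].len - 1);
		/* here the kernel would flush the I-cache for the patched range */
	}
}

int main(void)
{
	/* fake .text with a 3-instruction locking sequence at the start */
	uint32_t text[4] = { 0x11111111u, 0x22222222u, 0x33333333u, 0x44444444u };
	struct alt_entry table[] = { { &text[0], 3 } };

	apply_alternatives(table, 1, 1);
	for (int i = 0; i < 4; i++)
		printf("insn[%d] = 0x%08x\n", i, (unsigned)text[i]);
	return 0;
}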

Here are some numbers for an L3000 (rp5470) and for panama (rp3410):

rp5470:
cpu family      : PA-RISC 2.0
cpu             : PA8700 (PCX-W2)
cpu MHz         : 875.000000
capabilities    : os64 iopdir_fdc nva_supported (0x05)
model           : 9000/800/L3000-8x
model name      : Marcato W+ (rp5470)?
I-cache         : 768 KB
D-cache         : 1536 KB (WB, direct mapped)
ITLB entries    : 240
DTLB entries    : 240 - shared with ITLB

4.19.0-rc8-64bit+ (plain Linux git head)
[    3.909737] CPU(s): 4 out of 4 PA8700 (PCX-W2) at 875.000000 MHz online
[    3.921368] Whole cache flush 632875 cycles, flushing 19197952 bytes 9371547 cycles
[    3.921387] Cache flush threshold set to 1266 KiB
[    3.922510] TLB flush threshold set to 240 KiB

4.19.0-rc7-64bit+ (all for-next patches incl. Dave's, but without alternative patching):
[    4.143616] CPU(s): 4 out of 4 PA8700 (PCX-W2) at 875.000000 MHz online
[    4.154970] Whole cache flush 629995 cycles, flushing 19173376 bytes 9103971 cycles
[    4.154992] Cache flush threshold set to 1295 KiB
[    4.155181] TLB flush threshold set to 240 KiB

4.19.0-rc7-64bit+ (all for-next patches incl. Dave's and WITH alternative patching):
[    4.143193] CPU(s): 4 out of 4 PA8700 (PCX-W2) at 875.000000 MHz online
[   28.327580] Whole cache flush 665022 cycles, flushing 19169280 bytes 9328514 cycles
[   28.327621] Cache flush threshold set to 1334 KiB
[   28.327828] TLB flush threshold set to 240 KiB

4.19.0-rc7-64bit+ (all for-next patches incl. Dave's and WITH alternative patching), booted with "maxcpus=1":
[    4.117685] CPU(s): 1 out of 4 PA8700 (PCX-W2) at 875.000000 MHz online
[   38.509965] Cache flush threshold set to 828 KiB
[   38.511763] Whole TLB flush 11664 cycles, Range flush 19169280 bytes 1384989 cycles
[   38.624308] Calculated TLB flush threshold 160 KiB
[   38.624477] TLB flush threshold set to 160 KiB



panama:
cpu family      : PA-RISC 2.0
cpu             : PA8900 (Shortfin)
cpu MHz         : 800.002200
capabilities    : os64 iopdir_fdc needs_equivalent_aliasing (0x35)
model           : 9000/800/rp3410
model name      : Storm Peak DC- Slow Mako+
I-cache         : 65536 KB
D-cache         : 65536 KB (WB, direct mapped)
ITLB entries    : 240
DTLB entries    : 240 - shared with ITLB
bogomips        : 1594.36

Debian kernel: 4.18.0-2-parisc64-smp
[    1.144459] CPU(s): 1 out of 1 PA8900 (Shortfin) at 800.002200 MHz online
[    1.153842] Cache flush threshold set to 39768 KiB
[    1.177785] Whole TLB flush 6231 cycles, Range flush 18874368 bytes 18987500 cycles
[    1.178038] Calculated TLB flush threshold 8 KiB
[    1.178411] TLB flush threshold set to 512 KiB

4.19.0-rc8-64bit+ (vmlinuz-4.19-rc8) plain git head, no parisc patches
[    1.105625] CPU(s): 1 out of 1 PA8900 (Shortfin) at 800.002200 MHz online
[    1.115685] Whole cache flush 4271732 cycles, flushing 19197952 bytes 2028353 cycles
[    1.115702] Cache flush threshold set to 39483 KiB
[    1.136859] Whole TLB flush 6189 cycles, flushing 19197952 bytes 16779052 cycles
[    1.136869] TLB flush threshold set to 8 KiB

4.19.0-rc7-64bit+ (all for-next patches incl. Dave's, but without alternative patching): (vmlinuz-4.19-rc7-noalternative)
[    1.233597] CPU(s): 1 out of 1 PA8900 (Shortfin) at 800.002200 MHz online
[    1.243121] Whole cache flush 4268326 cycles, flushing 19173376 bytes 2041619 cycles
[    1.243137] Cache flush threshold set to 39145 KiB
[    1.262504] Whole TLB flush 5430 cycles, Range flush 19173376 bytes 15324980 cycles
[    1.262758] Calculated TLB flush threshold 8 KiB
[    1.263126] TLB flush threshold set to 16 KiB

4.19.0-rc7-64bit+ (all for-next patches incl. Dave's and WITH alternative patching): (vmlinuz-4.19-rc7-for-next)
[    1.181601] CPU(s): 1 out of 1 PA8900 (Shortfin) at 800.002200 MHz online
[    2.662065] Whole cache flush 4287666 cycles, flushing 19169280 bytes 2040021 cycles
[    2.662087] Cache flush threshold set to 39345 KiB
[    2.663462] Whole TLB flush 7563 cycles, Range flush 19169280 bytes 940355 cycles
[    2.663718] Calculated TLB flush threshold 152 KiB
[    2.665174] TLB flush threshold set to 152 KiB
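Per-page TLB numbers for panama, assuming the "Range flush" lines report the total cycles
for the whole range and 4 KB pages:
- Debian kernel (plain pdtlb):        18987500 / (18874368 / 4096)  ~= 4120 cycles/page
- for-next without alt. patching:     15324980 / (19173376 / 4096)  ~= 3270 cycles/page
- for-next with pdtlb,l patching:       940355 / (19169280 / 4096)  ~=  200 cycles/page
So the pdtlb -> pdtlb,l patching buys roughly a factor of 16 on this box, which fits
Dave's observation that pdtlb is very slow on panama.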


>>> Is there something wrong with SMP on panama?
>>> Oct  4 02:27:56 panama kernel: [    0.061736] smp: Bringing up secondary CPUs ...
>>> Oct  4 02:27:56 panama kernel: [    0.061897] smp: Brought up 3 nodes, 1 CPU
>> Will check tomorrow.

I think this is triggered by the three memory ranges which the firmware reports on panama:
[    0.000000] Memory Ranges:
[    0.000000]  0) Start 0x0000000000000000 End 0x000000003fffffff Size   1024 MB
[    0.000000]  1) Start 0x0000000100000000 End 0x000000013fdfffff Size   1022 MB
[    0.000000]  2) Start 0x0000004040000000 End 0x00000040ffffffff Size   3072 MB
[    0.000000] Total Memory: 5118 MB

In arch/parisc/mm/init.c:296 we have:
        for (i = 0; i < npmem_ranges; i++) {
                node_set_state(i, N_NORMAL_MEMORY);
                node_set_online(i);
        }
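With npmem_ranges == 3 on panama, that loop effectively does:

        /* one memory-only node per firmware memory range */
        node_set_state(0, N_NORMAL_MEMORY);  node_set_online(0);
        node_set_state(1, N_NORMAL_MEMORY);  node_set_online(1);
        node_set_state(2, N_NORMAL_MEMORY);  node_set_online(2);

which is where the "smp: Brought up 3 nodes, 1 CPU" message comes from: the node count
follows the memory ranges, not the CPUs.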

Not sure if it's worth fixing (or whether it even needs fixing).

>>> I know that replacing the "sync and normal store" with an ordered store in the spin lock release makes a
>>> significant difference in the above timing.  I plan to send a patch tonight.
>> What exactly do you want me to test on panama?
> pdtlb versus pdtlb,l.  It seems pdtlb is very slow on panama.

Check the numbers above.

Helge


