On 16.10.2018 23:45, John David Anglin wrote:
> On 2018-10-16 4:51 PM, Helge Deller wrote:
>> On 16.10.2018 14:08, John David Anglin wrote:
>>> On 2018-10-16 1:34 AM, Helge Deller wrote:
>>>> On 15.10.2018 23:11, James Bottomley wrote:
>>>>> On Sun, 2018-10-14 at 20:34 +0200, Helge Deller wrote:
>>>>>> This patch adds the necessary code to patch a running SMP kernel
>>>>>> at runtime to improve performance when running on a single CPU.
>>>>>>
>>>>>> The current implementation offers two patching variants:
>>>>>> - Unwanted assembler statements like locking functions are
>>>>>>   overwritten with NOPs. When multiple instructions shall be
>>>>>>   skipped, one branch instruction is used instead of multiple
>>>>>>   nop instructions.
>>>>> This seems like a good idea because our spinlocks are particularly
>>>>> heavyweight.
>>>>>
>>>>>> - Some pdtlb and pitlb instructions are patched to become pdtlb,l
>>>>>>   and pitlb,l, which only flush the CPU-local TLB entries instead
>>>>>>   of broadcasting the flush to other CPUs in the system and thus
>>>>>>   may improve performance.
>>>>> I really don't think this matters: on a UP system, pdtlb,l and
>>>>> pdtlb are the same instruction, because the CPU already knows it
>>>>> has no internal CPU bus to broadcast the purge over, so in effect
>>>>> it executes a pdtlb,l regardless.
>>>> I'd be happy to drop this part again.
>>>> But is that also true on an SMP system booted with maxcpus=1?
>>> I would like to see what happens on panama. Panama is an rp3410.
>>> Currently, it takes approximately 4042 cycles to flush one page
>>> (4096 bytes). This is way more than the number of cycles that I see
>>> on my rp3440. My c3750 takes 450 cycles per page with the patch. It
>>> could be that pdtlb,l and pdtlb are equivalent on the c3750.
>> Depends on what you flush.
>> On the c3750 we may get fooled because the kernel area could have
>> been mapped via huge pages, while on the rp34x0 the PA8900 CPU
>> prevents huge pages for the kernel.
>> That may explain the performance difference between the c3750 and
>> the rp3410, but not the difference to the rp3440.
> Regardless of whether the kernel area is mapped via huge pages, the
> loop uses PAGE_SIZE, which is set to 4KB. I think there are 240 TLB
> entries on the above machines. Does the size of the mapping matter?
>
> I could see huge pages slowing the test, as one would get a page
> fault after every purge. The Debian kernel is built with
> CONFIG_HUGETLB_PAGE.

Here are some numbers for an L3000 (rp5470) and panama (rp3410):

rp5470:
cpu family      : PA-RISC 2.0
cpu             : PA8700 (PCX-W2)
cpu MHz         : 875.000000
capabilities    : os64 iopdir_fdc nva_supported (0x05)
model           : 9000/800/L3000-8x
model name      : Marcato W+ (rp5470)?
I-cache         : 768 KB
D-cache         : 1536 KB (WB, direct mapped)
ITLB entries    : 240
DTLB entries    : 240 - shared with ITLB

4.19.0-rc8-64bit+ (plain Linux git head):
[    3.909737] CPU(s): 4 out of 4 PA8700 (PCX-W2) at 875.000000 MHz online
[    3.921368] Whole cache flush 632875 cycles, flushing 19197952 bytes 9371547 cycles
[    3.921387] Cache flush threshold set to 1266 KiB
[    3.922510] TLB flush threshold set to 240 KiB

4.19.0-rc7-64bit+ (all for-next patches incl. Dave's, but without alternative patching):
[    4.143616] CPU(s): 4 out of 4 PA8700 (PCX-W2) at 875.000000 MHz online
[    4.154970] Whole cache flush 629995 cycles, flushing 19173376 bytes 9103971 cycles
[    4.154992] Cache flush threshold set to 1295 KiB
[    4.155181] TLB flush threshold set to 240 KiB

4.19.0-rc7-64bit+ (all for-next patches incl. Dave's and WITH alternative patching):
[    4.143193] CPU(s): 4 out of 4 PA8700 (PCX-W2) at 875.000000 MHz online
[   28.327580] Whole cache flush 665022 cycles, flushing 19169280 bytes 9328514 cycles
[   28.327621] Cache flush threshold set to 1334 KiB
[   28.327828] TLB flush threshold set to 240 KiB

4.19.0-rc7-64bit+ (all for-next patches incl. Dave's and WITH alternative patching), booted with "maxcpus=1":
[    4.117685] CPU(s): 1 out of 4 PA8700 (PCX-W2) at 875.000000 MHz online
[   38.509965] Cache flush threshold set to 828 KiB
[   38.511763] Whole TLB flush 11664 cycles, Range flush 19169280 bytes 1384989 cycles
[   38.624308] Calculated TLB flush threshold 160 KiB
[   38.624477] TLB flush threshold set to 160 KiB

panama:
cpu family      : PA-RISC 2.0
cpu             : PA8900 (Shortfin)
cpu MHz         : 800.002200
capabilities    : os64 iopdir_fdc needs_equivalent_aliasing (0x35)
model           : 9000/800/rp3410
model name      : Storm Peak DC- Slow Mako+
I-cache         : 65536 KB
D-cache         : 65536 KB (WB, direct mapped)
ITLB entries    : 240
DTLB entries    : 240 - shared with ITLB
bogomips        : 1594.36

Debian kernel 4.18.0-2-parisc64-smp:
[    1.144459] CPU(s): 1 out of 1 PA8900 (Shortfin) at 800.002200 MHz online
[    1.153842] Cache flush threshold set to 39768 KiB
[    1.177785] Whole TLB flush 6231 cycles, Range flush 18874368 bytes 18987500 cycles
[    1.178038] Calculated TLB flush threshold 8 KiB
[    1.178411] TLB flush threshold set to 512 KiB

4.19.0-rc8-64bit+ (vmlinuz-4.19-rc8), plain git head, no parisc patches:
[    1.105625] CPU(s): 1 out of 1 PA8900 (Shortfin) at 800.002200 MHz online
[    1.115685] Whole cache flush 4271732 cycles, flushing 19197952 bytes 2028353 cycles
[    1.115702] Cache flush threshold set to 39483 KiB
[    1.136859] Whole TLB flush 6189 cycles, flushing 19197952 bytes 16779052 cycles
[    1.136869] TLB flush threshold set to 8 KiB

4.19.0-rc7-64bit+ (all for-next patches incl. Dave's, but without alternative patching) (vmlinuz-4.19-rc7-noalternative):
[    1.233597] CPU(s): 1 out of 1 PA8900 (Shortfin) at 800.002200 MHz online
[    1.243121] Whole cache flush 4268326 cycles, flushing 19173376 bytes 2041619 cycles
[    1.243137] Cache flush threshold set to 39145 KiB
[    1.262504] Whole TLB flush 5430 cycles, Range flush 19173376 bytes 15324980 cycles
[    1.262758] Calculated TLB flush threshold 8 KiB
[    1.263126] TLB flush threshold set to 16 KiB

4.19.0-rc7-64bit+ (all for-next patches incl. Dave's and WITH alternative patching) (vmlinuz-4.19-rc7-for-next):
[    1.181601] CPU(s): 1 out of 1 PA8900 (Shortfin) at 800.002200 MHz online
[    2.662065] Whole cache flush 4287666 cycles, flushing 19169280 bytes 2040021 cycles
[    2.662087] Cache flush threshold set to 39345 KiB
[    2.663462] Whole TLB flush 7563 cycles, Range flush 19169280 bytes 940355 cycles
[    2.663718] Calculated TLB flush threshold 152 KiB
[    2.665174] TLB flush threshold set to 152 KiB

>>> Is there something wrong with SMP on panama?
>>> Oct  4 02:27:56 panama kernel: [    0.061736] smp: Bringing up secondary CPUs ...
>>> Oct  4 02:27:56 panama kernel: [    0.061897] smp: Brought up 3 nodes, 1 CPU
>> Will check tomorrow.

I think this is triggered by the 3 memory ranges which the firmware
reports on panama:

[    0.000000] Memory Ranges:
[    0.000000]  0) Start 0x0000000000000000 End 0x000000003fffffff Size 1024 MB
[    0.000000]  1) Start 0x0000000100000000 End 0x000000013fdfffff Size 1022 MB
[    0.000000]  2) Start 0x0000004040000000 End 0x00000040ffffffff Size 3072 MB
[    0.000000] Total Memory: 5118 MB

In arch/parisc/mm/init.c:296 we have:

	for (i = 0; i < npmem_ranges; i++) {
		node_set_state(i, N_NORMAL_MEMORY);
		node_set_online(i);
	}

Not sure if it's worth fixing... (or even needs fixing).

>>> I know replacing "sync and normal store" with ordered store in spin
>>> lock release makes a significant difference in the above timing.
>>> Plan to send patch tonight.
>> What exactly do you want me to test on panama?
> pdtlb versus pdtlb,l.

It seems pdtlb is very slow on panama. Check the numbers above.

Helge
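
[Editorial note: the per-page cost and flush thresholds quoted in the
logs above can be sanity-checked with a small back-of-the-envelope
script. This is an illustrative sketch, not kernel code: PAGE_SIZE =
4096 and the linear cost model are assumptions, and the kernel's own
threshold rounding differs slightly.]

```python
PAGE_SIZE = 4096  # assumed kernel PAGE_SIZE on these configs

def cycles_per_page(total_bytes, total_cycles):
    """Average cost of purging one 4 KiB page in a range flush."""
    return total_cycles / (total_bytes / PAGE_SIZE)

def flush_threshold_bytes(whole_flush_cycles, total_bytes, total_cycles):
    """Range size at which a whole-TLB flush becomes cheaper than
    flushing page by page (the kernel rounds this to whole pages)."""
    bytes_per_cycle = total_bytes / total_cycles
    return whole_flush_cycles * bytes_per_cycle

# panama (rp3410), Debian 4.18 kernel, broadcasting pdtlb:
# "Range flush 18874368 bytes 18987500 cycles"
print(round(cycles_per_page(18874368, 18987500)))  # -> 4121, matching
                                                   # the ~4042 cycles/page
                                                   # John reported

# panama WITH alternative patching (pdtlb,l):
# "Range flush 19169280 bytes 940355 cycles"
print(round(cycles_per_page(19169280, 940355)))    # -> 201, ~20x cheaper

# reproduces the rc8 "TLB flush threshold set to 8 KiB" line:
# whole flush 6189 cycles vs. "flushing 19197952 bytes 16779052 cycles"
t = flush_threshold_bytes(6189, 19197952, 16779052)
# ~6.9 KiB, which the kernel rounds up to 8 KiB (two pages)
```

The ~20x gap between the two panama runs is what makes the pdtlb vs.
pdtlb,l question above worth testing there.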