Hi Thomas,

On Fri, 08 Dec 2023 12:52:49 +0100, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:

> > Without PIR copy:
> >
> > DMA memfill bandwidth: 4.944 Gbps
> > Performance counter stats for './run_intr.sh 512 30':
> >
> >     77,313,298,506      L1-dcache-loads                                              (79.98%)
> >          8,279,458      L1-dcache-load-misses  #  0.01% of all L1-dcache accesses    (80.03%)
> >     41,654,221,245      L1-dcache-stores                                             (80.01%)
> >             10,476      LLC-load-misses        #  0.31% of all LL-cache accesses     (79.99%)
> >          3,332,748      LLC-loads                                                    (80.00%)
> >
> >       30.212055434 seconds time elapsed
> >
> >        0.002149000 seconds user
> >       30.183292000 seconds sys
> >
> >
> > With PIR copy:
> > DMA memfill bandwidth: 5.029 Gbps
> > Performance counter stats for './run_intr.sh 512 30':
> >
> >     78,327,247,423      L1-dcache-loads                                              (80.01%)
> >          7,762,311      L1-dcache-load-misses  #  0.01% of all L1-dcache accesses    (80.01%)
> >     42,203,221,466      L1-dcache-stores                                             (79.99%)
> >             23,691      LLC-load-misses        #  0.67% of all LL-cache accesses     (80.01%)
> >          3,561,890      LLC-loads                                                    (80.00%)
> >
> >       30.201065706 seconds time elapsed
> >
> >        0.005950000 seconds user
> >       30.167885000 seconds sys
>
> Interesting, though I'm not really convinced that this DMA memfill
> microbenchmark resembles real work loads.
>
> Did you test with something realistic, e.g. storage or networking, too?

I ran the following FIO test on NVMe drives and am not seeing any meaningful
difference in IOPS between the two implementations.

Here are my setup and results with 4 NVMe drives connected to an x16 PCIe slot:

+-[0000:62]-
|   +-01.0-[63]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM174X
|   +-03.0-[64]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM174X
|   +-05.0-[65]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM174X
|   \-07.0-[66]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller PM174X

libaio, no PIR_COPY
======================================
fio-3.35
Starting 512 processes
Jobs: 512 (f=512): [r(512)][100.0%][r=32.2GiB/s][r=8445k IOPS][eta 00m:00s]
disk_nvme6n1_thread_1: (groupid=0, jobs=512): err= 0: pid=31559: Mon Jan  8 21:49:22 2024
  read: IOPS=8419k, BW=32.1GiB/s (34.5GB/s)(964GiB/30006msec)
    slat (nsec): min=1325, max=115807k, avg=42368.34, stdev=1517031.57
    clat (usec): min=2, max=499085, avg=15139.97, stdev=25682.25
     lat (usec): min=68, max=499089, avg=15182.33, stdev=25709.81
    clat percentiles (usec):
     |  1.00th=[   734],  5.00th=[   783], 10.00th=[   816], 20.00th=[   857],
     | 30.00th=[   906], 40.00th=[   971], 50.00th=[  1074], 60.00th=[  1369],
     | 70.00th=[ 13042], 80.00th=[ 19792], 90.00th=[ 76022], 95.00th=[ 76022],
     | 99.00th=[ 77071], 99.50th=[ 81265], 99.90th=[ 85459], 99.95th=[ 91751],
     | 99.99th=[200279]
   bw (  MiB/s): min=18109, max=51859, per=100.00%, avg=32965.98, stdev=16.88, samples=14839
   iops        : min=4633413, max=13281470, avg=8439278.47, stdev=4324.70, samples=14839
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
  lat (usec)   : 250=0.01%, 500=0.01%, 750=1.84%, 1000=41.96%
  lat (msec)   : 2=18.37%, 4=0.20%, 10=3.88%, 20=13.95%, 50=5.42%
  lat (msec)   : 100=14.33%, 250=0.02%, 500=0.01%
  cpu          : usr=1.16%, sys=3.54%, ctx=4932752, majf=0, minf=192764
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=252616589,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=32.1GiB/s (34.5GB/s), 32.1GiB/s-32.1GiB/s (34.5GB/s-34.5GB/s), io=964GiB (1035GB), run=30006-30006msec

Disk stats (read/write):
  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=96.31%
  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=97.15%
  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=98.06%
  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=98.94%

 Performance counter stats for 'system wide':

    22,985,903,515      L1-dcache-load-misses                                    (42.86%)
    22,989,992,126      L1-dcache-load-misses                                    (57.14%)
   751,228,710,993      L1-dcache-stores                                         (57.14%)
       465,033,820      LLC-load-misses   #  18.27% of all LL-cache accesses     (57.15%)
     2,545,570,669      LLC-loads                                                (57.14%)
     1,058,582,881      LLC-stores                                               (28.57%)
       326,135,823      LLC-store-misses                                         (28.57%)

      32.045718194 seconds time elapsed

-------------------------------------------
libaio with PIR_COPY
-------------------------------------------
fio-3.35
Starting 512 processes
Jobs: 512 (f=512): [r(512)][100.0%][r=32.2GiB/s][r=8445k IOPS][eta 00m:00s]
disk_nvme6n1_thread_1: (groupid=0, jobs=512): err= 0: pid=5103: Mon Jan  8 23:12:12 2024
  read: IOPS=8420k, BW=32.1GiB/s (34.5GB/s)(964GiB/30011msec)
    slat (nsec): min=1339, max=97021k, avg=42447.84, stdev=1442726.09
    clat (usec): min=2, max=369410, avg=14820.01, stdev=24112.59
     lat (usec): min=69, max=369412, avg=14862.46, stdev=24139.33
    clat percentiles (usec):
     |  1.00th=[   717],  5.00th=[   783], 10.00th=[   824], 20.00th=[   873],
     | 30.00th=[   930], 40.00th=[  1012], 50.00th=[  1172], 60.00th=[  8094],
     | 70.00th=[ 14222], 80.00th=[ 18744], 90.00th=[ 76022], 95.00th=[ 76022],
     | 99.00th=[ 76022], 99.50th=[ 78119], 99.90th=[ 81265], 99.95th=[ 81265],
     | 99.99th=[135267]
   bw (  MiB/s): min=19552, max=62819, per=100.00%, avg=33774.56, stdev=31.02, samples=14540
   iops        : min=5005807, max=16089892, avg=8646500.17, stdev=7944.42, samples=14540
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
  lat (usec)   : 250=0.01%, 500=0.01%, 750=2.50%, 1000=36.41%
  lat (msec)   : 2=17.39%, 4=0.27%, 10=5.83%, 20=18.94%, 50=5.59%
  lat (msec)   : 100=13.06%, 250=0.01%, 500=0.01%
  cpu          : usr=1.20%, sys=3.74%, ctx=6758326, majf=0, minf=193128
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=252677827,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=32.1GiB/s (34.5GB/s), 32.1GiB/s-32.1GiB/s (34.5GB/s-34.5GB/s), io=964GiB (1035GB), run=30011-30011msec

Disk stats (read/write):
  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=96.36%
  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=97.18%
  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=98.08%
  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=98.96%

 Performance counter stats for 'system wide':

    24,762,800,042      L1-dcache-load-misses                                    (42.86%)
    24,764,415,765      L1-dcache-load-misses                                    (57.14%)
   756,096,467,595      L1-dcache-stores                                         (57.14%)
       483,611,270      LLC-load-misses   #  16.21% of all LL-cache accesses     (57.14%)
     2,982,610,898      LLC-loads                                                (57.14%)
     1,283,077,818      LLC-stores                                               (28.57%)
       313,253,711      LLC-store-misses                                         (28.57%)

      32.059810215 seconds time elapsed

Thanks,

Jacob
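
As a quick cross-check of the two fio/perf runs above, the headline figures can be reduced to relative deltas and LLC load-miss rates. The following is only a minimal sketch in Python: the dictionary layout and the helper names (pct_change, llc_miss_rate) are illustrative assumptions, and the numbers are copied from the reported output rather than re-measured.

```python
# Figures copied from the two reported runs; names and layout are
# illustrative assumptions, not part of the original test scripts.
runs = {
    "no PIR_COPY": {
        "iops": 8_419_000,              # fio "read: IOPS=8419k"
        "clat_avg_us": 15139.97,        # fio "clat (usec): avg="
        "llc_load_misses": 465_033_820,
        "llc_loads": 2_545_570_669,
    },
    "PIR_COPY": {
        "iops": 8_420_000,              # fio "read: IOPS=8420k"
        "clat_avg_us": 14820.01,
        "llc_load_misses": 483_611_270,
        "llc_loads": 2_982_610_898,
    },
}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def llc_miss_rate(run):
    """LLC load misses as a percentage of LLC loads."""
    return run["llc_load_misses"] / run["llc_loads"] * 100.0


base, pir = runs["no PIR_COPY"], runs["PIR_COPY"]

print(f"IOPS change:          {pct_change(base['iops'], pir['iops']):+.3f}%")
print(f"avg clat change:      {pct_change(base['clat_avg_us'], pir['clat_avg_us']):+.2f}%")
print(f"LLC miss rate (base): {llc_miss_rate(base):.2f}%")
print(f"LLC miss rate (PIR):  {llc_miss_rate(pir):.2f}%")
```

The miss rates computed this way match the percentages perf printed (18.27% and 16.21%). The perf counters were multiplexed, as the trailing percentages on each counter line show, so small differences between the runs should be read with caution.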