On 02.04.2021 16.45, Roger Heflin wrote:
> On Fri, Apr 2, 2021 at 4:13 AM Paul Menzel <pmenzel@xxxxxxxxxxxxx> wrote:
>>
>> Dear Linux folks,
>>
>>
>>> Are these values a good benchmark for comparing processors?
>>
>> After two years, yes they are. I created 16 10 GB files in `/dev/shm`,
>> set them up as loop devices, and created a RAID6. For resync speed it
>> makes a difference.
>>
>> 2 x AMD EPYC 7601 32-Core Processor:    34671K/sec
>> 2 x Intel Xeon Gold 6248 CPU @ 2.50GHz: 87533K/sec
>>
>> So, the current state of affairs seems to be that AVX512 instructions
>> do help for software RAIDs, if you want fast rebuild/resync times.
>> Getting, for example, a four-core/eight-thread Intel Xeon Gold 5222
>> might be useful.
>>
>> Now, the question remains whether AMD processors could make up for it
>> with higher performance or better optimized code, or if AVX512
>> instructions are a must,
>>
>> […]
>>
>>
>> Kind regards,
>>
>> Paul
>>
>>
>> PS: Here are the commands on the AMD EPYC system:
>>
>> ```
>> $ for i in $(seq 1 16); do truncate -s 10G /dev/shm/vdisk$i.img; done
>> $ for i in /dev/shm/v*.img; do sudo losetup --find --show $i; done
>> /dev/loop0
>> /dev/loop1
>> /dev/loop2
>> /dev/loop3
>> /dev/loop4
>> /dev/loop5
>> /dev/loop6
>> /dev/loop7
>> /dev/loop8
>> /dev/loop9
>> /dev/loop10
>> /dev/loop11
>> /dev/loop12
>> /dev/loop13
>> /dev/loop14
>> /dev/loop15
>> $ sudo mdadm --create /dev/md1 --level=6 --raid-devices=16
>>   /dev/loop{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
>> mdadm: Defaulting to version 1.2 metadata
>> mdadm: array /dev/md1 started.
>> $ more /proc/mdstat
>> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
>> [multipath]
>> md1 : active raid6 loop15[15] loop14[14] loop13[13] loop12[12]
>> loop11[11] loop10[10] loop9[9] loop8[8] loop7[7] loop6[6] loop5[5]
>> loop4[4] loop3[3] loop2[2] loop1[1] loop0[0]
>>       146671616 blocks super 1.2 level 6, 512k chunk, algorithm 2
>> [16/16] [UUUUUUUUUUUUUUUU]
>>       [>....................]  resync =  3.9% (416880/10476544)
>> finish=5.6min speed=29777K/sec
>>
>> unused devices: <none>
>> $ more /proc/mdstat
>> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
>> [multipath]
>> md1 : active raid6 loop15[15] loop14[14] loop13[13] loop12[12]
>> loop11[11] loop10[10] loop9[9] loop8[8] loop7[7] loop6[6] loop5[5]
>> loop4[4] loop3[3] loop2[2] loop1[1] loop0[0]
>>       146671616 blocks super 1.2 level 6, 512k chunk, algorithm 2
>> [16/16] [UUUUUUUUUUUUUUUU]
>>       [>....................]  resync =  4.1% (439872/10476544)
>> finish=5.3min speed=31419K/sec
>> $ sudo mdadm -S /dev/md1
>> mdadm: stopped /dev/md1
>> $ sudo losetup -D
>> $ sudo rm /dev/shm/vdisk*.img
>> ```
>
>
> I think you are testing something else. Your speeds are way below what
> the raw processor can do. You are probably testing memory speed/NUMA
> architecture differences between the two.
>
> On the Intel there are 2 NUMA nodes total with 4 channels each, so the
> system has 8 usable channels of bandwidth, but an allocation on a
> single NUMA node will only have 4 channels usable (DDR4-2933).
>
> On the EPYC there are 8 NUMA nodes with 2 channels each (DDR4-2666),
> so any single memory allocation will have only 2 channels available,
> and if the accesses go across the NUMA fabric they will be slower.
>
> So (4*2933)/(2*2666) = 2.20, and 2.20 * 34671 = 76286 (fairly close to
> your results).
>
> How the allocation for memory works depends a lot on how much RAM you
> actually have per NUMA node and how much for the whole machine.
> But any single block for any single device should be on a single NUMA
> node almost all of the time.
>
> You might want to drop the cache before the test, run numactl
> --hardware to see how much memory is free per NUMA node, then rerun
> the test, and at the end of the test, before the stop, run numactl
> --hardware again to see how it was spread across NUMA nodes. Even if
> it spreads across multiple NUMA nodes, that may well mean that in the
> EPYC case several of the main RAID processes are running against
> remote NUMA nodes, whereas Intel only has 2 nodes, so there is a
> decent chance it is only running on 1 most of the time (so no remote
> memory). I have also seen in benchmarks I have run on 2P and 4P Intel
> machines that interleaving a single-thread job on a 2P machine is
> faster than running on a single NUMA node's memory (with the process
> pinned to a single CPU on one of the NUMA nodes, memory interleaved
> over both), but on a 4P/4-NUMA-node machine interleaving slows it down
> significantly. And in the default case any single write/read of a
> block is likely only on a single NUMA node, so that specific
> read/write is constrained by a single NUMA node's bandwidth, giving an
> advantage to fewer, faster/bigger NUMA nodes and less remote memory.
>
> Outside of rebooting and forcing the entire machine to interleave I am
> not sure how to get shm to interleave. It might be a good enough test
> to just force the EPYC to interleave and see if the benchmark result
> changes in any way. If the result does change, repeat on the Intel.
> Overall, the RAID would not be able to use very many CPUs anyway, so a
> bigger machine with more NUMA nodes may slow down the overall rate.

I don't think it's a memory issue. I can read from a similar /dev/shm
setup at ~20 GB/s on a single EPYC Rome. I've also experienced slow sync
behavior on otherwise idle CentOS 7/8 systems; I even tried
kernel-lt.x86_64 5.4.95-1.el8.elrepo. Setting speed_limit_min helps, but
that should not be needed when the system is not doing other I/O or
compute. I first noticed this with a RAID6 of SAS3 HDDs being far from
fast enough for sync. Writes were also very bad, with less than 1 GB/s
for 10-18 disk sets, no matter how many writers.
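For reference, roughly what I mean by the sync throttling knob, plus one
way to try the interleave test suggested above. This is just a sketch:
the mount point and the values are examples, not tuned recommendations.

```
# md sync throttling sysctls (KB/s per device); raising the minimum
# works around the slow resync, but should not be needed on an idle box.
$ sudo sysctl dev.raid.speed_limit_min
$ sudo sysctl -w dev.raid.speed_limit_min=200000   # example value

# One way to get interleaved pages for the backing files: drop caches,
# check free memory per NUMA node, and put the files on a tmpfs mounted
# with an interleaved memory policy (mpol is a tmpfs mount option).
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ numactl --hardware
$ sudo mkdir -p /mnt/itest                         # example mount point
$ sudo mount -t tmpfs -o size=200G,mpol=interleave tmpfs /mnt/itest
$ for i in $(seq 1 16); do truncate -s 10G /mnt/itest/vdisk$i.img; done
```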
With 16 writers on the described loop setup I get this 'perf top' output
(similar on HDDs):

  33.92%  [kernel]  [k] native_queued_spin_lock_slowpath
   7.05%  [kernel]  [k] async_copy_data.isra.61
   5.81%  [kernel]  [k] memcpy
   2.44%  [kernel]  [k] read_tsc
   1.65%  [kernel]  [k] analyse_stripe
   1.64%  [kernel]  [k] native_sched_clock
   1.32%  [kernel]  [k] raid6_avx22_gen_syndrome
   1.14%  [kernel]  [k] generic_make_request_checks
   1.11%  [kernel]  [k] _raw_spin_unlock_irqrestore
   1.07%  [kernel]  [k] native_irq_return_iret
   1.06%  [kernel]  [k] add_stripe_bio
   1.03%  [kernel]  [k] raid5_compute_blocknr
   0.99%  [kernel]  [k] _raw_spin_lock_irq
   0.88%  [kernel]  [k] raid5_compute_sector
   0.82%  [kernel]  [k] select_task_rq_fair
   0.81%  [kernel]  [k] _raw_spin_lock_irqsave
   0.71%  [kernel]  [k] blk_mq_make_request
   0.70%  [kernel]  [k] raid5_get_active_stripe
   0.68%  [kernel]  [k] bio_reset
   0.67%  [kernel]  [k] percpu_counter_add_batch
   0.63%  [kernel]  [k] ktime_get
   0.61%  [kernel]  [k] llist_reverse_order
   0.59%  [kernel]  [k] ops_run_io
   0.59%  [kernel]  [k] release_stripe_plug
   0.50%  [kernel]  [k] raid5_make_request
   0.48%  [kernel]  [k] raid5_release_stripe
   0.45%  [kernel]  [k] loop_queue_work
   0.44%  [kernel]  [k] llist_add_batch
   0.42%  [kernel]  [k] sched_clock_cpu
   0.41%  [kernel]  [k] blk_mq_dispatch_rq_list
   0.39%  [kernel]  [k] default_send_IPI_single_phys
   0.38%  [kernel]  [k] do_iter_readv_writev
   0.38%  [kernel]  [k] bio_endio
   0.37%  [kernel]  [k] md_write_inc

I'm not sure if it is expected that 'native_queued_spin_lock_slowpath'
is so dominant. It seems to increase when adding more and more writers.
BTW, RAID0 does not have this issue.

Cheers,
Hans Henrik
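PS: In case anyone wants to dig into the lock contention, a possible
follow-up (just a sketch, not something I have run here; exact options
and output will vary by kernel and perf version) would be to record
call graphs while the writers run and then filter the report down to
the contended symbol:

```
# System-wide profile with call graphs for 30 seconds under load,
# then show only the spin lock slow path entries and their call chains.
$ sudo perf record -a -g -- sleep 30
$ sudo perf report --no-children --symbols=native_queued_spin_lock_slowpath
```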