Re: OT: Processor recommendation for RAID6

On Tue, 6 Apr 2021 23:05:46 +0200
Hans Henrik Happe <happe@xxxxxx> wrote:

> On 02.04.2021 16.45, Roger Heflin wrote:
> > On Fri, Apr 2, 2021 at 4:13 AM Paul Menzel <pmenzel@xxxxxxxxxxxxx> wrote:  
> >>
> >> Dear Linux folks,
> >>
> >>  
> >   
> >>> Are these values a good benchmark for comparing processors?  
> >>
> >> After two years, yes they are. I created 16 10 GB files in `/dev/shm`,
> >> set them up as loop devices, and created a RAID6. For resync speed it
> >> makes a difference.
> >>
> >> 2 x AMD EPYC 7601 32-Core Processor:    34671K/sec
> >> 2 x Intel Xeon Gold 6248 CPU @ 2.50GHz: 87533K/sec
> >>
> >> So the current state of affairs seems to be that AVX512 instructions
> >> do help for software RAIDs if you want fast rebuild/resync times.
> >> Getting, for example, a four-core/eight-thread Intel Xeon Gold 5222
> >> might be useful.
> >>
> >> Now the question remains whether AMD processors could make up for it
> >> with higher performance or better optimized code, or whether AVX512
> >> instructions are a must,
> >>
> >> […]
> >>
> >>
> >> Kind regards,
> >>
> >> Paul
> >>
> >>
> >> PS: Here are the commands on the AMD EPYC system:
> >>
> >> ```
> >> $ for i in $(seq 1 16); do truncate -s 10G /dev/shm/vdisk$i.img; done
> >> $ for i in /dev/shm/v*.img; do sudo losetup --find --show $i; done
> >> /dev/loop0
> >> /dev/loop1
> >> /dev/loop2
> >> /dev/loop3
> >> /dev/loop4
> >> /dev/loop5
> >> /dev/loop6
> >> /dev/loop7
> >> /dev/loop8
> >> /dev/loop9
> >> /dev/loop10
> >> /dev/loop11
> >> /dev/loop12
> >> /dev/loop13
> >> /dev/loop14
> >> /dev/loop15
> >> $ sudo mdadm --create /dev/md1 --level=6 --raid-devices=16
> >> /dev/loop{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
> >> mdadm: Defaulting to version 1.2 metadata
> >> mdadm: array /dev/md1 started.
> >> $ more /proc/mdstat
> >> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> >> [multipath]
> >> md1 : active raid6 loop15[15] loop14[14] loop13[13] loop12[12]
> >> loop11[11] loop10[10] loop9[9] loop8[8] loop7[7] loop6[6] loop5[5]
> >> loop4[4] loop3[3] loop2[2] loop1[1] loop0[0]
> >>        146671616 blocks super 1.2 level 6, 512k chunk, algorithm 2
> >> [16/16] [UUUUUUUUUUUUUUUU]
> >>        [>....................]  resync =  3.9% (416880/10476544)
> >> finish=5.6min speed=29777K/sec
> >>
> >> unused devices: <none>
> >> $ more /proc/mdstat
> >> Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
> >> [multipath]
> >> md1 : active raid6 loop15[15] loop14[14] loop13[13] loop12[12]
> >> loop11[11] loop10[10] loop9[9] loop8[8] loop7[7] loop6[6] loop5[5]
> >> loop4[4] loop3[3] loop2[2] loop1[1] loop0[0]
> >>        146671616 blocks super 1.2 level 6, 512k chunk, algorithm 2
> >> [16/16] [UUUUUUUUUUUUUUUU]  
> >>        [>....................]  resync =  4.1% (439872/10476544)
> >> finish=5.3min speed=31419K/sec
> >> $ sudo mdadm -S /dev/md1
> >> mdadm: stopped /dev/md1
> >> $ sudo losetup -D
> >> $ sudo rm /dev/shm/vdisk*.img  
> > 
> > 
> > I think you are testing something else.  Your speeds are way below
> > what the raw processor can do. You are probably testing memory
> > speed/numa architecture differences between the two systems.
> > 
> > On the intel arch there are 2 numa nodes total with 4 channels each, so
> > the system has 8 usable channels of bandwidth, but an allocation on a
> > single numa node will only have 4 channels usable (ddr4-2933).
> >
> > On the epyc there are 8 numa nodes with 2 channels each (ddr4-2666),
> > so any single memory allocation will have only 2 channels available,
> > and accesses that cross the numa interconnect will be slower.
> > 
> > So (4*2933)/(2*2666) = 2.20, and 2.20 * 34671 = 76286 (fairly close to
> > your results).
> > 
> > How the allocation for memory works depends a lot on how much ram you
> > actually have per numa node and how much for the whole machine.  But
> > any single block for any single device should be on a single numa node
> > almost all of the time.
> > 
> > You might want to drop the caches before the test, run numactl
> > --hardware to see how much memory is free per numa node, then rerun
> > the test, and at the end of the test (before stopping the array) run
> > numactl --hardware again to see how the memory was spread across numa
> > nodes.  Even if it is spread across multiple numa nodes, in the epyc
> > case that may well mean the main raid processes are running against
> > remote numa nodes; because intel only has 2 nodes, there is a decent
> > chance it is only running on 1 most of the time (so no remote memory).
> > I have also seen in benchmarks on 2P and 4P intel machines that a
> > single-threaded job with interleaved memory on a 2P system is faster
> > than one using a single numa node's memory (with the process pinned to
> > a single cpu on one of the numa nodes and memory interleaved over
> > both), but on a 4P/4-numa-node machine interleaving slows it down
> > significantly.  And in the default case any single read/write of a
> > block is likely on a single numa node, so that specific read/write is
> > constrained by a single numa node's bandwidth, giving an advantage to
> > fewer, faster/bigger numa nodes and less remote memory.
> > 
> > Outside of rebooting and forcing the entire machine to interleave, I am
> > not sure how to get shm to interleave.  It might be a good enough test
> > to just force the epyc to interleave and see if the benchmark result
> > changes in any way.  If the result does change, repeat it on the intel.
> > Overall the raid would not be able to use very many cpus anyway, so a
> > bigger machine with more numa nodes may slow down the overall rate.
> 
> 
> I don't think it's a memory issue. I can read from a similar /dev/shm
> setup at ~20GB/s on a single EPYC Rome.
> 
> I've also experienced slow sync behavior on otherwise idle CentOS7/8
> systems. Even tried kernel-lt.x86_64 5.4.95-1.el8.elrepo. Setting
> speed_limit_min helps, but that should not be needed when the system is
> not doing other I/O or compute.
> 
> I first noticed this with a RAID6 of SAS3 HDDs, where sync was far from
> fast enough. Writes were also very bad, with less than 1GB/s for
> 10-18 disk sets, no matter how many writers.
> 
> With 16 writers on the described loop setup I get this 'perf top' output
> (similar on HDDs):
> 
>   33.92%  [kernel]                  [k] native_queued_spin_lock_slowpath
>    7.05%  [kernel]                  [k] async_copy_data.isra.61
>    5.81%  [kernel]                  [k] memcpy
>    2.44%  [kernel]                  [k] read_tsc
>    1.65%  [kernel]                  [k] analyse_stripe
>    1.64%  [kernel]                  [k] native_sched_clock
>    1.32%  [kernel]                  [k] raid6_avx22_gen_syndrome
>    1.14%  [kernel]                  [k] generic_make_request_checks
>    1.11%  [kernel]                  [k] _raw_spin_unlock_irqrestore
>    1.07%  [kernel]                  [k] native_irq_return_iret
>    1.06%  [kernel]                  [k] add_stripe_bio
>    1.03%  [kernel]                  [k] raid5_compute_blocknr
>    0.99%  [kernel]                  [k] _raw_spin_lock_irq
>    0.88%  [kernel]                  [k] raid5_compute_sector
>    0.82%  [kernel]                  [k] select_task_rq_fair
>    0.81%  [kernel]                  [k] _raw_spin_lock_irqsave
>    0.71%  [kernel]                  [k] blk_mq_make_request
>    0.70%  [kernel]                  [k] raid5_get_active_stripe
>    0.68%  [kernel]                  [k] bio_reset
>    0.67%  [kernel]                  [k] percpu_counter_add_batch
>    0.63%  [kernel]                  [k] ktime_get
>    0.61%  [kernel]                  [k] llist_reverse_order
>    0.59%  [kernel]                  [k] ops_run_io
>    0.59%  [kernel]                  [k] release_stripe_plug
>    0.50%  [kernel]                  [k] raid5_make_request
>    0.48%  [kernel]                  [k] raid5_release_stripe
>    0.45%  [kernel]                  [k] loop_queue_work
>    0.44%  [kernel]                  [k] llist_add_batch
>    0.42%  [kernel]                  [k] sched_clock_cpu
>    0.41%  [kernel]                  [k] blk_mq_dispatch_rq_list
>    0.39%  [kernel]                  [k] default_send_IPI_single_phys
>    0.38%  [kernel]                  [k] do_iter_readv_writev
>    0.38%  [kernel]                  [k] bio_endio
>    0.37%  [kernel]                  [k] md_write_inc
> 
> I'm not sure if it is expected that 'native_queued_spin_lock_slowpath'
> is so dominant. It seems to increase as more and more writers are added.
> 
> BTW RAID0 does not have this issue.
> 
> Cheers,
> Hans Henrik

Hey,

I'm not sure if this helps much with your processor dilemma, but I'd like to confirm your findings:
I spent a while investigating raid6 (and raid5) performance recently; that spinlock is indeed the bottleneck, especially once you go beyond 8 cpu cores.
I tested some io workloads and found a limit of ~1.6M iops for rand-read-16k, ~100k iops for rand-write-4k, and ~1-2GB/s throughput on sequential writes (e.g. 128k blocks).
* For some reason I get only ~1.2M iops running rand-read-4k.
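
For reference, fio invocations roughly like the ones below should approximate those workloads; the device path, iodepth, and job counts are placeholders rather than the exact job files I used:

```
$ # rand-read-16k (placeholder parameters; tune numjobs/iodepth to your setup)
$ sudo fio --name=randread16k --filename=/dev/md1 --direct=1 --ioengine=libaio \
      --rw=randread --bs=16k --iodepth=64 --numjobs=16 --runtime=60 \
      --time_based --group_reporting
$ # rand-write-4k
$ sudo fio --name=randwrite4k --filename=/dev/md1 --direct=1 --ioengine=libaio \
      --rw=randwrite --bs=4k --iodepth=64 --numjobs=16 --runtime=60 \
      --time_based --group_reporting
$ # sequential 128k writes
$ sudo fio --name=seqwrite128k --filename=/dev/md1 --direct=1 --ioengine=libaio \
      --rw=write --bs=128k --iodepth=32 --numjobs=4 --runtime=60 \
      --time_based --group_reporting
```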

The bottleneck originates in the device_lock, which clearly dominates the code - even in code paths that could, in principle, run in parallel.
I have actually spent some time "defusing" that lock, and I have a p.o.c. with ~1.9M/250k iops (rand 4k read/write) and ~25GB/s throughput on rand-read-16k.
I'm working on making this p.o.c. ready for sharing here, but meanwhile feel free to contact me privately for an unstable version if you find it useful for your needs.
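
If you want to see that contention on your own systems, kernel lock statistics make it fairly obvious; a minimal sketch, assuming a kernel built with CONFIG_LOCK_STAT=y:

```
$ # enable lock statistics collection (requires CONFIG_LOCK_STAT=y)
$ echo 1 | sudo tee /proc/sys/kernel/lock_stat
$ # ... run the write workload against the md device ...
$ # show the contention entries for the raid5/6 per-array device_lock
$ sudo grep -A 2 device_lock /proc/lock_stat
$ echo 0 | sudo tee /proc/sys/kernel/lock_stat
```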

Will update soon anyway.
Cheers,
Gal Ofri


