I will be glad to test any mdadm patches. I believe that mdadm should be
improved to match the performance of the latest Gen5 NVMe SSDs.

Anton

On Sat, 15 Feb 2025 at 11:25, Shushu Yi <firnyee@xxxxxxxxx> wrote:
>
> Hi everyone,
>
> Thanks very much for testing. Is there anything I can do to forward this patch?
>
> Best,
> Shushu Yi
>
> On Mon, 27 Jan 2025 at 20:33, Anton Gavriliuk <antosha20xx@xxxxxxxxx> wrote:
>>
>> > Yes, group_thread_cnt sometimes (usually?) causes more lock
>> > contention and lower performance.
>>
>> [root@memverge2 anton]# /home/anton/mdadm/mdadm --version
>> mdadm - v4.4-15-g21e4efb1 - 2025-01-27
>>
>> /home/anton/mdadm/mdadm --create --verbose /dev/md0 --level=5
>> --raid-devices=4 /dev/nvme0n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1
>>
>> Without group_thread_cnt (with sync_speed_max=3600000 only) it is not
>> 1.4 GB/s but roughly 1 GB/s:
>>
>> [root@memverge2 anton]# cat /proc/mdstat
>> Personalities : [raid6] [raid5] [raid4]
>> md0 : active raid5 nvme4n1[4] nvme3n1[2] nvme2n1[1] nvme0n1[0]
>>       4688044032 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
>>       [>....................]  recovery =  3.2% (50134272/1562681344) finish=24.5min speed=1027434K/sec
>>       bitmap: 0/12 pages [0KB], 65536KB chunk
>>
>> > Could you please run perf-report on the perf.data? I won't be
>> > able to see all the symbols without your kernel.
>>
>> [root@memverge2 anton]# perf record -g
>> ^C[ perf record: Woken up 36 times to write data ]
>> [ perf record: Captured and wrote 9.632 MB perf.data (61989 samples) ]
>>
>> [root@memverge2 anton]# perf report
>>
>> Samples: 61K of event 'cycles:P', Event count (approx.): 61959249145
>>   Children      Self  Command      Shared Object      Symbol
>> +   79.59%     0.00%  md0_raid5    [kernel.kallsyms]  [k] ret_from_fork_asm
>> +   79.59%     0.00%  md0_raid5    [kernel.kallsyms]  [k] ret_from_fork
>> +   79.59%     0.00%  md0_raid5    [kernel.kallsyms]  [k] kthread
>> +   79.59%     0.00%  md0_raid5    [kernel.kallsyms]  [k] md_thread
>> +   79.59%     0.06%  md0_raid5    [kernel.kallsyms]  [k] raid5d
>> +   74.47%     0.67%  md0_raid5    [kernel.kallsyms]  [k] handle_active_stripes.isra
>> +   68.27%     4.84%  md0_raid5    [kernel.kallsyms]  [k] handle_stripe
>> +   27.47%     0.11%  md0_raid5    [kernel.kallsyms]  [k] raid_run_ops
>> +   27.36%     0.25%  md0_raid5    [kernel.kallsyms]  [k] ops_run_compute5
>> +   27.10%     0.07%  md0_raid5    [kernel.kallsyms]  [k] async_xor_offs
>> +   26.42%     0.16%  md0_raid5    [kernel.kallsyms]  [k] do_sync_xor_offs
>> +   21.94%     7.87%  md0_raid5    [kernel.kallsyms]  [k] ops_run_io
>> +   19.34%    18.19%  md0_raid5    [kernel.kallsyms]  [k] xor_avx_4
>> +   13.35%     0.00%  md0_resync   [kernel.kallsyms]  [k] ret_from_fork_asm
>> +   13.35%     0.00%  md0_resync   [kernel.kallsyms]  [k] ret_from_fork
>> +   13.35%     0.00%  md0_resync   [kernel.kallsyms]  [k] kthread
>> +   13.35%     0.00%  md0_resync   [kernel.kallsyms]  [k] md_thread
>> +   13.35%     0.55%  md0_resync   [kernel.kallsyms]  [k] md_do_sync.cold
>> +   12.41%     0.69%  md0_resync   [kernel.kallsyms]  [k] raid5_sync_request
>> +   12.18%     0.35%  md0_raid5    [kernel.kallsyms]  [k] submit_bio_noacct_nocheck
>> +   11.67%     0.54%  md0_raid5    [kernel.kallsyms]  [k] __submit_bio
>> +   11.06%     0.79%  md0_raid5    [kernel.kallsyms]  [k] blk_mq_submit_bio
>> +   10.76%     9.83%  md0_raid5    [kernel.kallsyms]  [k] analyse_stripe
>> +   10.46%     0.29%  md0_resync   [kernel.kallsyms]  [k] raid5_get_active_stripe
>> +    6.84%     6.49%  md0_raid5    [kernel.kallsyms]  [k] memset_orig
>> +    6.59%     0.00%  swapper      [kernel.kallsyms]  [k] common_startup_64
>> +    6.59%     0.01%  swapper      [kernel.kallsyms]  [k] cpu_startup_entry
>> +    6.58%     0.03%  swapper      [kernel.kallsyms]  [k] do_idle
>> +    6.44%     0.00%  swapper      [kernel.kallsyms]  [k] start_secondary
>> +    5.55%     0.01%  md0_raid5    [kernel.kallsyms]  [k] asm_common_interrupt
>> +    5.53%     0.01%  md0_raid5    [kernel.kallsyms]  [k] common_interrupt
>> +    5.45%     0.01%  md0_raid5    [kernel.kallsyms]  [k] blk_add_rq_to_plug
>> +    5.44%     0.02%  swapper      [kernel.kallsyms]  [k] cpuidle_idle_call
>> +    5.44%     0.01%  md0_raid5    [kernel.kallsyms]  [k] blk_mq_flush_plug_list
>> +    5.43%     0.17%  md0_raid5    [kernel.kallsyms]  [k] blk_mq_dispatch_plug_list
>> +    5.41%     0.01%  md0_raid5    [kernel.kallsyms]  [k] __common_interrupt
>> +    5.40%     0.03%  md0_raid5    [kernel.kallsyms]  [k] handle_edge_irq
>> +    5.32%     0.01%  md0_raid5    [kernel.kallsyms]  [k] handle_irq_event
>> +    5.25%     0.01%  md0_raid5    [kernel.kallsyms]  [k] __handle_irq_event_percpu
>> +    5.25%     0.01%  md0_raid5    [kernel.kallsyms]  [k] nvme_irq
>> +    5.18%     0.14%  md0_raid5    [kernel.kallsyms]  [k] blk_mq_insert_requests
>> +    5.15%     0.00%  swapper      [kernel.kallsyms]  [k] cpuidle_enter
>> +    5.15%     0.03%  swapper      [kernel.kallsyms]  [k] cpuidle_enter_state
>> +    5.05%     1.29%  md0_raid5    [kernel.kallsyms]  [k] release_stripe_list
>> +    5.00%     0.01%  md0_raid5    [kernel.kallsyms]  [k] blk_mq_try_issue_list_dir
>> +    4.98%     0.00%  md0_raid5    [kernel.kallsyms]  [k] __blk_mq_issue_directly
>> +    4.97%     0.03%  md0_raid5    [kernel.kallsyms]  [k] nvme_queue_rq
>> +    4.86%     1.03%  md0_resync   [kernel.kallsyms]  [k] init_stripe
>> +    4.80%     0.07%  md0_raid5    [kernel.kallsyms]  [k] nvme_prep_rq.part.0
>> +    4.57%     0.03%  md0_raid5    [kernel.kallsyms]  [k] nvme_map_data
>> +    4.17%     0.17%  md0_raid5    [kernel.kallsyms]  [k] blk_mq_end_request_batch
>> +    3.52%     3.49%  swapper      [kernel.kallsyms]  [k] poll_idle
>> +    3.51%     1.67%  md0_resync   [kernel.kallsyms]  [k] raid5_compute_blocknr
>> +    3.21%     0.47%  md0_raid5    [kernel.kallsyms]  [k] blk_attempt_plug_merge
>> +    3.13%     3.13%  md0_resync   [kernel.kallsyms]  [k] find_get_stripe
>> +    3.01%     0.01%  md0_raid5    [kernel.kallsyms]  [k] __dma_map_sg_attrs
>> +    3.01%     0.00%  md0_raid5    [kernel.kallsyms]  [k] dma_map_sgtable
>> +    3.00%     2.37%  md0_raid5    [kernel.kallsyms]  [k] dma_direct_map_sg
>> +    2.59%     2.13%  md0_raid5    [kernel.kallsyms]  [k] do_release_stripe
>> +    2.45%     1.45%  md0_raid5    [kernel.kallsyms]  [k] __get_priority_stripe
>> +    2.15%     2.14%  md0_resync   [kernel.kallsyms]  [k] raid5_compute_sector
>> +    2.14%     0.35%  md0_raid5    [kernel.kallsyms]  [k] bio_attempt_back_merge
>> +    1.99%     1.93%  md0_raid5    [kernel.kallsyms]  [k] raid5_end_read_request
>> +    1.38%     1.00%  md0_raid5    [kernel.kallsyms]  [k] fetch_block
>> +    1.35%     0.14%  md0_raid5    [kernel.kallsyms]  [k] release_inactive_stripe_l
>> +    1.35%     0.40%  md0_raid5    [kernel.kallsyms]  [k] ll_back_merge_fn
>> +    1.27%     1.25%  md0_resync   [kernel.kallsyms]  [k] _raw_spin_lock_irq
>> +    1.11%     1.11%  md0_raid5    [kernel.kallsyms]  [k] llist_reverse_order
>> +    1.10%     0.70%  md0_raid5    [kernel.kallsyms]  [k] raid5_release_stripe
>> +    1.05%     1.00%  md0_raid5    [kernel.kallsyms]  [k] md_wakeup_thread
>> +    1.02%     0.97%  swapper      [kernel.kallsyms]  [k] intel_idle_irq
>> +    0.99%     0.01%  md0_raid5    [kernel.kallsyms]  [k] __blk_rq_map_sg
>> +    0.97%     0.84%  md0_raid5    [kernel.kallsyms]  [k] __blk_bios_map_sg
>> +    0.91%     0.04%  md0_raid5    [kernel.kallsyms]  [k] __wake_up
>> +    0.84%     0.74%  md0_raid5    [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>>
>> Anton
>>
>> On Fri, 24 Jan 2025 at 23:48, Song Liu <song@xxxxxxxxxx> wrote:
>> >
>> > On Fri, Jan 24, 2025 at 12:00 AM Anton Gavriliuk <antosha20xx@xxxxxxxxx> wrote:
>> > >
>> > > > We need major work to make it faster so that we can keep up with
>> > > > the speed of modern SSDs.
>> > >
>> > > Glad to know that this is on your roadmap.
>> > > This is very important for storage server solutions, where you can add
>> > > tens of Gen 4/5 NVMe SSDs in a 2U server.
>> > > I'm not a developer, but I can assist you with the testing as much as required.
>> > >
>> > > > Could you please do a perf-record with '-g' so that we can see
>> > > > which call paths hit the lock contention? This will help us
>> > > > understand whether Shushu's bitmap optimization can help.
>> > >
>> > > Default raid5 build (recovery) performance:
>> > >
>> > > [root@memverge2 ~]# cat /proc/mdstat
>> > > Personalities : [raid6] [raid5] [raid4]
>> > > md0 : active raid5 nvme0n1[4] nvme2n1[2] nvme3n1[1] nvme4n1[0]
>> > >       4688044032 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
>> > >       [>....................]  recovery =  0.3% (5601408/1562681344) finish=125.0min speed=207459K/sec
>> > >       bitmap: 0/12 pages [0KB], 65536KB chunk
>> > >
>> > > After setting:
>> > >
>> > > [root@memverge2 md]# echo 8 > group_thread_cnt
>> >
>> > Yes, group_thread_cnt sometimes (usually?) causes more lock
>> > contention and lower performance.
>> >
>> > > [root@memverge2 md]# echo 3600000 > sync_speed_max
>> > >
>> > > [root@memverge2 ~]# cat /proc/mdstat
>> > > Personalities : [raid6] [raid5] [raid4]
>> > > md0 : active raid5 nvme0n1[4] nvme2n1[2] nvme3n1[1] nvme4n1[0]
>> > >       4688044032 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
>> > >       [=>...................]  recovery =  7.9% (124671408/1562681344) finish=16.6min speed=1435737K/sec
>> > >       bitmap: 0/12 pages [0KB], 65536KB chunk
>> > >
>> > > perf.data.gz attached.
>> >
>> > Could you please run perf-report on the perf.data? I won't be
>> > able to see all the symbols without your kernel.
>> >
>> > Thanks,
>> > Song
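
For anyone reproducing the measurements in this thread, the tuning and
profiling steps boil down to roughly the sketch below. The device names,
the md0 array name, and the sysfs values are the ones used above and will
likely need adjusting; the 30-second perf sampling window is an assumption,
since the run above was stopped manually with Ctrl-C.

    # Create the 4-disk RAID5 array used in this thread
    mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 \
        /dev/nvme0n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1

    # Lift the resync speed cap (value is in KB/s); group_thread_cnt is
    # optional, and per the discussion above it can increase lock
    # contention and reduce recovery speed
    echo 3600000 > /sys/block/md0/md/sync_speed_max
    echo 8 > /sys/block/md0/md/group_thread_cnt   # optional

    # Watch the recovery rate
    cat /proc/mdstat

    # Profile with call graphs while recovery runs, then inspect hot paths
    perf record -g -- sleep 30
    perf report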