Re: Huge lock contention during raid5 build time

> Which Linux version are you testing with?

Currently on CentOS Stream 10.

[root@memverge2 ~]# uname -r
6.12.0-43.el10.x86_64

I can switch to Rocky Linux 9.5 if required.

> I also remember the patch *[RFC V9]
> md/bitmap: Optimize lock contention.* [1]. It’d be great if you could
> help test it.

Oh, I had thought that the patch was already included in the mdadm
version I am using (mdadm - v4.4-13-ge0df6c4c - 2025-01-17).
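
If it helps, my understanding is that md/bitmap is kernel code, so the
series would land in the kernel tree rather than in mdadm. A quick way
to look for it (just a sketch, assuming a kernel git checkout):

# In a kernel source tree: search the md bitmap history for the series
git log --oneline --grep='lock contention' -- drivers/md/md-bitmap.c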

If the patch has not yet been applied upstream, how exactly do I do
that?

I'm not a Linux developer, but I would be glad to test that patch.
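
In case applying it myself is the way to go, a patch from
lore.kernel.org can apparently be fed straight to git. A minimal
sketch, with the message id elided (it is the one in Paul's link [1]
below):

# In a kernel source tree: fetch the raw mail and apply it as a patch
curl -sL 'https://lore.kernel.org/linux-raid/<message-id>/raw' | git am
# For a multi-patch series, b4 can fetch the whole thread instead:
#   b4 am <message-id>
# Then rebuild the kernel (or the md modules) and boot it for the test.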

Anyway, I believe that md RAID should be optimized for the latest PCIe
Gen 5.0 NVMe SSDs.
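
Not a fix for the contention itself, but for anyone comparing rebuild
numbers, the usual md tunables are worth pinning down first. A sketch
against the md0 array shown in the quoted output below; the values are
only illustrative:

# Grow the raid5 stripe cache (entries per array; costs memory)
echo 8192 > /sys/block/md0/md/stripe_cache_size
# Spread stripe handling across worker thread groups, which is meant
# to relieve contention on the per-array stripe lock
echo 4 > /sys/block/md0/md/group_thread_cnt
# Lift the global resync speed cap (KiB/s per device)
echo 4000000 > /proc/sys/dev/raid/speed_limit_max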

Anton

On Thu, 23 Jan 2025 at 16:49, Paul Menzel <pmenzel@xxxxxxxxxxxxx> wrote:
>
> Dear Anton,
>
>
> Thank you for your report.
>
> On 23.01.25 at 14:56, Anton Gavriliuk wrote:
>
> > I'm building an mdadm RAID5 array (3+1), based on Intel P4600 NVMe SSDs.
> >
> > mdadm is the latest development version:
> >
> > [root@memverge2 ~]# /home/anton/mdadm/mdadm --version
> > mdadm - v4.4-13-ge0df6c4c - 2025-01-17
> >
> > The maximum rebuild speed I saw was ~1.4 GB/s.
> >
> > [root@memverge2 md]# cat /proc/mdstat
> > Personalities : [raid6] [raid5] [raid4]
> > md0 : active raid5 nvme0n1[4] nvme2n1[2] nvme3n1[1] nvme4n1[0]
> >        4688044032 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
> >        [==============>......]  recovery = 71.8% (1122726044/1562681344) finish=5.1min speed=1428101K/sec
> >        bitmap: 0/12 pages [0KB], 65536KB chunk
> >
> > perf top shows huge spinlock contention:
> >
> > Samples: 180K of event 'cycles:P', 4000 Hz, Event count (approx.):
> > 175146370188 lost: 0/0 drop: 0/0
> > Overhead  Shared Object                             Symbol
> >    38.23%  [kernel]                                  [k] native_queued_spin_lock_slowpath
> >     8.33%  [kernel]                                  [k] analyse_stripe
> >     6.85%  [kernel]                                  [k] ops_run_io
> >     3.95%  [kernel]                                  [k] intel_idle_irq
> >     3.41%  [kernel]                                  [k] xor_avx_4
> >     2.76%  [kernel]                                  [k] handle_stripe
> >     2.37%  [kernel]                                  [k] raid5_end_read_request
> >     1.97%  [kernel]                                  [k] find_get_stripe
> >
> > Samples: 1M of event 'cycles:P', 4000 Hz, Event count (approx.): 717038747938
> > native_queued_spin_lock_slowpath  /proc/kcore [Percent: local period]
> > Percent │       testl     %eax,%eax
> >          │     ↑ je        234
> >          │     ↑ jmp       23e
> >     0.00 │248:   shrl      $0x12, %ecx
> >          │       andl      $0x3,%eax
> >     0.00 │       subl      $0x1,%ecx
> >     0.00 │       shlq      $0x5, %rax
> >     0.00 │       movslq    %ecx,%rcx
> >          │       addq      $0x36ec0,%rax
> >     0.01 │       addq      -0x7b67b2a0(,%rcx,8),%rax
> >     0.02 │       movq      %rdx,(%rax)
> >     0.00 │       movl      0x8(%rdx),%eax
> >     0.00 │       testl     %eax,%eax
> >          │     ↓ jne       279
> >    62.27 │270:   pause
> >    17.49 │       movl      0x8(%rdx),%eax
> >     0.00 │       testl     %eax,%eax
> >     1.66 │     ↑ je        270
> >     0.02 │279:   movq      (%rdx),%rcx
> >     0.00 │       testq     %rcx,%rcx
> >          │     ↑ je        202
> >     0.02 │       prefetchw (%rcx)
> >          │     ↑ jmp       202
> >     0.00 │289:   movl      $0x1,%esi
> >     0.02 │       lock
> >          │       cmpxchgl  %esi,(%rbx)
> >          │     ↑ je        129
> >          │     ↑ jmp       20e
> >
> > Are there any plans to reduce this spinlock contention?
> >
> > The latest PCIe 5.0 NVMe SSDs have tremendous performance
> > characteristics, but this huge spinlock contention just kills that
> > performance.
>
> Which Linux version are you testing with? A lot of work has gone into
> this area over the last two years. I also remember the patch *[RFC V9]
> md/bitmap: Optimize lock contention.* [1]. It’d be great if you could
> help test it.
>
>
> Kind regards,
>
> Paul
>
>
> [1]:
> https://lore.kernel.org/linux-raid/DM6PR12MB319444916C454CDBA6FCD358D83D2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
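
For anyone reproducing the profiles quoted above, they look like
standard perf output; a sketch of the capture, assuming a perf build
matching the running kernel:

# Live profile, as in the first listing
perf top -e 'cycles:P'
# Record system-wide, then annotate the hot symbol, as in the second listing
perf record -e 'cycles:P' -a -- sleep 30
perf annotate --stdio native_queued_spin_lock_slowpath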
