Re: Unacceptably Poor RAID1 Performance with Many CPU Cores

On Thu, Jun 15, 2023 at 4:04 PM Ali Gholami Rudi <aligrudi@xxxxxxxxx> wrote:
>
> Hi,
>
> This simple experiment reproduces the problem.
>
> Create a RAID1 array using two ramdisks of size 1G:
>
>   mdadm --create /dev/md/test --level=1 --raid-devices=2 /dev/ram0 /dev/ram1
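>
> (For reference, the ramdisks can be created by loading the brd module
> first; its rd_size parameter is in KiB, so 1048576 gives 1G:)
>
>   modprobe brd rd_nr=2 rd_size=1048576   # creates /dev/ram0 and /dev/ram1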
>
> Then use fio to test disk performance (iodepth=64 and numjobs=40;
> details at the end of this email).  This is what we get on our machine
> (two AMD EPYC 7002 CPUs, each with 64 cores, and 2TB of RAM; Linux v5.10.0):
>
> Without RAID (writing to /dev/ram0)
> READ:  IOPS=14391K BW=56218MiB/s
> WRITE: IOPS= 6167K BW=24092MiB/s
>
> RAID1 (writing to /dev/md/test)
> READ:  IOPS=  542K BW= 2120MiB/s
> WRITE: IOPS=  232K BW=  935MiB/s
>
> The difference, even for reading, is huge.
>
> I tried perf to see where the problem is; the results are included at
> the end of this email.
>
> Any ideas?

Hello Ali

Since the problem can be reproduced easily in your environment, can you
try with the latest upstream kernel? If the problem doesn't exist with
the latest upstream kernel, you can use git bisect to find which patch
fixes it, roughly as sketched below.
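
A minimal sketch of that bisect workflow (the "fixed" tag below is
hypothetical; use whichever upstream tag actually behaves well for you):

  git bisect start --term-old=broken --term-new=fixed
  git bisect broken v5.10      # kernel where the slowdown reproduces
  git bisect fixed v6.4-rc6    # hypothetical: an upstream tag without the slowdown
  # at each step: build and boot the kernel, rerun the fio test, then mark it
  git bisect fixed             # or: git bisect broken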

>
> We are actually executing hundreds of VMs on our hosts.  The problem
> is that when we use RAID1 for our enterprise NVMe disks, the
> performance degrades significantly compared to using them directly; it
> seems we hit the same bottleneck as in the test described above.

So those hundreds of VMs run on the raid1, and the raid1 is created with
NVMe disks. What does /proc/mdstat show?
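
For example (the md device name below is the one from the ramdisk test;
adjust it for the NVMe array):

  cat /proc/mdstat              # personalities, member disks, sync/bitmap state
  mdadm --detail /dev/md/test   # per-array details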

Regards
Xiao
>
> Thanks,
> Ali
>
> Perf output:
>
> Samples: 1M of event 'cycles', Event count (approx.): 1158425235997
>   Children      Self  Command  Shared Object           Symbol
> +   97.98%     0.01%  fio      fio                     [.] fio_libaio_commit
> +   97.95%     0.01%  fio      libaio.so.1.0.1         [.] io_submit
> +   97.85%     0.01%  fio      [kernel.kallsyms]       [k] __x64_sys_io_submit
> -   97.82%     0.01%  fio      [kernel.kallsyms]       [k] io_submit_one
>    - 97.81% io_submit_one
>       - 54.62% aio_write
>          - 54.60% blkdev_write_iter
>             - 36.30% blk_finish_plug
>                - flush_plug_callbacks
>                   - 36.29% raid1_unplug
>                      - flush_bio_list
>                         - 18.44% submit_bio_noacct
>                            - 18.40% brd_submit_bio
>                               - 18.13% raid1_end_write_request
>                                  - 17.94% raid_end_bio_io
>                                     - 17.82% __wake_up_common_lock
>                                        + 17.79% _raw_spin_lock_irqsave
>                         - 17.79% __wake_up_common_lock
>                            + 17.76% _raw_spin_lock_irqsave
>             + 18.29% __generic_file_write_iter
>       - 43.12% aio_read
>          - 43.07% blkdev_read_iter
>             - generic_file_read_iter
>                - 43.04% blkdev_direct_IO
>                   - 42.95% submit_bio_noacct
>                      - 42.23% brd_submit_bio
>                         - 41.91% raid1_end_read_request
>                            - 41.70% raid_end_bio_io
>                               - 41.43% __wake_up_common_lock
>                                  + 41.36% _raw_spin_lock_irqsave
>                      - 0.68% md_submit_bio
>                           0.61% md_handle_request
> +   94.90%     0.00%  fio      [kernel.kallsyms]       [k] __wake_up_common_lock
> +   94.86%     0.22%  fio      [kernel.kallsyms]       [k] _raw_spin_lock_irqsave
> +   94.64%    94.64%  fio      [kernel.kallsyms]       [k] native_queued_spin_lock_slowpath
> +   79.63%     0.02%  fio      [kernel.kallsyms]       [k] submit_bio_noacct
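>
> (The call graph above was collected with something like the following;
> "test.fio" is a placeholder for the job file below:)
>
>   perf record -g -- fio test.fio
>   perf report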
>
>
> FIO configuration file:
>
> [global]
> name=random reads and writes
> ioengine=libaio
> direct=1
> readwrite=randrw
> rwmixread=70
> iodepth=64
> buffered=0
> #filename=/dev/ram0
> filename=/dev/md/test
> size=1G
> runtime=30
> time_based
> randrepeat=0
> norandommap
> refill_buffers
> ramp_time=10
> bs=4k
> numjobs=40
> group_reporting=1
> [job1]
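>
> (Saved as e.g. test.fio -- the file name itself is arbitrary -- the job
> is run with:)
>
>   fio test.fio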
>




