Tuning CephFS on NVMe for HPC / IO500

Dear all,

I am currently creating a CephFS setup for an HPC setting. I have a Ceph
v17.2.5 cluster on Rocky Linux 8.7 (kernel 4.18.0-425.3.1.el8.x86_64)
deployed with cephadm. There are 10 Ceph nodes with a 2x100GbE LAG
interconnect each and 36 client nodes with a 2x25GbE LAG interconnect
each. We have Dell NOS10 switches deployed in VLT pairs. Overall, the
network topology looks as follows.

36 clients -- switch pair -- switch pair -- switch-pair -- 10 Ceph nodes

The switch pairs are each connected to the next with an 8x100GbE LAG.
Thus, the theoretical limit across the inter-switch links is ~100 GB/s
raw, or roughly 90 GiB/s of usable payload.

The client nodes also run Rocky Linux 8 and have 2x Intel(R) Xeon(R) Gold
6240R CPUs @ 2.40GHz. The Ceph nodes have 1x AMD EPYC 7413 24-core
processor and 250GB of RAM. All processors have hyperthreading enabled. I
have followed the guidance by Croit [1] and done the obvious hardware
tuning (configured the BIOS to let the OS do the power control, set up
the network with MTU 9000). I have deployed 3 MDS daemons per server and
have 20 active MDS ranks overall.
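
For reference, a rough sketch of how that MDS layout can be expressed
with cephadm; the filesystem name ("cephfs"), the host label, and the
file name are illustrative assumptions, not the exact spec I applied:

```
# mds.yaml -- run 3 MDS daemons on every host carrying the (assumed) "mds" label
cat > mds.yaml <<'EOF'
service_type: mds
service_id: cephfs
placement:
  label: mds
  count_per_host: 3
EOF
ceph orch apply -i mds.yaml
# 20 active ranks; the remaining daemons stay as standbys
ceph fs set cephfs max_mds 20
```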

The Ceph cluster nodes have 10x enterprise NVMe drives each (all branded
as "Dell enterprise disks"). The 8 older nodes (from last year) have
"Dell Ent NVMe v2 AGN RI U.2 15.36TB" drives, which are Samsung; the 2
newer nodes (just delivered) have "Dell Ent NVMe CM6 RI 15.36TB" drives,
which are Kioxia. Interestingly, the Kioxia drives show about 50% higher
IOPS in the 4-process fio test that Croit suggests.
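
For anyone who wants to reproduce the drive comparison: it is a 4-job fio
run roughly along the lines below. Block size, queue depth, and runtime
are illustrative here; the Croit post [1] has the exact invocation, and
the run destroys data on the target device.

```
# 4 parallel jobs of 4 KiB random writes with direct I/O -- wipes the device!
fio --name=nvme-iops-check --filename=/dev/nvme0n1 --direct=1 \
    --ioengine=libaio --rw=randwrite --bs=4k --iodepth=32 \
    --numjobs=4 --runtime=60 --time_based --group_reporting
```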

I'm running the IO500 benchmark with 10 processes per client (360
processes overall). I have set up data pools with rep-1, rep-2, rep-3,
and EC 8+2 and run the benchmark against each.
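
For completeness, the per-pool runs are wired up roughly as follows; pool
names, PG counts, and the mount path are illustrative placeholders rather
than the exact commands used:

```
# data pools with different redundancy, all attached to the same CephFS
ceph osd erasure-code-profile set ec-8-2 k=8 m=2
ceph osd pool create cephfs.data.ec-8-2 1024 1024 erasure ec-8-2
ceph osd pool set cephfs.data.ec-8-2 allow_ec_overwrites true
ceph osd pool create cephfs.data.rep-2 1024 1024 replicated
ceph osd pool set cephfs.data.rep-2 size 2
# (size 1 for the rep-1 pool additionally needs mon_allow_pool_size_one=true)
ceph fs add_data_pool cephfs cephfs.data.ec-8-2
ceph fs add_data_pool cephfs cephfs.data.rep-2
# each benchmark directory is pinned to one data pool via its file layout
mkdir -p /mnt/cephfs/io500-ec-8-2
setfattr -n ceph.dir.layout.pool -v cephfs.data.ec-8-2 /mnt/cephfs/io500-ec-8-2
```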

So far, I have run "only" short tests with an IO500 stonewall time of 30
seconds. The good news is that I see "ior-easy-write" results of close to
80 GiB/s, so the Ceph cluster is able to saturate the inter-switch links.
The bad news is that I cannot replicate the IO500 results published by
Red Hat in 2020.
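
"Short" here just means the stonewall timer in the io500 ini is turned
down from the 300 seconds required for a valid submission, which is
presumably also why several write phases in the outputs below are flagged
[INVALID]. A minimal sketch, assuming the io500-sc22 ini keys and with a
placeholder datadir:

```
[global]
datadir = /mnt/cephfs/io500
# a valid submission requires stonewall-time >= 300
stonewall-time = 30
```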

Below are the results that I get on the rep-1 pool.

```
IO500 version io500-sc22_v2 (standard)
[RESULT]       ior-easy-write       78.772830 GiB/s : time 97.098 seconds [INVALID]
[RESULT]    mdtest-easy-write       37.375945 kIOPS : time 870.934 seconds
[      ]            timestamp        0.000000 kIOPS : time 0.000 seconds
[RESULT]       ior-hard-write        2.242241 GiB/s : time 35.431 seconds [INVALID]
[RESULT]    mdtest-hard-write        2.575028 kIOPS : time 57.697 seconds [INVALID]
[RESULT]                 find     1072.770588 kIOPS : time 30.441 seconds
[RESULT]        ior-easy-read       64.118118 GiB/s : time 118.982 seconds
[RESULT]     mdtest-easy-stat      154.903631 kIOPS : time 210.887 seconds
[RESULT]        ior-hard-read        4.285418 GiB/s : time 18.474 seconds
[RESULT]     mdtest-hard-stat       40.126159 kIOPS : time 4.646 seconds
[RESULT]   mdtest-easy-delete       39.296673 kIOPS : time 839.509 seconds
[RESULT]     mdtest-hard-read       17.161306 kIOPS : time 9.505 seconds
[RESULT]   mdtest-hard-delete        4.771440 kIOPS : time 31.931 seconds
[SCORE ] Bandwidth 14.842537 GiB/s : IOPS 34.623082 kiops : TOTAL 22.669239 [INVALID]
```

I wonder whether I missed any tuning parameters or other "secret sauce"
that enabled the results from [2]:

```
[RESULT] BW   phase 1            ior_easy_write               36.255 GiB/s : time 387.94 seconds
[RESULT] IOPS phase 1         mdtest_easy_write              191.980 kiops : time 450.05 seconds
[RESULT] BW   phase 2            ior_hard_write                9.137 GiB/s : time 301.21 seconds
[RESULT] IOPS phase 2         mdtest_hard_write               17.187 kiops : time 393.55 seconds
[RESULT] IOPS phase 3                      find              965.790 kiops : time  96.46 seconds
[RESULT] BW   phase 3             ior_easy_read               75.621 GiB/s : time 185.75 seconds
[RESULT] IOPS phase 4          mdtest_easy_stat              903.112 kiops : time  95.67 seconds
[RESULT] BW   phase 4             ior_hard_read               19.080 GiB/s : time 144.22 seconds
[RESULT] IOPS phase 5          mdtest_hard_stat               97.399 kiops : time  69.44 seconds
[RESULT] IOPS phase 6        mdtest_easy_delete              123.455 kiops : time 699.85 seconds
[RESULT] IOPS phase 7          mdtest_hard_read               87.512 kiops : time  77.29 seconds
[RESULT] IOPS phase 8        mdtest_hard_delete               18.814 kiops : time 390.91 seconds
[SCORE] Bandwidth 26.2933 GiB/s : IOPS 124.297 kiops : TOTAL 57.168
```

It looks like my results are rather in the same ballpark as the SUSE
results from 2019 [3].

```
[RESULT] BW   phase 1            ior_easy_write               16.072 GB/s : time 347.39 seconds
[RESULT] IOPS phase 1         mdtest_easy_write               32.822 kiops : time 365.67 seconds
[RESULT] BW   phase 2            ior_hard_write                1.572 GB/s : time 359.20 seconds
[RESULT] IOPS phase 2         mdtest_hard_write               12.917 kiops : time 317.70 seconds
[RESULT] IOPS phase 3                      find              250.500 kiops : time  64.28 seconds
[RESULT] BW   phase 3             ior_easy_read                9.139 GB/s : time 600.48 seconds
[RESULT] IOPS phase 4          mdtest_easy_stat              127.919 kiops : time  93.82 seconds
[RESULT] BW   phase 4             ior_hard_read                4.698 GB/s : time 120.17 seconds
[RESULT] IOPS phase 5          mdtest_hard_stat               68.791 kiops : time  59.65 seconds
[RESULT] IOPS phase 6        mdtest_easy_delete               20.845 kiops : time 575.70 seconds
[RESULT] IOPS phase 7          mdtest_hard_read               41.640 kiops : time  98.55 seconds
[RESULT] IOPS phase 8        mdtest_hard_delete                6.224 kiops : time 660.50 seconds
[SCORE] Bandwidth 5.73936 GB/s : IOPS 38.7169 kiops : TOTAL 14.9067
```

One difference I could find is that the Red Hat results use the CEPHFS
backend of IO500 (which I cannot get to work properly because of a crash
in libucs.so: "Caught signal 11 (Segmentation fault: address not mapped
to object at address (nil))"). SUSE used the POSIX backend.
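
For reference, selecting the CEPHFS backend looks roughly like the sketch
below; the configure flag and the cephfs.* option names are assumptions
based on recent IOR versions and may differ:

```
# build IOR/io500 with the libcephfs backend enabled
./configure --with-cephfs && make
# standalone ior run, bypassing io500, to narrow down where the crash happens
mpirun -np 4 ./src/ior -a CEPHFS -w -r -t 1m -b 1g -o testfile \
    --cephfs.user=admin --cephfs.conf=/etc/ceph/ceph.conf --cephfs.prefix=/io500
```

Since libucs.so belongs to UCX, the segfault may well come from the
MPI/UCX layer rather than from libcephfs itself, so trying an MPI build
without UCX could be another data point.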

Changing from 1 OSD daemon per NVMe to 2 did not help much either.
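
For reference, the 2-OSDs-per-NVMe variant is just a cephadm OSD service
spec along these lines (service id and host selection are placeholders):

```
# osd-spec.yaml -- create two OSDs on every non-rotational device
cat > osd-spec.yaml <<'EOF'
service_type: osd
service_id: two_osds_per_nvme
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 0
  osds_per_device: 2
EOF
ceph orch apply -i osd-spec.yaml
```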

Maybe someone on the list has an idea for something else to try?

Oh, in case anyone is interested, here are some results using the rep-2,
rep-3, and ec-8-2 pools.

```
                      *** pool=rep-2 NP=360 ***
IO500 version io500-sc22_v2 (standard)
[RESULT]       ior-easy-write       39.613736 GiB/s : time 153.508 seconds [INVALID]
[RESULT]    mdtest-easy-write       13.932462 kIOPS : time 38.119 seconds [INVALID]
[      ]            timestamp        0.000000 kIOPS : time 0.000 seconds
[RESULT]       ior-hard-write        1.809117 GiB/s : time 39.019 seconds [INVALID]
[RESULT]    mdtest-hard-write        4.925225 kIOPS : time 37.654 seconds [INVALID]
[RESULT]                 find       69.063353 kIOPS : time 9.042 seconds
[RESULT]        ior-easy-read       59.503973 GiB/s : time 102.166 seconds
[RESULT]     mdtest-easy-stat      143.589003 kIOPS : time 4.097 seconds
[RESULT]        ior-hard-read        4.104325 GiB/s : time 14.868 seconds
[RESULT]     mdtest-hard-stat      156.252159 kIOPS : time 2.204 seconds
[RESULT]   mdtest-easy-delete       35.312782 kIOPS : time 14.249 seconds
[RESULT]     mdtest-hard-read       67.097465 kIOPS : time 3.739 seconds
[RESULT]   mdtest-hard-delete       11.018869 kIOPS : time 18.060 seconds
[SCORE ] Bandwidth 11.502045 GiB/s : IOPS 35.927582 kiops : TOTAL 20.328322 [INVALID]


                      *** pool=rep-3 NP=360 ***
IO500 version io500-sc22_v2 (standard)
[RESULT]       ior-easy-write       27.481332 GiB/s : time 204.973 seconds [INVALID]
[RESULT]    mdtest-easy-write       27.699574 kIOPS : time 1502.596 seconds
[      ]            timestamp        0.000000 kIOPS : time 0.000 seconds
[RESULT]       ior-hard-write        1.352186 GiB/s : time 38.273 seconds [INVALID]
[RESULT]    mdtest-hard-write        3.024279 kIOPS : time 48.923 seconds [INVALID]
[RESULT]                 find      777.440295 kIOPS : time 53.684 seconds
[RESULT]        ior-easy-read       58.686272 GiB/s : time 95.992 seconds
[RESULT]     mdtest-easy-stat      156.499256 kIOPS : time 266.755 seconds
[RESULT]        ior-hard-read        4.095575 GiB/s : time 12.649 seconds
[RESULT]     mdtest-hard-stat       62.831560 kIOPS : time 3.318 seconds
[RESULT]   mdtest-easy-delete       25.909017 kIOPS : time 1606.960 seconds
[RESULT]     mdtest-hard-read       16.586529 kIOPS : time 9.735 seconds
[RESULT]   mdtest-hard-delete        9.093536 kIOPS : time 18.615 seconds
[SCORE ] Bandwidth 9.721458 GiB/s : IOPS 35.464915 kiops : TOTAL 18.568002 [INVALID]


                      *** pool=ec-8-2 NP=360 ***
IO500 version io500-sc22_v2 (standard)
[RESULT]       ior-easy-write       40.480456 GiB/s : time 151.451 seconds [INVALID]
[RESULT]    mdtest-easy-write       32.507690 kIOPS : time 444.424 seconds
[      ]            timestamp        0.000000 kIOPS : time 0.000 seconds
[RESULT]       ior-hard-write        0.570092 GiB/s : time 35.986 seconds [INVALID]
[RESULT]    mdtest-hard-write        3.287144 kIOPS : time 40.114 seconds [INVALID]
[RESULT]                 find     1779.068273 kIOPS : time 8.177 seconds
[RESULT]        ior-easy-read       56.463968 GiB/s : time 108.661 seconds
[RESULT]     mdtest-easy-stat      179.334380 kIOPS : time 81.380 seconds
[RESULT]        ior-hard-read        1.957840 GiB/s : time 10.484 seconds
[RESULT]     mdtest-hard-stat       92.430508 kIOPS : time 2.402 seconds
[RESULT]   mdtest-easy-delete       29.549239 kIOPS : time 489.285 seconds
[RESULT]     mdtest-hard-read       26.989114 kIOPS : time 5.770 seconds
[RESULT]   mdtest-hard-delete       26.500674 kIOPS : time 6.038 seconds
[SCORE ] Bandwidth 7.106974 GiB/s : IOPS 53.448254 kiops : TOTAL 19.489879 [INVALID]
```

Best wishes,
Manuel

[1] https://croit.io/blog/ceph-performance-test-and-optimization
[2] https://io500.org/submissions/view/82
[3] https://io500.org/submissions/view/141