I'm not sure; fio might be showing some bogus values in the summary, so
I'll check the readings again tomorrow. Another thing I noticed is that
writes seem bandwidth-limited and don't scale well with block size and/or
number of threads, i.e. one client writes at about the same speed
regardless of the benchmark settings. A person on reddit, where I asked
this question as well, suggested that in a replicated pool writes and
reads are handled by the PG's primary OSD, which would explain this write
bandwidth limit.
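
For what it's worth, below is roughly the sweep I plan to run to confirm
this - just a sketch that reuses the same fio options as the benchmarks
quoted further down; the sizes and runtimes are arbitrary placeholders,
and it assumes it's started from a directory on the RBD-backed filesystem:

# Sketch: sweep block size and job count to see whether single-client
# write bandwidth really stays flat regardless of the settings.
for bs in 4k 64k 1M 4M; do
  for jobs in 1 4 16; do
    fio --name=sweep-${bs}-${jobs}j --ioengine=posixaio --rw=write \
        --bs=${bs} --numjobs=${jobs} --size=1g --iodepth=1 \
        --runtime=30 --time_based --end_fsync=1
  done
done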

/Z

On Tue, 5 Oct 2021, 22:31 Christian Wuerdig, <christian.wuerdig@xxxxxxxxx>
wrote:

> Maybe some info is missing, but 7k write IOPS at 4k block size seems
> fairly decent (as you also state) - the bandwidth automatically follows
> from that, so I'm not sure what you're expecting?
> I am a bit puzzled, though - by my math, 7k IOPS at 4k should only be
> about 27 MiB/sec, so I'm not sure how the 120 MiB/sec was achieved.
> The read benchmark seems in line: 13k IOPS at 4k makes around 52 MiB/sec
> of bandwidth, which again is as expected.
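>
> To spell the arithmetic out (back-of-the-envelope only, and it assumes
> the reported IOPS figures are all raw 4 KiB requests):
>
> # bandwidth = IOPS x block size
> echo '7000 * 4 / 1024'  | bc -l   # ~27.3 MiB/s for 7k write IOPS at 4k
> echo '13000 * 4 / 1024' | bc -l   # ~50.8 MiB/s for 13k read IOPS at 4k
> # conversely, 120 MiB/s of 4k writes would require roughly:
> echo '120 * 1024 / 4'   | bc      # 30720 IOPS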
>
> On Wed, 6 Oct 2021 at 04:08, Zakhar Kirpichenko <zakhar@xxxxxxxxx> wrote:
>
>> Hi,
>>
>> I built a CEPH 16.2.x cluster with relatively fast and modern hardware,
>> and its performance is kind of disappointing. I would very much
>> appreciate advice and/or pointers :-)
>>
>> The hardware is 3 x Supermicro SSG-6029P nodes, each equipped with:
>>
>> 2 x Intel(R) Xeon(R) Gold 5220R CPUs
>> 384 GB RAM
>> 2 x boot drives
>> 2 x 1.6 TB Micron 7300 MTFDHBE1T6TDG drives (DB/WAL)
>> 2 x 6.4 TB Micron 7300 MTFDHBE6T4TDG drives (storage tier)
>> 9 x Toshiba MG06SCA10TE 9TB HDDs, write cache off (storage tier)
>> 2 x Intel XL710 NICs connected to a pair of 40/100GE switches
>>
>> All 3 nodes are running Ubuntu 20.04 LTS with the latest 5.4 kernel,
>> apparmor is disabled, and energy-saving features are disabled. The
>> network between the CEPH nodes is 40G, the CEPH access network is 40G,
>> and the average latencies are < 0.15 ms. I've personally tested the
>> network for throughput, latency and loss, and can tell that it's
>> operating as expected and doesn't exhibit any issues at idle or under
>> load.
>>
>> The CEPH cluster is set up with 2 storage classes, NVME and HDD, with
>> the 2 smaller NVME drives in each node used as DB/WAL and each HDD
>> allocated a DB/WAL slice on one of them.
>>
>> ceph osd tree output:
>>
>> ID   CLASS  WEIGHT     TYPE NAME              STATUS  REWEIGHT  PRI-AFF
>> -1          288.37488  root default
>> -13         288.37488    datacenter ste
>> -14         288.37488      rack rack01
>> -7           96.12495        host ceph01
>>  0   hdd      9.38680          osd.0              up   1.00000  1.00000
>>  1   hdd      9.38680          osd.1              up   1.00000  1.00000
>>  2   hdd      9.38680          osd.2              up   1.00000  1.00000
>>  3   hdd      9.38680          osd.3              up   1.00000  1.00000
>>  4   hdd      9.38680          osd.4              up   1.00000  1.00000
>>  5   hdd      9.38680          osd.5              up   1.00000  1.00000
>>  6   hdd      9.38680          osd.6              up   1.00000  1.00000
>>  7   hdd      9.38680          osd.7              up   1.00000  1.00000
>>  8   hdd      9.38680          osd.8              up   1.00000  1.00000
>>  9   nvme     5.82190          osd.9              up   1.00000  1.00000
>> 10   nvme     5.82190          osd.10             up   1.00000  1.00000
>> -10          96.12495        host ceph02
>> 11   hdd      9.38680          osd.11             up   1.00000  1.00000
>> 12   hdd      9.38680          osd.12             up   1.00000  1.00000
>> 13   hdd      9.38680          osd.13             up   1.00000  1.00000
>> 14   hdd      9.38680          osd.14             up   1.00000  1.00000
>> 15   hdd      9.38680          osd.15             up   1.00000  1.00000
>> 16   hdd      9.38680          osd.16             up   1.00000  1.00000
>> 17   hdd      9.38680          osd.17             up   1.00000  1.00000
>> 18   hdd      9.38680          osd.18             up   1.00000  1.00000
>> 19   hdd      9.38680          osd.19             up   1.00000  1.00000
>> 20   nvme     5.82190          osd.20             up   1.00000  1.00000
>> 21   nvme     5.82190          osd.21             up   1.00000  1.00000
>> -3           96.12495        host ceph03
>> 22   hdd      9.38680          osd.22             up   1.00000  1.00000
>> 23   hdd      9.38680          osd.23             up   1.00000  1.00000
>> 24   hdd      9.38680          osd.24             up   1.00000  1.00000
>> 25   hdd      9.38680          osd.25             up   1.00000  1.00000
>> 26   hdd      9.38680          osd.26             up   1.00000  1.00000
>> 27   hdd      9.38680          osd.27             up   1.00000  1.00000
>> 28   hdd      9.38680          osd.28             up   1.00000  1.00000
>> 29   hdd      9.38680          osd.29             up   1.00000  1.00000
>> 30   hdd      9.38680          osd.30             up   1.00000  1.00000
>> 31   nvme     5.82190          osd.31             up   1.00000  1.00000
>> 32   nvme     5.82190          osd.32             up   1.00000  1.00000
>>
>> ceph df:
>>
>> --- RAW STORAGE ---
>> CLASS  SIZE     AVAIL    USED    RAW USED  %RAW USED
>> hdd    253 TiB  241 TiB  13 TiB    13 TiB       5.00
>> nvme    35 TiB   35 TiB  82 GiB    82 GiB       0.23
>> TOTAL  288 TiB  276 TiB  13 TiB    13 TiB       4.42
>>
>> --- POOLS ---
>> POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
>> images                 12  256   24 GiB    3.15k   73 GiB   0.03     76 TiB
>> volumes                13  256  839 GiB  232.16k  2.5 TiB   1.07     76 TiB
>> backups                14  256   31 GiB    8.56k   94 GiB   0.04     76 TiB
>> vms                    15  256  752 GiB  198.80k  2.2 TiB   0.96     76 TiB
>> device_health_metrics  16   32   35 MiB       39  106 MiB      0     76 TiB
>> volumes-nvme           17  256   28 GiB    7.21k   81 GiB   0.24     11 TiB
>> ec-volumes-meta        18  256   27 KiB        4   92 KiB      0     76 TiB
>> ec-volumes-data        19  256    8 KiB        1   12 KiB      0    152 TiB
>>
>> Please disregard the ec-pools, as they're not currently in use. All
>> other pools are configured with min_size=2, size=3. All pools are bound
>> to HDD storage except for 'volumes-nvme', which is bound to NVME. The
>> number of PGs was increased recently, as the autoscaler was giving me a
>> very uneven PG distribution across devices, and we're expecting to add
>> 3 more nodes of exactly the same configuration in the coming weeks. I
>> have to emphasize that I tested different PG numbers and they didn't
>> have a noticeable impact on the cluster performance.
>>
>> The main issue is that this beautiful cluster isn't very fast. When I
>> test against the 'volumes' pool, residing on the HDD storage class
>> (HDDs with DB/WAL on NVME), I get unexpectedly low throughput numbers:
>>
>> > rados -p volumes bench 30 write --no-cleanup
>> ...
>> Total time run:         30.3078
>> Total writes made:      3731
>> Write size:             4194304
>> Object size:            4194304
>> Bandwidth (MB/sec):     492.415
>> Stddev Bandwidth:       161.777
>> Max bandwidth (MB/sec): 820
>> Min bandwidth (MB/sec): 204
>> Average IOPS:           123
>> Stddev IOPS:            40.4442
>> Max IOPS:               205
>> Min IOPS:               51
>> Average Latency(s):     0.129115
>> Stddev Latency(s):      0.143881
>> Max latency(s):         1.35669
>> Min latency(s):         0.0228179
>>
>> > rados -p volumes bench 30 seq --no-cleanup
>> ...
>> Total time run:       14.7272
>> Total reads made:     3731
>> Read size:            4194304
>> Object size:          4194304
>> Bandwidth (MB/sec):   1013.36
>> Average IOPS:         253
>> Stddev IOPS:          63.8709
>> Max IOPS:             323
>> Min IOPS:             91
>> Average Latency(s):   0.0625202
>> Max latency(s):       0.551629
>> Min latency(s):       0.010683
>>
>> On average, I get around 550 MB/s writes and 800 MB/s reads with 16
>> threads and 4MB blocks. The numbers don't look fantastic for this
>> hardware. I can actually push over 8 GB/s of throughput with fio, 16
>> threads and 4MB blocks from an RBD client (a KVM Linux VM) connected
>> over a low-latency 40G network, probably hitting some OSD caches there:
>>
>> READ: bw=8525MiB/s (8939MB/s), 58.8MiB/s-1009MiB/s (61.7MB/s-1058MB/s),
>> io=501GiB (538GB), run=60001-60153msec
>> Disk stats (read/write):
>>   vdc: ios=48163/0, merge=6027/0, ticks=1400509/0, in_queue=1305092,
>> util=99.48%
>>
>> The issue manifests when the same client does something closer to
>> real-life usage, like a single-thread write or read with 4KB blocks, as
>> if using, for example, an ext4 file system:
>>
>> > fio --name=ttt --ioengine=posixaio --rw=write --bs=4k --numjobs=1
>> --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
>> ...
>> Run status group 0 (all jobs):
>>   WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s),
>> io=7694MiB (8067MB), run=64079-64079msec
>> Disk stats (read/write):
>>   vdc: ios=0/6985, merge=0/406, ticks=0/3062535, in_queue=3048216,
>> util=77.31%
>>
>> > fio --name=ttt --ioengine=posixaio --rw=read --bs=4k --numjobs=1
>> --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
>> ...
>> Run status group 0 (all jobs):
>>   READ: bw=54.0MiB/s (56.7MB/s), 54.0MiB/s-54.0MiB/s (56.7MB/s-56.7MB/s),
>> io=3242MiB (3399MB), run=60001-60001msec
>> Disk stats (read/write):
>>   vdc: ios=12952/3, merge=0/1, ticks=81706/1, in_queue=56336, util=99.13%
>>
>> And this is a total disaster: the IOPS look decent, but the bandwidth
>> is unexpectedly very, very low. I just don't understand why a single
>> RBD client writes at only 120 MB/s (sometimes slower), and the 50 MB/s
>> reads look like a bad joke ¯\_(ツ)_/¯
>>
>> When I run these benchmarks, nothing seems to be overloaded: things
>> like CPU and network are barely utilized, and OSD latencies don't show
>> anything unusual. Thus I am puzzled by these results, as in my opinion
>> SAS HDDs with DB/WAL on NVME drives should produce better I/O
>> bandwidth, both for writes and reads. I mean, I can easily get much
>> better performance from a single HDD shared over the network via NFS
>> or iSCSI.
>>
>> I am open to suggestions and would very much appreciate comments and/or
>> advice on how to improve the cluster performance.
>>
>> Best regards,
>> Zakhar
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx