Hello everyone, I need some help with our Ceph 16.2.5 cluster used as an iSCSI target for ESXi nodes.

Background info:
- we have built 3 OSD nodes with 60 BlueStore OSDs: 60x 6 TB spinning disks, 12 SSDs and 3 NVMe devices
- the OSD nodes have 32 cores and 256 GB RAM each
- the OSD disks are connected to a SCSI RAID controller ... each disk is configured as a single-disk RAID0 with write-back enabled to use the RAID controller cache
- we have 3 MONs and 2 iSCSI gateways
- all servers are connected to a 10 Gbit network (switches)
- all servers have two 10 Gbit network adapters configured as bond-rr
- we created one RBD pool with autoscaling and 128 PGs (at the moment)
- the pool currently contains 5 RBD images: 2x 10 TB and 3x 500 GB, with the exclusive-lock feature and striping v2 (4 MB object size / 1 MB stripe unit / stripe count 4); a rough sketch of the create command is further down
- all images are attached to the two iSCSI gateways running tcmu-runner 1.5.4 and exposed as iSCSI targets
- we have 6 ESXi 6.7u3 servers as compute nodes connected to the Ceph iSCSI target

ESXi iSCSI config:

esxcli system settings advanced set -o /ISCSI/MaxIoSizeKB -i 512
esxcli system module parameters set -m iscsi_vmk -p iscsivmk_LunQDepth=64
esxcli system module parameters set -m iscsi_vmk -p iscsivmk_HostQDepth=64
esxcli system settings advanced set --int-value 1 --option /DataMover/HardwareAcceleratedMove

The OSD nodes, MONs, RGW/iSCSI gateways and ESXi nodes are all connected to the 10 Gbit network with bond-rr.

RBD benchmark test:

root@cd133-ceph-osdh-01:~# rados bench -p rbd 10 write
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_cd133-ceph-osdh-01_87894
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16        69        53   211.987       212     0.250578    0.249261
    2      16       129       113   225.976       240     0.296519    0.266439
    3      16       183       167   222.641       216     0.219422    0.273838
    4      16       237       221   220.974       216     0.469045     0.28091
    5      16       292       276   220.773       220     0.249321     0.27565
    6      16       339       323   215.307       188     0.205553     0.28624
    7      16       390       374   213.688       204     0.188404    0.290426
    8      16       457       441   220.472       268     0.181254    0.286525
    9      16       509       493   219.083       208     0.250538    0.286832
   10      16       568       552   220.772       236     0.307829    0.286076
Total time run:         10.2833
Total writes made:      568
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     220.941
Stddev Bandwidth:       22.295
Max bandwidth (MB/sec): 268
Min bandwidth (MB/sec): 188
Average IOPS:           55
Stddev IOPS:            5.57375
Max IOPS:               67
Min IOPS:               47
Average Latency(s):     0.285903
Stddev Latency(s):      0.115162
Max latency(s):         0.88187
Min latency(s):         0.119276
Cleaning up (deleting benchmark objects)
Removed 568 objects
Clean up completed and total clean up time: 3.18627

The benchmark shows that at least ~250 MB/s is possible... and I have actually seen much more, up to 550 MB/s.

If I start iftop on one OSD node, I see the Ceph iSCSI gateways (their hostnames show up as rgw) and the traffic is nearly 80 MB/s:
[image: grafik] <https://user-images.githubusercontent.com/54031716/134509089-2c218b23-7460-4cdb-b54a-e660c91d599e.png>

The Ceph dashboard shows that the iSCSI write performance is only 40 MB/s; the maximum I saw was between 40 and 60 MB/s - very poor:
[image: grafik] <https://user-images.githubusercontent.com/54031716/134509280-17c6b4b1-d740-43c9-9b8b-bb77333357a0.png>

If I look at the vCenter and ESXi datastore performance, I see very high storage device latencies between 50 and 100 ms - very bad:
[image: grafik] <https://user-images.githubusercontent.com/54031716/134509746-c9971592-4129-4f27-a36b-25d50035d437.png>
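For reference, the striped images were created roughly like the first command below (reconstructed from memory), and an equivalent image-level write test directly via librbd, bypassing the iSCSI path, would look like the second command. Image name, size and test length are placeholders, not the real values:

# image name, size and test length are placeholders
rbd create rbd/esxi-lun-01 --size 10T --object-size 4M --stripe-unit 1M --stripe-count 4 --image-feature exclusive-lock
rbd bench --io-type write --io-size 4M --io-threads 16 --io-total 10G --io-pattern seq rbd/esxi-lun-01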
root@cd133-ceph-mon-01:/home/cephadm# ceph config dump
WHO         MASK       LEVEL     OPTION                                        VALUE                                          RO
global                 basic     container_image                               docker.io/ceph/ceph@sha256:829ebf54704f2d827de00913b171e5da741aad9b53c1f35ad59251524790eceb  *
global                 advanced  journal_max_write_bytes                       1073714824
global                 advanced  journal_max_write_entries                     10000
global                 advanced  mon_osd_cache_size                            1024
global                 dev       osd_client_watch_timeout                      15
global                 dev       osd_heartbeat_interval                        5
global                 advanced  osd_map_cache_size                            128
global                 advanced  osd_max_write_size                            512
global                 advanced  rados_osd_op_timeout                          5
global                 advanced  rbd_cache_max_dirty                           134217728
global                 advanced  rbd_cache_max_dirty_age                       5.000000
global                 advanced  rbd_cache_size                                268435456
global                 advanced  rbd_op_threads                                2
mon                    advanced  auth_allow_insecure_global_id_reclaim         false
mon                    advanced  cluster_network                               10.50.50.0/24                                  *
mon                    advanced  public_network                                10.50.50.0/24                                  *
mgr                    advanced  mgr/cephadm/container_init                    True                                           *
mgr                    advanced  mgr/cephadm/device_enhanced_scan              true                                           *
mgr                    advanced  mgr/cephadm/migration_current                 2                                              *
mgr                    advanced  mgr/cephadm/warn_on_stray_daemons             false                                          *
mgr                    advanced  mgr/cephadm/warn_on_stray_hosts               false                                          *
mgr                    advanced  mgr/dashboard/10.50.50.21/server_addr                                                        *
mgr                    advanced  mgr/dashboard/ALERTMANAGER_API_HOST           http://10.221.133.161:9093                     *
mgr                    advanced  mgr/dashboard/GRAFANA_API_SSL_VERIFY          false                                          *
mgr                    advanced  mgr/dashboard/GRAFANA_API_URL                 https://10.221.133.161:3000                    *
mgr                    advanced  mgr/dashboard/ISCSI_API_SSL_VERIFICATION      true                                           *
mgr                    advanced  mgr/dashboard/NAME/server_port                80                                             *
mgr                    advanced  mgr/dashboard/PROMETHEUS_API_HOST             http://10.221.133.161:9095                     *
mgr                    advanced  mgr/dashboard/PROMETHEUS_API_SSL_VERIFY       false                                          *
mgr                    advanced  mgr/dashboard/RGW_API_ACCESS_KEY              W8VEKVFDK1RH5IH2Q3GN                           *
mgr                    advanced  mgr/dashboard/RGW_API_SECRET_KEY              IkIjmjfh3bMLrPOlAFbMfpigSIALAQoKGEHzZgxv       *
mgr                    advanced  mgr/dashboard/camdatadash/server_addr         10.251.133.161                                 *
mgr                    advanced  mgr/dashboard/camdatadash/ssl_server_port     8443                                           *
mgr                    advanced  mgr/dashboard/cd133-ceph-mon-01/server_addr                                                  *
mgr                    advanced  mgr/dashboard/dasboard/server_port            80                                             *
mgr                    advanced  mgr/dashboard/dashboard/server_addr           10.251.133.161                                 *
mgr                    advanced  mgr/dashboard/dashboard/ssl_server_port       8443                                           *
mgr                    advanced  mgr/dashboard/server_addr                     0.0.0.0                                        *
mgr                    advanced  mgr/dashboard/server_port                     8080                                           *
mgr                    advanced  mgr/dashboard/ssl                             false                                          *
mgr                    advanced  mgr/dashboard/ssl_server_port                 8443                                           *
mgr                    advanced  mgr/orchestrator/orchestrator                 cephadm
mgr                    advanced  mgr/prometheus/server_addr                    0.0.0.0                                        *
mgr                    advanced  mgr/telemetry/channel_ident                   true                                           *
mgr                    advanced  mgr/telemetry/contact                         hf@xxxxx                                       *
mgr                    advanced  mgr/telemetry/description                     ceph cluster                                   *
mgr                    advanced  mgr/telemetry/enabled                         true                                           *
mgr                    advanced  mgr/telemetry/last_opt_revision               3                                              *
osd                    dev       bluestore_cache_autotune                      false
osd         class:ssd  dev       bluestore_cache_autotune                      false
osd                    dev       bluestore_cache_size                          4000000000
osd         class:ssd  dev       bluestore_cache_size                          4000000000
osd                    dev       bluestore_cache_size_hdd                      4000000000
osd                    dev       bluestore_cache_size_ssd                      4000000000
osd         class:ssd  dev       bluestore_cache_size_ssd                      4000000000
osd                    advanced  bluestore_default_buffered_write              true
osd         class:ssd  advanced  bluestore_default_buffered_write              true
osd                    advanced  osd_max_backfills                             1
osd         class:ssd  dev       osd_memory_cache_min                          4000000000
osd         class:hdd  basic     osd_memory_target                             6000000000
osd         class:ssd  basic     osd_memory_target                             6000000000
osd                    advanced  osd_recovery_max_active                       3
osd                    advanced  osd_recovery_max_single_start                 1
osd                    advanced  osd_recovery_sleep                            0.000000
client.rgw.ceph-rgw.cd133-ceph-rgw-01.klvrwk  basic  rgw_frontends  beast port=8000  *
client.rgw.ceph-rgw.cd133-ceph-rgw-01.ptmqcm  basic  rgw_frontends  beast port=8001  *
client.rgw.ceph-rgw.cd88-ceph-rgw-01.czajah   basic  rgw_frontends  beast port=8000  *
client.rgw.ceph-rgw.cd88-ceph-rgw-01.pdknfg   basic  rgw_frontends  beast port=8000  *
client.rgw.ceph-rgw.cd88-ceph-rgw-01.qkdlfl   basic  rgw_frontends  beast port=8001  *
client.rgw.ceph-rgw.cd88-ceph-rgw-01.tdsxpb   basic  rgw_frontends  beast port=8001  *
client.rgw.ceph-rgw.cd88-ceph-rgw-01.xnadfr   basic  rgw_frontends  beast port=8001  *
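For completeness, the non-default values above were applied with "ceph config set"; a few examples, reconstructed from memory with the values shown in the dump:

# examples only; values taken from the config dump above
ceph config set global rbd_cache_size 268435456
ceph config set osd/class:hdd osd_memory_target 6000000000
ceph config set osd bluestore_default_buffered_write true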
Can somebody explain what I am doing wrong, or what I can do to get better performance out of ceph-iscsi? No matter what I do or tweak, the write performance does not get better. I have already experimented with gwcli, the iSCSI queue settings and other options. Currently I have set:

hw_max_sectors 8192
max_data_area_mb 32
cmdsn_depth 64 (the ESXi nodes are already fixed at a maximum of 64 iSCSI commands)

Everything else looks fine: multipathing is working and recovery is fast... but iSCSI is very slow and I don't know why. Can somebody help me?

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx