Hello, no, iSCSI + VMware does not work without such problems.

> We are on the latest Nautilus, 12 x 10 TB OSDs (4 servers), 25 Gbit/s
> Ethernet, erasure-coded RBD pool with 128 PGs, around 200 PGs per OSD
> total.

Nautilus is a good choice.
12 x 10 TB HDD is not good for VMs.
25 Gbit/s on HDD is way too much for that system.
200 PGs per OSD is too much; I would suggest 75-100 PGs per OSD.

You can improve latency on HDD clusters by using an external DB/WAL on
NVMe. That might help you.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.verges@xxxxxxxx
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx

On Sun, 4 Oct 2020 at 14:37, Golasowski Martin <martin.golasowski@xxxxxx> wrote:

> Hi,
> does anyone here use Ceph iSCSI with VMware ESXi? It seems that we are
> hitting the 5 second timeout limit on the software HBA in ESXi. It
> appears whenever there is increased load on the cluster, like a deep
> scrub or rebalance. Is it normal behaviour in production? Or is there
> something special we need to tune?
>
> We are on the latest Nautilus, 12 x 10 TB OSDs (4 servers), 25 Gbit/s
> Ethernet, erasure-coded RBD pool with 128 PGs, around 200 PGs per OSD
> total.
>
> ESXi log:
>
> 2020-10-04T01:57:04.314Z cpu34:2098959)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic:517: vmhba64:CH:1 T:0 CN:0: Failed to receive data: Connection closed by peer
> 2020-10-04T01:57:04.314Z cpu34:2098959)iscsi_vmk: iscsivmk_ConnRxNotifyFailure:1235: vmhba64:CH:1 T:0 CN:0: Connection rx notifying failure: Failed to Receive. State=Bound
> 2020-10-04T01:57:04.566Z cpu19:2098979)WARNING: iscsi_vmk: iscsivmk_StopConnection:741: vmhba64:CH:1 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
> 2020-10-04T01:57:04.654Z cpu7:2097866)WARNING: VMW_SATP_ALUA: satp_alua_issueCommandOnPath:788: Probe cmd 0xa3 failed for path "vmhba64:C2:T0:L0" (0x5/0x20/0x0). Check if failover mode is still ALUA.
>
> OSD log:
>
> [303088.450088] Did not receive response to NOPIN on CID: 0, failing connection for I_T Nexus iqn.1994-05.com.redhat:esxi1,i,0x00023d000002,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,t,0x01
> [324926.694077] Did not receive response to NOPIN on CID: 0, failing connection for I_T Nexus iqn.1994-05.com.redhat:esxi2,i,0x00023d000001,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,t,0x01
> [407067.404538] ABORT_TASK: Found referenced iSCSI task_tag: 5891
> [407076.077175] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 5891
> [411677.887690] ABORT_TASK: Found referenced iSCSI task_tag: 6722
> [411683.297425] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 6722
> [481459.755876] ABORT_TASK: Found referenced iSCSI task_tag: 7930
> [481460.787968] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 7930
>
> Cheers,
> Martin
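
For reference, the per-OSD PG count Martin mentions shows up in the PGS
column of "ceph osd df", and since Nautilus pg_num can also be lowered in
place (PG merging). A minimal sketch; "rbd-ec" is a placeholder pool name:

    ceph osd df                          # PGS column = PGs per OSD
    ceph osd pool get rbd-ec pg_num      # current PG count of the pool
    ceph osd pool set rbd-ec pg_num 64   # shrink; Nautilus merges PGs
    ceph osd pool set rbd-ec pg_autoscale_mode on   # or let the autoscaler decide

Note that changing pg_num moves data around, so it is best done outside
peak hours.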
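
Moving the BlueStore DB/WAL to NVMe, as suggested above, is set up with
ceph-volume at OSD creation time; an existing OSD can be given a separate
DB device with ceph-bluestore-tool while that OSD is stopped. Device names
and the OSD path below are placeholders:

    # new OSD with its DB/WAL on NVMe
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1

    # retrofit a separate DB onto an existing, stopped OSD
    ceph-bluestore-tool bluefs-bdev-new-db \
        --path /var/lib/ceph/osd/ceph-0 --dev-target /dev/nvme0n1p2

A DB partition of roughly 4% of the data device (about 400 GB for a 10 TB
HDD) is the usual starting point.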
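
On the ESXi side, one knob the Ceph iSCSI gateway documentation calls out
for exactly this symptom is the software iSCSI adapter's RecoveryTimeout,
which it recommends raising to 25 seconds so that short gateway stalls
(deep scrub, recovery) fail over via ALUA instead of tearing down the
session as in the log above. A sketch, assuming the adapter is vmhba64 as
shown there:

    esxcli iscsi adapter param set -A vmhba64 -k RecoveryTimeout -v 25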