CEPH iSCSI issue - ESXi command timeout

Golasowski Martin <martin.golasowski@xxxxxx> · Thu, 1 Oct 2020 12:44:44 +0000

Dear All,

a week ago we had to reboot our ESXi nodes since our CEPH cluster suddenly stopped serving all I/O. We identified a VM (the vCenter appliance) which was swapping heavily and causing heavy load. However, since then we have been experiencing strange issues, as if the cluster cannot handle any spike in I/O load, such as a migration or a VM reboot.

The main problem is that iSCSI commands issued by ESXi sometimes time out and ESXi reports the datastore as inaccessible. This disrupts I/O heavily; we have had to reboot the entire VMware cluster several times. It started suddenly after approximately 10 months of operation without problems.

I can see a steadily increasing number of dropped Rx packets on the iSCSI network interfaces of the OSD nodes.
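
To put a number on how fast the drops accumulate, the counter can be sampled twice and turned into a rate. The following is a minimal sketch, assuming the Linux sysfs counter at /sys/class/net/<iface>/statistics/rx_dropped; the function names and the interface name in the usage note are placeholders, not part of our setup.

```python
from pathlib import Path


def read_rx_dropped(iface, sysfs_root="/sys/class/net"):
    """Return the cumulative rx_dropped counter for one interface."""
    path = Path(sysfs_root) / iface / "statistics" / "rx_dropped"
    return int(path.read_text().strip())


def drop_rate(before, after, interval_s):
    """Dropped packets per second between two counter samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (after - before) / interval_s
```

For example, calling read_rx_dropped("ens1f0") (hypothetical interface name), sleeping 60 seconds, calling it again and passing both values to drop_rate() gives drops per second, which can then be compared between idle periods and periods with migrations or VM reboots.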

Our CEPH setup is as follows: 4 OSD nodes, each with three 10 TB 7.2k rpm HDDs. The OSD nodes are connected to the other nodes by 25 Gbps Ethernet. For the RBD pools I have 64 PGs. Each OSD node has 32 GB of RAM, of which around 1 GB is free, though I have seen even less. The OS is CentOS 7 and the CEPH release is Nautilus 14.2.11, deployed by ceph-ansible. The MONs are virtualized on the ESXi nodes, on local SSD drives.

The iSCSI NICs are on a separate VLAN; other traffic is served via a bond with balance-xor (LACP is unusable due to a VMware limitation when using the SW iSCSI HBA) in a different VLAN. Our network is Mellanox based: SN2100 switches and ConnectX-5 NICs.

The iSCSI target serves 2 LUNs in an erasure-coded RBD pool. Yesterday I increased the number of PGs for that pool from 64 to 128, without much effect after the cluster finished rebalancing.

In the OSD servers' kernel log we see the following:

[299560.618893] iSCSI Login negotiation failed.
[303088.450088] Did not receive response to NOPIN on CID: 0, failing connection for I_T Nexus iqn.1994-05.com.redhat:esxi1,i,0x00023d000002,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,t,0x01
[324926.694077] Did not receive response to NOPIN on CID: 0, failing connection for I_T Nexus iqn.1994-05.com.redhat:esxi2,i,0x00023d000001,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,t,0x01
[407067.404538] ABORT_TASK: Found referenced iSCSI task_tag: 5891
[407076.077175] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 5891
[411677.887690] ABORT_TASK: Found referenced iSCSI task_tag: 6722
[411683.297425] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 6722
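
The bracketed kernel timestamps are seconds since boot, so the gap between an ABORT_TASK "Found referenced" line and the matching TMR_FUNCTION_COMPLETE line shows how long the target took to finish aborting the task. A minimal sketch that pairs the lines by task tag, assuming the log format is exactly as shown above (the helper name abort_latencies is made up for illustration):

```python
import re

# Patterns match the two ABORT_TASK message forms quoted in the log above.
FOUND = re.compile(
    r"^\[\s*(\d+\.\d+)\] ABORT_TASK: Found referenced iSCSI task_tag: (\d+)"
)
DONE = re.compile(
    r"^\[\s*(\d+\.\d+)\] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: (\d+)"
)


def abort_latencies(lines):
    """Map each task tag to the seconds between 'Found' and 'COMPLETE'."""
    started = {}
    latencies = {}
    for line in lines:
        m = FOUND.match(line)
        if m:
            started[int(m.group(2))] = float(m.group(1))
            continue
        m = DONE.match(line)
        if m:
            tag = int(m.group(2))
            if tag in started:
                # pop() so that a reused tag starts a fresh measurement
                latencies[tag] = float(m.group(1)) - started.pop(tag)
    return latencies


SAMPLE = """\
[407067.404538] ABORT_TASK: Found referenced iSCSI task_tag: 5891
[407076.077175] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 5891
[411677.887690] ABORT_TASK: Found referenced iSCSI task_tag: 6722
[411683.297425] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 6722
""".splitlines()

for tag, secs in abort_latencies(SAMPLE).items():
    print(f"task_tag {tag}: abort completed after {secs:.3f} s")
```

For the four lines above this gives roughly 8.7 s for tag 5891 and 5.4 s for tag 6722, i.e. the aborted commands were stuck in the target for several seconds each.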

The error in ESXi looks like this:

naa.60014053b46fc760ff0470dbd7980263" on path "vmhba64:C1:T0:L0" Failed:
2020-10-01T05:38:51.291Z cpu49:2144076)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x89 (0x459a5b1b9480, 2097241) to dev "naa.6001405a527d78935724451aa5f53513" on path "vmhba64:C2:T0:L1" Failed:
2020-10-01T05:38:57.098Z cpu44:2099346)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x8a (0x45ba96710ec0, 2107403) to dev "naa.60014053b46fc760ff0470dbd7980263" on path "vmhba64:C1:T0:L0" Failed:
2020-10-01T05:38:57.122Z cpu71:2098965)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x89 (0x45ba9676aec0, 2146212) to dev "naa.60014053b46fc760ff0470dbd7980263" on path "vmhba64:C1:T0:L0" Failed:
2020-10-01T05:38:57.256Z cpu65:2098959)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x89 (0x459a4179d8c0, 2146269) to dev "naa.6001405a527d78935724451aa5f53513" on path "vmhba64:C2:T0:L1" Failed:
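
To see which commands and paths are affected, the NMP lines can be tallied by SCSI opcode and by device/path. This is only a sketch, assuming the vmkernel line format shown above; the opcode names come from the SCSI block command set (0x89 is COMPARE AND WRITE, which ESXi uses for ATS locking as far as I know, and 0x8a is WRITE(16)), and the helper name tally_failures is made up for illustration.

```python
import re
from collections import Counter

# SCSI opcodes relevant to the log lines above.
OPCODES = {0x88: "READ(16)", 0x89: "COMPARE AND WRITE", 0x8A: "WRITE(16)"}

NMP_LINE = re.compile(
    r'Cmd (0x[0-9a-fA-F]+) \(.*?\) to dev "([^"]+)" on path "([^"]+)" Failed'
)


def tally_failures(lines):
    """Count failed commands per opcode name and per (device, path)."""
    by_opcode = Counter()
    by_path = Counter()
    for line in lines:
        m = NMP_LINE.search(line)
        if not m:
            continue  # truncated or unrelated lines are skipped
        opcode = int(m.group(1), 16)
        by_opcode[OPCODES.get(opcode, m.group(1))] += 1
        by_path[(m.group(2), m.group(3))] += 1
    return by_opcode, by_path


SAMPLE = '''\
2020-10-01T05:38:51.291Z cpu49:2144076)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x89 (0x459a5b1b9480, 2097241) to dev "naa.6001405a527d78935724451aa5f53513" on path "vmhba64:C2:T0:L1" Failed:
2020-10-01T05:38:57.098Z cpu44:2099346)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x8a (0x45ba96710ec0, 2107403) to dev "naa.60014053b46fc760ff0470dbd7980263" on path "vmhba64:C1:T0:L0" Failed:
2020-10-01T05:38:57.122Z cpu71:2098965)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x89 (0x45ba9676aec0, 2146212) to dev "naa.60014053b46fc760ff0470dbd7980263" on path "vmhba64:C1:T0:L0" Failed:
2020-10-01T05:38:57.256Z cpu65:2098959)NMP: nmp_ThrottleLogForDevice:3856: Cmd 0x89 (0x459a4179d8c0, 2146269) to dev "naa.6001405a527d78935724451aa5f53513" on path "vmhba64:C2:T0:L1" Failed:
'''.splitlines()

by_opcode, by_path = tally_failures(SAMPLE)
for name, count in by_opcode.most_common():
    print(f"{name}: {count}")
for (dev, path), count in by_path.most_common():
    print(f"{dev} via {path}: {count}")
```

In the excerpt above, three of the four failed commands are COMPARE AND WRITE and one is WRITE(16), split evenly between the two LUNs.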

We would appreciate any help you can give us.

Thank you very much.

Regards,
Martin Golasowski
