Re: ceph iscsi latency too high for esxi?

Hello,

in my personal opinion, HDDs are a technology from the last century, and I
would never ever think about using such old technology for modern
VM/container/... workloads. My time, as well as that of any employee, is too
precious to wait for a hard drive to find the requested data! Use EC on NVMe
if you need to save some money. It is still much faster, with lower latency,
than HDDs.
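
As a minimal sketch of what an EC-on-NVMe setup for RBD could look like (the
profile/pool/image names below are placeholders, a replicated "rbd" metadata
pool is assumed to exist, and k/m and PG counts need adjusting to your cluster):

# names and sizes below are placeholders, not a recommendation
ceph osd erasure-code-profile set ec-nvme k=4 m=2 crush-device-class=nvme
ceph osd pool create rbd-ec-data 64 64 erasure ec-nvme
ceph osd pool set rbd-ec-data allow_ec_overwrites true
rbd create --size 1T --data-pool rbd-ec-data rbd/vm-disk-01

Only the data objects land in the EC pool; the image metadata stays in the
replicated pool.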

As each HDD only adds around 100 IOPS and 20-30 MB/s to your cluster, you can
throw in 100 disks and still not come near the performance of a single SSD.
Yes, each disk will improve your performance, but by such a small amount that
it makes no sense in my eyes.
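
As a rough back-of-the-envelope calculation (the per-device figures are only
ballpark numbers): 100 HDDs x ~100 IOPS gives you about 10,000 random IOPS in
aggregate, while a single decent NVMe SSD is typically rated for well over
100,000 random IOPS, at a fraction of the latency.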

> Does that mean that occasional iSCSI path drop-outs are somewhat expected?
Not that I'm aware of, but I have no HDD-based iSCSI cluster at hand to
check. Sorry.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.verges@xxxxxxxx
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Sun, 4 Oct 2020 at 16:06, Golasowski Martin <martin.golasowski@xxxxxx> wrote:

> Thanks!
>
> Does that mean that occasional iSCSI path drop-outs are somewhat expected?
> We are using SSDs for WAL/DB on each OSD server, so at least that.
>
> Do you think that buying an additional 6/12 HDDs would help with the IOPS
> for the VMs?
>
> Regards,
> Martin
>
>
>
> On 4 Oct 2020, at 15:17, Martin Verges <martin.verges@xxxxxxxx> wrote:
>
> Hello,
>
> No, iSCSI + VMware works without such problems.
>
> We are on the latest Nautilus, 12 x 10 TB OSDs (4 servers), 25 Gbit/s
> Ethernet, an erasure-coded RBD pool with 128 PGs, around 200 PGs per OSD total.
>
> Nautilus is a good choice.
> 12 x 10 TB HDDs are not good for VMs.
> 25 Gbit/s on HDDs is way too much for that system.
> 200 PGs per OSD is too much; I would suggest 75-100 PGs per OSD.
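>
> As a rough sketch of how to check and, on Nautilus, reduce that (the pool
> name is a placeholder):
>
> ceph osd df tree                      # PGS column = placement groups per OSD
> ceph osd pool ls detail               # pg_num / pgp_num per pool
> ceph osd pool set <pool> pg_num 64    # Nautilus can shrink pg_num online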
>
> You can improve latency on HDD clusters by using an external DB/WAL on NVMe.
> That might help you.
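>
> For a freshly deployed OSD, that layout would look roughly like this with
> ceph-volume (the device names are placeholders; existing OSDs would have to
> be redeployed or migrated to pick it up):
>
> ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1p1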
>
> --
> Martin Verges
> Managing director
>
> Mobile: +49 174 9335695
> E-Mail: martin.verges@xxxxxxxx
> Chat: https://t.me/MartinVerges
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
>
> Web: https://croit.io
> YouTube: https://goo.gl/PGE1Bx
>
>
> On Sun, 4 Oct 2020 at 14:37, Golasowski Martin <martin.golasowski@xxxxxx> wrote:
>
>> Hi,
>> does anyone here use Ceph iSCSI with VMware ESXi? It seems that we are
>> hitting the 5-second timeout limit on the software HBA in ESXi. It happens
>> whenever there is increased load on the cluster, like a deep scrub or a
>> rebalance. Is this normal behaviour in production? Or is there something
>> special we need to tune?
>>
>> We are on the latest Nautilus, 12 x 10 TB OSDs (4 servers), 25 Gbit/s
>> Ethernet, an erasure-coded RBD pool with 128 PGs, around 200 PGs per OSD total.
>>
>>
>> ESXi Log:
>>
>> 2020-10-04T01:57:04.314Z cpu34:2098959)WARNING: iscsi_vmk:
>> iscsivmk_ConnReceiveAtomic:517: vmhba64:CH:1 T:0 CN:0: Failed to receive
>> data: Connection closed by peer
>> 2020-10-04T01:57:04.314Z cpu34:2098959)iscsi_vmk:
>> iscsivmk_ConnRxNotifyFailure:1235: vmhba64:CH:1 T:0 CN:0: Connection rx
>> notifying failure: Failed to Receive. State=Bound
>> 2020-10-04T01:57:04.566Z cpu19:2098979)WARNING: iscsi_vmk:
>> iscsivmk_StopConnection:741: vmhba64:CH:1 T:0 CN:0: iSCSI connection is
>> being marked "OFFLINE" (Event:4)
>> 2020-10-04T01:57:04.654Z cpu7:2097866)WARNING: VMW_SATP_ALUA:
>> satp_alua_issueCommandOnPath:788: Probe cmd 0xa3 failed for path
>> "vmhba64:C2:T0:L0" (0x5/0x20/0x0). Check if failover mode is still ALUA.
>>
>>
>> OSD Log:
>>
>> [303088.450088] Did not receive response to NOPIN on CID: 0, failing
>> connection for I_T Nexus
>> iqn.1994-05.com.redhat:esxi1,i,0x00023d000002,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,t,0x01
>> [324926.694077] Did not receive response to NOPIN on CID: 0, failing
>> connection for I_T Nexus
>> iqn.1994-05.com.redhat:esxi2,i,0x00023d000001,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,t,0x01
>> [407067.404538] ABORT_TASK: Found referenced iSCSI task_tag: 5891
>> [407076.077175] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
>> 5891
>> [411677.887690] ABORT_TASK: Found referenced iSCSI task_tag: 6722
>> [411683.297425] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
>> 6722
>> [481459.755876] ABORT_TASK: Found referenced iSCSI task_tag: 7930
>> [481460.787968] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag:
>> 7930
>>
>> Cheers,
>> Martin
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


