‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, February 14, 2020 4:49 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote:

> On 02/13/2020 09:56 AM, Salsa wrote:
>
> > I have a 3-host ceph storage setup with 10 4TB HDDs per host. I defined a
> > 3-replica rbd pool and some images and presented them to a VMware host via
> > iSCSI, but the write performance is so bad that I managed to freeze a VM
> > doing a big rsync to a datastore inside ceph and had to reboot its host
> > (it seems I filled up VMware's iSCSI queue).
> > Right now I'm getting write latencies from 20 ms to 80 ms (per OSD),
> > sometimes peaking at 600 ms (per OSD).
> > Client throughput is giving me around 4 MB/s.
>
> How are you testing client throughput? What tool and args?

Not testing. This is what my ceph grafana dashboard shows while I'm writing to the LUN (iSCSI LUN -> VMware -> datastore -> VM I/O).

> > Using a 4MB stripe 1 image I got 1.955.359 B/s inside the VM.
> > On a 1MB stripe 1 image I got 2.323.206 B/s inside the same VM.
>
> How are you getting the latency and throughput values for iscsi? Is it
> esxtop? Were you saying you filled up the vmware iscsi queue based on
> the esxtop queue values, and have you increased values like the ESX
> queue depth value like here:
>
> https://docs.vmware.com/en/VMware-vSphere/6.5/com.vmware.vsphere.troubleshooting.doc/GUID-0D774A67-F6AC-4D8A-9E5A-74140F036AD2.html
>

Latency and throughput are also from ceph grafana, but I checked the esxtop latency and it is almost the same. About filling the queue: I think I filled it because ESXi froze and I had to reboot the host (hardware). I haven't increased the queue depth, as I imagined it would only take longer to fill up and freeze ESXi again.

> Note: Sometimes people only increase iscsivmk_LunQDepth, but then forget
> to also increase iscsivmk_HostQDepth.
>
> What is your ceph-iscsi, tcmu-runner and kernel version on the target side?
>

tcmu-runner 1.5.2
Ceph 14.2.6
Linux ceph01 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

> Here are some general tweaks for the target side:
>
> 1. Increase the max_data_area_mb (this affects the LUN's max queueing)
> and the target side queue depth.
>
> gwcli
> ======
> cd /disks
> reconfigure rbd/<image_name> max_data_area_mb 128
>
> gwcli
> ======
> cd /iscsi-target/iqn.2003-01.com.redhat.iscsi-gw:ceph-igw
> reconfigure iqn.2003-01.com.redhat.iscsi-gw:ceph-igw cmdsn_depth 512
>
> 2. Are the VMs on the same iscsi LUN or different ones?
>
> If on the same LUN then increasing the max_data_area_mb value will help,
> because with smaller values we will get lots of qfulls and latency will
> be better since IO is not sitting in the target side queue waiting for
> memory. On the initiator side though it is still sometimes helpful for
> testing to disable the vmware SIOC and adaptive queueing features.
>
> 3. Did you check the basics, like that multipathing is set up correctly?
>
> esxcli storage nmp path list -d yourdevice
>
> shows all the expected paths?
>
> On the initiator and target side, did you check the logs for any errors
> going on when you run your test?
>
> 4. What test tool are you running and what args? If you just run a
> plain fio command like:
>
> fio --filename=some_new_file --bs=128K --size=5G --name=test
> --iodepth=128 --direct=1 --numjobs=8 --rw=read --ioengine=libaio
>
> what do you get?
>
> If you run almost the same fio command from the gateway machine or vm,
> but use --ioengine=rbd
>
> fio --bs=128K --size=5G --name=test --iodepth=128 --direct=1 --numjobs=8
> --rw=read --ioengine=rbd --rbdname=your_image --pool=your-pool_rbd
>
> how does it compare?

Will try these tweaks. Only one VM so far. Multipathing is correct, and I am not using test tools; this is real-world usage.
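For reference, if I do end up raising the ESXi queue depths later, I assume it would be roughly the following on the software iSCSI adapter (untested on my hosts; 128/256 are only example values, and the module parameters only take effect after a host reboot):

# set the per-LUN and per-adapter queue depths for the software iSCSI initiator
esxcli system module parameters set -m iscsi_vmk -p "iscsivmk_LunQDepth=128 iscsivmk_HostQDepth=256"
# confirm what is currently configured
esxcli system module parameters list -m iscsi_vmk | grep QDepth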
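I may also try a quick write benchmark directly against the cluster to separate RBD/OSD performance from the iSCSI path; something like the commands below (just my guess at a reasonable sanity check -- replace "rbd" with the actual pool name backing the images):

# 10-second write benchmark with default 4MB objects against the backing pool
rados bench -p rbd 10 write
# snapshot of per-OSD commit/apply latencies to compare with the grafana numbers
ceph osd perf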
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx