Re: ceph + vmware

Hi Jake,

thank you very much, both were needed: fixing the MTU and deactivating VAAI
(I hope that won't interfere with vMotion or other features).
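
For reference, the three VAAI primitives can be toggled per host with the
standard ESXi advanced settings (set the values back to 1 to re-enable):

# XCOPY / full copy offload
esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedMove
# WRITE_SAME / block zeroing
esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedInit
# ATS / hardware assisted locking
esxcli system settings advanced set --int-value 0 --option /VMFS3/HardwareAcceleratedLocking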

I have now changed the MTU on the vmkernel port and on the vSwitch. That solved this problem.

After that I could create an ext4 filesystem and mount it.
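
For the record, such an MTU mismatch is easy to confirm with large,
unfragmented pings in both directions (addresses are placeholders):

# from the ESXi host: 8972 = 9000 bytes minus IP/ICMP headers, -d = do not fragment
vmkping -d -s 8972 <iscsi-gateway-ip>
# from the iSCSI gateway back to the ESXi vmkernel port
ping -M do -s 8972 <esxi-vmkernel-ip>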

Running

dd if=/dev/zero of=/mnt/8G_test bs=4k count=2M conv=fdatasync
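
For a second data point, something like this fio run against the same mount
should show what the link can do with bigger requests and a deeper queue
(file name and parameters are just a starting point):

fio --name=seqwrite --filename=/mnt/fio_test --size=8G --rw=write --bs=4M \
    --ioengine=libaio --iodepth=32 --direct=1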

Something is strange to me:

The network shows a steady 1 Gbit/s of iSCSI traffic (the maximum of the link).

But inside the VM I can only see 40-50 MB/s.

I know the replication size is 2, so it would be easy to say half of
1 Gbit/s = 500 Mbit/s, roughly 60 MB/s, which is close to the 40-50 MB/s I see.

But shouldn't that reduction happen inside the Ceph cluster, which runs
on a 10G network?

I mean, the data hits the Ceph iSCSI server at 1 Gbit/s. From there tgt
hands it to RBD internally, and the write is duplicated (replication 2)
over the 10G cluster network before the ACK is sent back to iSCSI. Since
the cluster duplicates the data internally over 10G, I would expect the
bandwidth inside the VM to be higher than half of the maximum speed.

Is my understanding of the mechanism wrong?


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


On 16.07.2016 at 02:18, Jake Young wrote:
> I had some odd issues like that due to MTU mismatch. 
> 
> Keep in mind that the vSwitch and vmkernel port have independent MTU
> settings.  Verify you can ping with large size packets without
> fragmentation between your host and iscsi target. 
> 
> If that's not it, you can try to disable VAAI options to see if one of
> them is causing issues. I haven't used ESXi 6.0 yet. 
> 
> Jake
> 
> 
> On Friday, July 15, 2016, Oliver Dzombic <info@xxxxxxxxxxxxxxxxx> wrote:
> 
>     Hi,
> 
>     I am currently trying this out.
> 
>     My tgt config:
> 
>     # cat tgtd.conf
>     # The default config file
>     include /etc/tgt/targets.conf
> 
>     # Config files from other packages etc.
>     include /etc/tgt/conf.d/*.conf
> 
>     nr_iothreads=128
> 
> 
>     -----
> 
>     # cat iqn.2016-07.tgt.esxi-test.conf
>     <target iqn.2016-07.tgt.esxi-test>
>       initiator-address ALL
>       scsi_sn esxi-test
>       #vendor_id CEPH
>       #controller_tid 1
>       write-cache on
>       read-cache on
>       driver iscsi
>       bs-type rbd
>       <backing-store vmware1/esxi-test>
>       lun 1
>       scsi_id cf10000c4a71e700506357
>       </backing-store>
>       </target>
> 
> 
>     --------------
> 
> 
>     If I create a VM inside ESXi 6 and try to format the virtual HDD, I see
>     this in the logs:
> 
>     sd:2:0:0:0: [sda] CDB:
>     Write(10): 2a 00 0f 86 a8 80 00 01 40 00
>     mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff880068aa5e00)
>     mptscsih: ioc0: attempting task abort! ( sc=ffff880068aa4a80)
> 
>     That happens with the LSI HDD emulation. With the VMware paravirtual
>     SCSI controller everything just freezes.
> 
>     Any idea about this issue?
> 
>     --
>     Mit freundlichen Gruessen / Best regards
> 
>     Oliver Dzombic
>     IP-Interactive
> 
>     mailto:info@xxxxxxxxxxxxxxxxx
> 
>     Anschrift:
> 
>     IP Interactive UG ( haftungsbeschraenkt )
>     Zum Sonnenberg 1-3
>     63571 Gelnhausen
> 
>     HRB 93402 beim Amtsgericht Hanau
>     Geschäftsführung: Oliver Dzombic
> 
>     Steuer Nr.: 35 236 3622 1
>     UST ID: DE274086107
> 
> 
>     On 11.07.2016 at 22:24, Jake Young wrote:
>     > I'm using this setup with ESXi 5.1 and I get very good performance.  I
>     > suspect you have other issues.  Reliability is another story (see
>     > Nick's posts on tgt and HA to get an idea of the awful problems you
>     > can have), but for my test labs the risk is acceptable.
>     >
>     >
>     > One change I found helpful is to run tgtd with 128 threads.  I'm
>     > running Ubuntu 14.04, so I edited my /etc/init/tgt.conf file and
>     > changed the line that read:
>     >
>     > exec tgtd
>     >
>     > to
>     >
>     > exec tgtd --nr_iothreads=128
>     >
>     >
>     > If you're not concerned with reliability, you can enhance throughput
>     > even more by enabling rbd client write-back cache in your tgt VM's
>     > ceph.conf file (you'll need to restart tgtd for this to take effect):
>     >
>     > [client]
>     > rbd_cache = true
>     > rbd_cache_size = 67108864 # (64MB)
>     > rbd_cache_max_dirty = 50331648 # (48MB)
>     > rbd_cache_target_dirty = 33554432 # (32MB)
>     > rbd_cache_max_dirty_age = 2
>     > rbd_cache_writethrough_until_flush = false
>     >
>     >
>     >
>     >
>     > Here's a sample targets.conf:
>     >
>     >   <target iqn.2014-04.tgt.Charter>
>     >   initiator-address ALL
>     >   scsi_sn Charter
>     >   #vendor_id CEPH
>     >   #controller_tid 1
>     >   write-cache on
>     >   read-cache on
>     >   driver iscsi
>     >   bs-type rbd
>     >   <backing-store charter/vmguest>
>     >   lun 5
>     >   scsi_id cfe1000c4a71e700506357
>     >   </backing-store>
>     >   <backing-store charter/voting>
>     >   lun 6
>     >   scsi_id cfe1000c4a71e700507157
>     >   </backing-store>
>     >   <backing-store charter/oradata>
>     >   lun 7
>     >   scsi_id cfe1000c4a71e70050da7a
>     >   </backing-store>
>     >   <backing-store charter/oraback>
>     >   lun 8
>     >   scsi_id cfe1000c4a71e70050bac0
>     >   </backing-store>
>     >   </target>
>     >
>     >
>     >
>     > I don't have FIO numbers handy, but I have some oracle calibrate io
>     > output.
>     >
>     > We're running Oracle RAC database servers in linux VMs on ESXi 5.1,
>     > which use iSCSI to connect to the tgt service.  I only have a single
>     > connection setup in ESXi for each LUN.  I tested using multipathing
>     > and two tgt VMs presenting identical LUNs/RBD disks, but found that
>     > there wasn't a significant performance gain by doing this, even with
>     > round-robin path selecting in VMware.
>     >
>     >
>     > These tests were run from two RAC VMs, each on a different host, with
>     > both hosts connected to the same tgt instance.  The way we have oracle
>     > configured, it would have been using two of the LUNs heavily during
>     > this calibrate IO test.
>     >
>     >
>     > This output is with 128 threads in tgtd and rbd client cache enabled:
>     >
>     > START_TIME            END_TIME               MAX_IOPS   MAX_MBPS   MAX_PMBPS   LATENCY   DISKS
>     > -------------------   -------------------   --------   --------   ---------   -------   -----
>     > 28-JUN-016 15:10:50   28-JUN-016 15:20:04      14153        658         412        14      75
>     >
>     >
>     > This output is with the same configuration, but with rbd client cache
>     > disabled:
>     >
>     > START_TIME            END_TIME               MAX_IOPS   MAX_MBPS   MAX_PMBPS   LATENCY   DISKS
>     > -------------------   -------------------   --------   --------   ---------   -------   -----
>     > 28-JUN-016 22:44:29   28-JUN-016 22:49:05       7449        161         219        20      75
>     >
>     > This output is from a directly connected EMC VNX5100 FC SAN with 25
>     > disks using dual 8Gb FC links on a different lab system:
>     >
>     > START_TIME            END_TIME               MAX_IOPS   MAX_MBPS   MAX_PMBPS   LATENCY   DISKS
>     > -------------------   -------------------   --------   --------   ---------   -------   -----
>     > 28-JUN-016 22:11:25   28-JUN-016 22:18:48       6487        299         224        19      75
>     >
>     >
>     > One of our goals for our Ceph cluster is to replace the EMC SANs.
>     > We've accomplished this performance-wise; the next step is to get a
>     > plausible iSCSI HA solution working.  I'm very interested in what
>     > Mike Christie is putting together.  I'm in the process of vetting
>     > the SUSE solution now.
>     >
>     > BTW - The tests were run when we had 75 OSDs, which are all 7200RPM
>     > 2TB HDs, across 9 OSD hosts.  We have no SSD journals; instead we
>     > have all the disks set up as single-disk RAID1 disk groups with WB
>     > cache with BBU.  All OSD hosts have 40Gb networking and the ESXi
>     > hosts have 10G.
>     >
>     > Jake
>     >
>     >
>     > On Mon, Jul 11, 2016 at 12:06 PM, Oliver Dzombic
>     > <info@xxxxxxxxxxxxxxxxx> wrote:
>     >
>     >     Hi Mike,
>     >
>     >     i was trying:
>     >
>     >     https://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/
>     >
>     >     ONE target, from different OSD servers directly, to multiple
>     >     VMware ESXi servers.
>     >
>     >     A config looked like:
>     >
>     >     #cat iqn.ceph-cluster_netzlaboranten-storage.conf
>     >
>     >     <target iqn.ceph-cluster:vmware-storage>
>     >     driver iscsi
>     >     bs-type rbd
>     >     backing-store rbd/vmware-storage
>     >     initiator-address 10.0.0.9
>     >     initiator-address 10.0.0.10
>     >     incominguser vmwaren-storage RPb18P0xAqkAw4M1
>     >     </target>
>     >
>     >
>     >     We had 4 OSD servers, and each of them had this config running.
>     >     We had 2 VMware (ESXi) servers.
>     >
>     >     So we had 4 paths to this vmware-storage RBD object.
>     >
>     >     In the end, VMware saw 8 paths: 4 paths directly connected to
>     >     the specific VMware server, plus 4 paths that this VMware server
>     >     saw via the other VMware server.
>     >
>     >     There were very big performance problems; I am talking about
>     >     < 10 MB/s. The customer was not able to use it, so good old NFS
>     >     is serving instead.
>     >
>     >     At that time we used Ceph Hammer, and I think the customer was
>     >     using ESXi 5.5, or maybe ESXi 6; the testing was sometime last year.
>     >
>     >     --------------------
>     >
>     >     We will now make a new attempt with Ceph Jewel and ESXi 6, and
>     >     this time we will manage the VMware servers.
>     >
>     >     As soon as the issue
>     >
>     >     "ceph mon Segmentation fault after set crush_ruleset ceph 10.2.2"
>     >
>     >     that I already mailed to the list is solved, we can start the
>     >     testing.
>     >
>     >
>     >     --
>     >     Mit freundlichen Gruessen / Best regards
>     >
>     >     Oliver Dzombic
>     >     IP-Interactive
>     >
>     >     mailto:info@xxxxxxxxxxxxxxxxx
>     >
>     >     Anschrift:
>     >
>     >     IP Interactive UG ( haftungsbeschraenkt )
>     >     Zum Sonnenberg 1-3
>     >     63571 Gelnhausen
>     >
>     >     HRB 93402 beim Amtsgericht Hanau
>     >     Geschäftsführung: Oliver Dzombic
>     >
>     >     Steuer Nr.: 35 236 3622 1
>     >     UST ID: DE274086107
>     >
>     >
>     >     On 11.07.2016 at 17:45, Mike Christie wrote:
>     >     > On 07/08/2016 02:22 PM, Oliver Dzombic wrote:
>     >     >> Hi,
>     >     >>
>     >     >> does anyone have experience with how to connect VMware to Ceph
>     >     >> in a smart way?
>     >     >>
>     >     >> iSCSI multipath did not really work well.
>     >     >
>     >     > Are you trying to export rbd images from multiple iscsi targets
>     >     > at the same time or just one target?
>     >     >
>     >     > For the HA/multiple target setup, I am working on this for
>     >     > Red Hat. We plan to release it in RHEL 7.3/RHCS 2.1. SUSE ships
>     >     > something already, as someone mentioned.
>     >     >
>     >     > We just got a large chunk of code in the upstream kernel (it is
>     >     > in the block layer maintainer's tree for the next kernel), so it
>     >     > should be simple to add COMPARE_AND_WRITE support now. We should
>     >     > be posting krbd exclusive lock support in the next couple of weeks.
>     >     >
>     >     >
>     >     >> NFS could work, but I think that is just too many layers in
>     >     >> between to get usable performance.
>     >     >>
>     >     >> Systems like ScaleIO have developed a VMware addon to talk
>     >     >> with it.
>     >     >>
>     >     >> Is there something similar out there for ceph ?
>     >     >>
>     >     >> What are you using ?
>     >     >>
>     >     >> Thank you !
>     >     >>
>     >     >
>     >     _______________________________________________
>     >     ceph-users mailing list
>     >     ceph-users@xxxxxxxxxxxxxx
>     >     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>     >
>     >
>     _______________________________________________
>     ceph-users mailing list
>     ceph-users@xxxxxxxxxxxxxx
>     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



