Re: ceph + vmware

Hi Nick,

Yeah, I understand the point and the message, I won't do it :-)

I was just asking myself recently: how do I test whether the cache is
enabled or not?

What I found requires a client to be connected to an rbd device, but we
don't have that.

Is there any way to ask the ceph server whether the cache is enabled or
not? It is disabled in our config. But the config's defaults for size and
min_size of newly created pools already differ from what ceph really
does, so I don't trust the config alone.
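
For the pool settings I can at least check what ceph really does:

ceph osd pool get rbd size
ceph osd pool get rbd min_size

For rbd_cache, what I found is querying a client's admin socket,
something like:

ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok config get rbd_cache

but that again needs a running client.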

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Address:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 at Amtsgericht Hanau
Management: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107


On 15.07.2016 at 09:32, Nick Fisk wrote:
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Oliver Dzombic
>> Sent: 12 July 2016 20:59
>> To: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re:  ceph + vmware
>>
>> Hi Jack,
>>
>> thank you!
>>
>> What does reliability have to do with rbd_cache = true?
>>
>> I mean aside from the fact that, if a host powers down, the in-flight data is lost.
> 
> Not reliability, but consistency. As you have touched on, the cache is in volatile memory, yet you have told tgt that your cache is non-volatile. Now if you have a crash/power outage etc., all the data in the cache will be lost. This will likely leave your RBD full of holes or out-of-date data.
> 
> If you plan to run HA then this is even more important, as you could do a write on one iSCSI target and read the data from another before the cache has flushed. Again corruption, especially if the initiator is doing round-robin over the paths.
> 
> Also, when you run HA there is the chance that TGT will fail over to the other node because of some timeout you normally wouldn't notice; this will also likely cause serious corruption.
> 
>>
>> Are there any special limitations/issues with rbd_cache = true and iSCSI tgt?
> 
> I just wouldn't do it. 
> 
> You can almost guarantee data corruption if you do. Once librbd gets persistent caching to SSD, this will probably be safe, and as long as you can present the cache device to both nodes (e.g. dual-path SAS), HA should be safe as well.
> 
>>
>> --
>> Mit freundlichen Gruessen / Best regards
>>
>> Oliver Dzombic
>> IP-Interactive
>>
>> mailto:info@xxxxxxxxxxxxxxxxx
>>
>> Address:
>>
>> IP Interactive UG ( haftungsbeschraenkt )
>> Zum Sonnenberg 1-3
>> 63571 Gelnhausen
>>
>> HRB 93402 at Amtsgericht Hanau
>> Management: Oliver Dzombic
>>
>> Tax no.: 35 236 3622 1
>> VAT ID: DE274086107
>>
>>
>> On 11.07.2016 at 22:24, Jake Young wrote:
>>> I'm using this setup with ESXi 5.1 and I get very good performance.  I
>>> suspect you have other issues.  Reliability is another story (see
>>> Nick's posts on tgt and HA to get an idea of the awful problems you
>>> can have), but for my test labs the risk is acceptable.
>>>
>>>
>>> One change I found helpful is to run tgtd with 128 threads.  I'm
>>> running Ubuntu 14.04, so I edited my /etc/init/tgt.conf file and
>>> changed the line that read:
>>>
>>> exec tgtd
>>>
>>> to
>>>
>>> exec tgtd --nr_iothreads=128
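>>>
>>> (After restarting tgtd you can confirm the option took effect with
>>> "ps -ef | grep tgtd" - the flag shows up in the command line.)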
>>>
>>>
>>> If you're not concerned with reliability, you can enhance throughput
>>> even more by enabling rbd client write-back cache in your tgt VM's
>>> ceph.conf file (you'll need to restart tgtd for this to take effect):
>>>
>>> [client]
>>> rbd_cache = true
>>> rbd_cache_size = 67108864 # (64MB)
>>> rbd_cache_max_dirty = 50331648 # (48MB)
>>> rbd_cache_target_dirty = 33554432 # (32MB)
>>> rbd_cache_max_dirty_age = 2
>>> rbd_cache_writethrough_until_flush = false
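>>>
>>> tgt runs as an Upstart job on Ubuntu 14.04, so assuming the stock job
>>> name, restarting it is just:
>>>
>>> sudo service tgt restart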
>>>
>>>
>>>
>>>
>>> Here's a sample targets.conf:
>>>
>>>   <target iqn.2014-04.tgt.Charter>
>>>   initiator-address ALL
>>>   scsi_sn Charter
>>>   #vendor_id CEPH
>>>   #controller_tid 1
>>>   write-cache on
>>>   read-cache on
>>>   driver iscsi
>>>   bs-type rbd
>>>   <backing-store charter/vmguest>
>>>   lun 5
>>>   scsi_id cfe1000c4a71e700506357
>>>   </backing-store>
>>>   <backing-store charter/voting>
>>>   lun 6
>>>   scsi_id cfe1000c4a71e700507157
>>>   </backing-store>
>>>   <backing-store charter/oradata>
>>>   lun 7
>>>   scsi_id cfe1000c4a71e70050da7a
>>>   </backing-store>
>>>   <backing-store charter/oraback>
>>>   lun 8
>>>   scsi_id cfe1000c4a71e70050bac0
>>>   </backing-store>
>>>   </target>
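>>>
>>> To pick up changes to targets.conf without fully restarting tgtd,
>>> tgt-admin (shipped with tgt) can re-apply it; something like this
>>> should work:
>>>
>>> tgt-admin --update ALL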
>>>
>>>
>>>
>>> I don't have FIO numbers handy, but I have some Oracle CALIBRATE_IO
>>> output.
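>>>
>>> Roughly, the calibration is kicked off like this (just a sketch; the
>>> 75 matches the DISKS column below, adjust to your setup):
>>>
>>> sqlplus / as sysdba <<'EOF'
>>> DECLARE
>>>   max_iops PLS_INTEGER;
>>>   max_mbps PLS_INTEGER;
>>>   lat      PLS_INTEGER;
>>> BEGIN
>>>   -- built-in I/O calibration, runs against all datafiles
>>>   DBMS_RESOURCE_MANAGER.CALIBRATE_IO(
>>>     num_physical_disks => 75,
>>>     max_latency        => 20,
>>>     max_iops           => max_iops,
>>>     max_mbps           => max_mbps,
>>>     actual_latency     => lat);
>>> END;
>>> /
>>> -- results are recorded here; that's where the rows below come from
>>> SELECT * FROM dba_rsrc_io_calibrate;
>>> EOF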
>>>
>>> We're running Oracle RAC database servers in linux VMs on ESXi 5.1,
>>> which use iSCSI to connect to the tgt service.  I only have a single
>>> connection setup in ESXi for each LUN.  I tested using multipathing
>>> and two tgt VMs presenting identical LUNs/RBD disks, but found that
>>> there wasn't a significant performance gain by doing this, even with
>>> round-robin path selection in VMware.
>>>
>>>
>>> These tests were run from two RAC VMs, each on a different host, with
>>> both hosts connected to the same tgt instance.  The way we have oracle
>>> configured, it would have been using two of the LUNs heavily during
>>> this calibrate IO test.
>>>
>>>
>>> This output is with 128 threads in tgtd and rbd client cache enabled:
>>>
>>> START_TIME           END_TIME               MAX_IOPS   MAX_MBPS  MAX_PMBPS    LATENCY      DISKS
>>> -------------------- -------------------- ---------- ---------- ---------- ---------- ----------
>>> 28-JUN-016 15:10:50  28-JUN-016 15:20:04       14153        658        412         14         75
>>>
>>>
>>> This output is with the same configuration, but with rbd client cache
>>> disabled:
>>>
>>> START_TIME           END_TIME               MAX_IOPS   MAX_MBPS  MAX_PMBPS    LATENCY      DISKS
>>> -------------------- -------------------- ---------- ---------- ---------- ---------- ----------
>>> 28-JUN-016 22:44:29  28-JUN-016 22:49:05        7449        161        219         20         75
>>>
>>> This output is from a directly connected EMC VNX5100 FC SAN with 25
>>> disks using dual 8Gb FC links on a different lab system:
>>>
>>> START_TIME           END_TIME               MAX_IOPS   MAX_MBPS  MAX_PMBPS    LATENCY      DISKS
>>> -------------------- -------------------- ---------- ---------- ---------- ---------- ----------
>>> 28-JUN-016 22:11:25  28-JUN-016 22:18:48        6487        299        224         19         75
>>>
>>>
>>> One of our goals for our Ceph cluster is to replace the EMC SANs.
>>> We've accomplished this performance-wise; the next step is to get a
>>> plausible iSCSI HA solution working.  I'm very interested in what Mike
>>> Christie is putting together.  I'm in the process of vetting the SUSE solution now.
>>>
>>> BTW - the tests were run when we had 75 OSDs (all 7200 RPM 2TB HDs)
>>> across 9 OSD hosts. We have no SSD journals; instead all the disks are
>>> set up as single-disk RAID1 disk groups with WB cache backed by a BBU.
>>> All OSD hosts have 40Gb networking and the ESXi hosts have 10Gb.
>>>
>>> Jake
>>>
>>>
>>> On Mon, Jul 11, 2016 at 12:06 PM, Oliver Dzombic
>>> <info@xxxxxxxxxxxxxxxxx> wrote:
>>>
>>>     Hi Mike,
>>>
>>>     i was trying:
>>>
>>>     https://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/
>>>
>>>     ONE target, exported directly from different OSD servers, to
>>>     multiple VMware ESXi servers.
>>>
>>>     A config looked like:
>>>
>>>     #cat iqn.ceph-cluster_netzlaboranten-storage.conf
>>>
>>>     <target iqn.ceph-cluster:vmware-storage>
>>>     driver iscsi
>>>     bs-type rbd
>>>     backing-store rbd/vmware-storage
>>>     initiator-address 10.0.0.9
>>>     initiator-address 10.0.0.10
>>>     incominguser vmwaren-storage RPb18P0xAqkAw4M1
>>>     </target>
>>>
>>>
>>>     We had 4 OSD servers; each of them had this config running.
>>>     We had 2 vmware servers ( esxi ).
>>>
>>>     So we had 4 paths to this vmware-storage RBD object.
>>>
>>>     In the end, each VMware server saw 8 paths: 4 paths directly
>>>     connected to that specific vmware server, plus 4 paths that it
>>>     saw via the other vmware server.
>>>
>>>     There were very big performance problems - I am talking about
>>>     < 10 MB/s - so the customer was not able to use it, and good old
>>>     NFS is serving instead.
>>>
>>>     At that time we used ceph hammer, and I think the customer was
>>>     using esxi 5.5, or maybe esxi 6; the testing was sometime last year.
>>>
>>>     --------------------
>>>
>>>     We will make a new attempt now with ceph jewel and esxi 6 and this time
>>>     we will manage the vmware servers.
>>>
>>>     As soon as this issue
>>>
>>>     "ceph mon Segmentation fault after set crush_ruleset ceph 10.2.2"
>>>
>>>     which I already mailed to the list, is solved, we can start the
>>>     testing.
>>>
>>>
>>>     --
>>>     Mit freundlichen Gruessen / Best regards
>>>
>>>     Oliver Dzombic
>>>     IP-Interactive
>>>
>>>     mailto:info@xxxxxxxxxxxxxxxxx
>>>
>>>     Address:
>>>
>>>     IP Interactive UG ( haftungsbeschraenkt )
>>>     Zum Sonnenberg 1-3
>>>     63571 Gelnhausen
>>>
>>>     HRB 93402 at Amtsgericht Hanau
>>>     Management: Oliver Dzombic
>>>
>>>     Tax no.: 35 236 3622 1
>>>     VAT ID: DE274086107
>>>
>>>
>>>     On 11.07.2016 at 17:45, Mike Christie wrote:
>>>     > On 07/08/2016 02:22 PM, Oliver Dzombic wrote:
>>>     >> Hi,
>>>     >>
>>>     >> does anyone have experience with a smart way to connect vmware with ceph?
>>>     >>
>>>     >> iSCSI multipath did not really work well.
>>>     >
>>>     > Are you trying to export rbd images from multiple iscsi targets at the
>>>     > same time or just one target?
>>>     >
>>>     > For the HA/multiple target setup, I am working on this for Red Hat. We
>>>     > plan to release it in RHEL 7.3/RHCS 2.1. SUSE ships something
>>>     already as
>>>     > someone mentioned.
>>>     >
>>>     > We just got a large chunk of code in the upstream kernel (it is in the
>>>     > block layer maintainer's tree for the next kernel) so it should be
>>>     > simple to add COMPARE_AND_WRITE support now. We should be posting krbd
>>>     > exclusive lock support in the next couple weeks.
>>>     >
>>>     >
>>>     >> NFS could be an option, but I think that's just too many layers
>>>     >> in between to get usable performance.
>>>     >>
>>>     >> Systems like ScaleIO have developed a vmware addon to talk to them.
>>>     >>
>>>     >> Is there something similar out there for ceph ?
>>>     >>
>>>     >> What are you using ?
>>>     >>
>>>     >> Thank you !
>>>     >>
>>>     >
>>>     _______________________________________________
>>>     ceph-users mailing list
>>>     ceph-users@xxxxxxxxxxxxxx
>>>     http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



