Re: ceph + vmware

Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> · Fri, 22 Jul 2016 10:10:28 +0200



    On 22/07/2016 09:47, Nick Fisk wrote:

    
              From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Frédéric Nass
              Sent: 22 July 2016 08:11
              To: Jake Young <jak3kaj@xxxxxxxxx>; Jan Schermer <jan@xxxxxxxxxxx>
              Cc: ceph-users@xxxxxxxxxxxxxx
              Subject: Re: ceph + vmware
            
          
            On 20/07/2016 21:20, Jake Young wrote:
          
          
              On Wednesday, July 20, 2016, Jan Schermer <jan@xxxxxxxxxxx>
              wrote:
            
              
                > On 20 Jul 2016, at 18:38, Mike Christie <mchristi@xxxxxxxxxx> wrote:
                >
                > On 07/20/2016 03:50 AM, Frédéric Nass wrote:
                >>
                >> Hi Mike,
                >>
                >> Thanks for the update on the RHCS iSCSI target.
                >>
                >> Will the RHCS 2.1 iSCSI target be compatible with VMware ESXi
                >> clients? (Or is it too early to say / announce?)
                >
                > No HA support for sure. We are looking into non HA support though.

                >

                >>

                >> Knowing that an HA iSCSI target was on the roadmap, we chose iSCSI over NFS,
                >> so we'll just have to remap RBDs to RHCS targets when it's available.
                >>
                >> So we're currently running:
                >>
                >> - 2 LIO iSCSI targets exporting the same RBD images. Each iSCSI target
                >>   has all VAAI primitives enabled and runs the same configuration.
                >> - RBD images are mapped on each target using the kernel client (so no
                >>   RBD cache).
                >> - 6 ESXi hosts. Each ESXi can access the same LUNs through both targets,
                >>   but in a failover manner, so that each ESXi always accesses a given LUN
                >>   through one target at a time.
                >> - LUNs are VMFS datastores and VAAI primitives are enabled client side
                >>   (except UNMAP, as per default).
                >>
                >> Do you see anything risky regarding this configuration?

                >

                > If you use an application that uses SCSI persistent reservations then you
                > could run into trouble, because some apps expect the reservation info
                > to be on the failover nodes as well as the active ones.
                >
                > Depending on how you do failover and the issue that caused the
                > failover, IO could be stuck on the old active node and cause data
                > corruption. If the initial active node loses its network connectivity
                > and you fail over, you have to make sure that the initial active node is
                > fenced off and IO stuck on that node will never be executed. So do
                > something like add it to the ceph monitor blacklist and make sure IO on
                > that node is flushed and failed before unblacklisting it.

                >
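
[Editorial sketch] The fencing order described above can be written down as a small plan. This is only an illustration under stated assumptions, not a tested failover procedure: `fencing_plan` and `Step` are hypothetical names, the address `192.0.2.10:0/0` is a placeholder, and the two steps without a command are site-specific. The only real commands shown are `ceph osd blacklist add` and `ceph osd blacklist rm`, as they were named in Ceph releases of this period (later releases renamed them to `blocklist`). The sketch only builds and prints the steps; it does not run anything.

```python
from typing import List, Optional, Tuple

# A step is a description plus an optional command line (None = manual step).
Step = Tuple[str, Optional[List[str]]]


def fencing_plan(old_active_addr: str) -> List[Step]:
    """Return the ordered fencing steps for a failed-over iSCSI gateway.

    old_active_addr is the Ceph client address of the old active node
    (placeholder value in the usage below).
    """
    return [
        ("fence the old active node at the Ceph level",
         ["ceph", "osd", "blacklist", "add", old_active_addr]),
        ("fail over the initiators to the standby target", None),
        ("confirm IO queued on the old node was flushed and failed", None),
        ("only then lift the fence",
         ["ceph", "osd", "blacklist", "rm", old_active_addr]),
    ]


# Print the plan instead of executing it.
for description, command in fencing_plan("192.0.2.10:0/0"):
    print(description, "->", " ".join(command) if command else "(manual step)")
```

The point of keeping the steps in one ordered list is that the fence is added before failover and removed only after the stuck IO has been dealt with.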

                
                With iSCSI you can't really do hot failover unless you
                only use synchronous IO.
            
            
              VMware does only use synchronous IO.
                Since the hypervisor can't tell what type of data
                the VMs are writing, all IO is treated as needing to be
                synchronous. 
            
            
              (With any of the open-source target software available.)

                Flushing the buffers doesn't really help because you
                don't know which in-flight IO completed before the outage
                and which didn't. You could end up with only part of the
                "transaction" written to persistent storage.

                
                If you only use synchronous IO all the way from the client
                to the persistent storage shared between the iSCSI targets,
                then all should be fine; otherwise YMMV. Some people run it
                like that without realizing the dangers and have never had a
                problem, so the risk may be strictly theoretical. It all
                depends on how often you need to fail over and what data you
                are storing: corrupting a few images on a gallery site could
                be fine, but corrupting a large database tablespace is no fun
                at all.
            
            
              No, it's not. VMFS corruption is
                pretty bad too and there is no fsck for VMFS...
            
            
                Some (non-open-source) solutions exist. Solaris
                supposedly does this in some(?) way; maybe some iSCSI guru
                can chime in and tell us what magic they do, but I don't think
                it's possible without client support
                (you essentially have to do something like transactions
                and replay the last transaction on failover). Maybe
                something can be enabled in the protocol to make the iSCSI IO
                synchronous, or at least make it wait for some sort of ACK
                from the server (which would require some sort of cache
                mirroring between the targets), without making it synchronous
                all the way.
            
            
              This is why the SAN vendors wrote
                their own clients and drivers. It is not possible to
                dynamically make all OSes do what your iSCSI target
                expects.
            
            
              Something like VMware does the right
                thing pretty much all the time (there are some iSCSI
                initiator bugs in earlier ESXi 5.x).  If you have
                control of your ESXi hosts then attempting to set up HA
                iSCSI targets is possible. 
            
            
              If you have a mixed client
                environment with various versions of Windows connecting
                to the target, you may be better off buying some SAN
                appliances.
            
            
                The one time I had to use it I resorted to simply
                mirroring via mdraid on the client side over two
                targets sharing the same DAS. This worked fine during
                testing but never went to production in the end.

                
                Jan

                
                >

                >>

                >> Would you recommend the LIO or STGT (with rbd bs-type) target for ESXi
                >> clients?
                >
                > I can't say, because I have not used stgt with rbd bs-type support enough.
            
            
              For starters, STGT doesn't implement
                VAAI properly and you will need to disable VAAI in ESXi.
            
            
            LIO does seem to implement VAAI
              properly, but performance is not nearly as good as STGT
              even with VAAI's benefits. The assumed cause is
              that LIO currently uses kernel rbd mapping and kernel rbd
              performance is not as good as librbd.
            
               
              I recently did a simple test of
                creating an 80GB eager zeroed disk with STGT (VAAI
                disabled, no rbd client cache) and LIO (VAAI
                enabled) and found that STGT was actually
                slightly faster.
            
            
              I think we're all holding our breath
                waiting for LIO librbd support via TCMU, which seems to
                be right around the corner. That solution will combine
                the performance benefits of librbd with the more
                feature-full LIO iSCSI interface. The lrbd configuration
                tool for LIO from SUSE is pretty cool and it makes
                configuring LIO easier than STGT. 
            
            
            Hi Jake,

            
            The problem we're facing with LIO is that ESXi hosts
            regularly disconnect from vCenter. This is a result of
            the iSCSI datastore becoming unreachable.

            It happens randomly. The last time there was almost no VM
            activity at all (only 6 VMs in the lab), but it occurred when
            ESXi requested a write to the '.iormstats.sf' file, which I
            suppose is related to Storage I/O Control, but I'm not sure of that.

            
            Setting VMFS3.UseATSForHBOnVMFS5 to 0 didn't help.
            Restarting the LIO target almost instantly solves it.

            
            Has any of you ever encountered this issue with the LIO target?

          Yes, this is a current known problem that will hopefully be
              resolved soon. When there is a delay servicing IO, ESXi
              asks the target to cancel the IO. LIO tries to do this,
              but from what I understand, RBD doesn’t have the API
              to allow LIO to reach into the Ceph cluster and cancel the
              in-flight IO. LIO responds, saying "I can’t do this",
              and then ESXi asks again. And so LIO and ESXi enter a loop
              forever.
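
[Editorial sketch] The loop described above can be pictured with a toy model. This is only an illustration of the retry pattern, not LIO or ESXi code: `ToyTarget`, `can_cancel_inflight`, `abort_task`, `count_abort_attempts` and the `max_attempts` cap are all hypothetical names introduced here. The cap exists only so the example terminates; in the situation described, nothing bounds the retries.

```python
from dataclasses import dataclass


@dataclass
class ToyTarget:
    """Stand-in for a target whose backing store may be unable to cancel IO."""
    can_cancel_inflight: bool = False

    def abort_task(self) -> bool:
        # True only if the in-flight IO was actually cancelled.
        return self.can_cancel_inflight


def count_abort_attempts(target: ToyTarget, max_attempts: int) -> int:
    """Retry the abort until it succeeds or the attempt cap is reached."""
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        if target.abort_task():
            break
    return attempts


# A target that cannot cancel uses up every attempt;
# one that can cancel stops after the first.
print(count_abort_attempts(ToyTarget(can_cancel_inflight=False), 5))
print(count_abort_attempts(ToyTarget(can_cancel_inflight=True), 5))
```

In this model the only ways out are a backend that can cancel the IO or an initiator that gives up, which matches why restarting the target clears the condition.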
          

    Hi Nick,

    
    Thanks for this explanation.

    
    Are you aware of any workaround or ESXi initiator option to tweak
    (like an I/O timeout value) to avoid that?

    
    Or does this make the LIO target unusable with ESXi as of now?

    
    Is STGT also affected, or does it respond better with the rbd
    (librbd) backstore?

    
    Frederic.

  