Re: Ceph, LIO, VMWARE anyone?

Zoltan Arnold Nagy <zoltan@xxxxxxxxxxxxxxxxxx> · Fri, 23 Jan 2015 23:25:55 +0100



    Correct me if I'm wrong, but tgt doesn't have full SCSI-3
    persistent reservation support when _not_ using the LIO
    backend for it, right?

    
    AFAIK you can either run tgt with its own iSCSI implementation or
    you can use tgt to manage your LIO targets.

    
    I assume when you're running tgt with the rbd backend code you're
    skipping all the in-kernel LIO parts (in which case the Red Hat
    patches won't help a bit), and you won't have proper active-active
    support, since the initiators have no way to synchronize state (and
    more importantly, no way to synchronize write caching! [I can think
    of some really ugly hacks to get around that, tho...]).

    
    On 01/23/2015 05:46 PM, Jake Young
      wrote:

    
      Thanks for the feedback Nick and Zoltan,
        

        I have been seeing periodic kernel panics when I used LIO. 
          It was either due to LIO or the kernel rbd mapping.  I have
          seen this on Ubuntu precise with kernel 3.14.14 and again in
          Ubuntu trusty with the utopic kernel (currently 3.16.0-28).
          Ironically, this is the primary reason I started exploring a
          redundancy solution for my iSCSI proxy node.  So, yes, these
          crashes have nothing to do with running the Active/Active
          setup.
        

        I am moving my entire setup from LIO to rbd-enabled tgt,
          which I've found to be much more stable and which gives
          equivalent performance.
        

        I've been testing active/active LIO since July of 2014 with
          VMWare and I've never seen any vmfs corruption.  I am now
          convinced (thanks Nick) that it is possible.  The reason I
          have not seen any corruption may have to do with how VMWare
          happens to be configured.
        

        Originally, I had made a point of using round robin path
          selection in the VMware hosts, but as I did performance
          testing, I found that it actually didn't help performance.
          When the host switches iSCSI targets there is a short "spin up
          time" for LIO to get to 100% IO capability.  Since round robin
          switches targets every 30 seconds (60 seconds? I forget), this
          seemed to be significant.  A secondary goal for me was to end
          up with a config that required minimal tuning of VMware and
          the target software, so the obvious choice is to leave
          VMware's path selection at the default, which is Fixed and
          picks the first target in ASCII-betical order.  That means I
          am actually functioning in Active/Passive mode.
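
A minimal sketch of what "first target in ASCII-betical order" means in
practice, based only on the behaviour described above. The target names are
hypothetical (the thread does not list the real IQNs), and this models the
ordering only, not VMware's actual path selection code.

```python
# Hypothetical target names, used purely for illustration.
TARGETS = [
    "iqn.2015-01.com.example:iscsi-b",
    "iqn.2015-01.com.example:iscsi-a",
]


def first_in_ascii_order(names):
    """Return the name that sorts first when compared byte by byte."""
    # Python compares str values by code point, which matches ASCII
    # order for ASCII-only names. Uppercase letters sort before lowercase.
    return min(names)


def path_roles(names):
    """Mark the first-sorting target as active and every other one as standby."""
    active = first_in_ascii_order(names)
    return {name: ("active" if name == active else "standby") for name in names}


if __name__ == "__main__":
    for name, role in sorted(path_roles(TARGETS).items()):
        print(role, name)
```

With only one path carrying IO at a time, the two gateways never serve the
same LUN concurrently during normal operation, which is why the setup behaves
as Active/Passive.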
        

        Jake
        

        On Fri, Jan 23, 2015 at 8:46 AM, Zoltan
          Arnold Nagy <zoltan@xxxxxxxxxxxxxxxxxx>
          wrote:

          
             Just to chime in: it
              will look fine, feel fine, but underneath it's quite easy
              to get VMFS corruption. Happened in our tests.

              Also if you're running LIO, from time to time expect a
              kernel panic (haven't tried with the latest upstream, as
              I've been using Ubuntu 14.04 on my "export" hosts for the
              test, so might have improved...).

              
              As of now I would not recommend this setup without being
              aware of the risks involved.

              
              There have been a few upstream patches getting the LIO
              code in better cluster-aware shape, but no idea if they
              have been merged yet. I know Red Hat has a guy on this.

                
                On 01/21/2015 02:40 PM, Nick Fisk wrote:

                
                    Hi Jake,

                    Thanks for this. I have been going through it and
                        have a pretty good idea of what you are doing
                        now. However, I may be missing something looking
                        through your scripts, as I’m still not quite
                        understanding how you are managing to make sure
                        locking is happening with the ESXi ATS SCSI
                        command.
                     
                    From this slide (page 8):

                  http://xo4t.mjt.lu/link/xo4t/gzyhtx3/1/_9gJVMUrSdvzGXYaZfCkVA/aHR0cHM6Ly93aWtpLmNlcGguY29tL0BhcGkvZGVraS9maWxlcy8zOC9oYW1tZXItY2VwaC1kZXZlbC1zdW1taXQtc2NzaS10YXJnZXQtY2x1c3RlcmluZy5wZGY

                    it seems that for a true active/active
                        setup the two targets need to be aware of each
                        other and exchange locking information for it to
                        work reliably. I’ve also watched the video from
                        the Ceph developer summit where this is
                        discussed, and it seems that Ceph and the kernel
                        need changes to allow this locking to be pushed
                        back to the RBD layer so it can be shared. From
                        what I can see browsing through the Linux Git
                        repo, these patches haven’t made the mainline
                        kernel yet.
                     
                    Can you shed any light on this? As tempting as
                        having active/active is, I’m wary about using
                        the configuration until I understand how the
                        locking is working and if fringe cases involving
                        multiple ESXi hosts writing to the same LUN on
                        different targets could spell disaster.
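
To make the concern concrete, here is a toy model of the fringe case, written
as a sketch under stated assumptions. ATS maps to the SCSI COMPARE AND WRITE
command; the classes, names and lock values below are all hypothetical and do
not reflect how LIO or tgt are implemented. The only assumption that matters
is that each gateway does its compare and its write as two separate steps
against the shared image, with nothing coordinating the two gateways.

```python
class SharedImage:
    """Stand-in for one RBD image mapped by both gateways."""

    def __init__(self):
        self.lock_sector = b"free"


class Gateway:
    """Hypothetical iSCSI target with no knowledge of its peer."""

    def __init__(self, image):
        self.image = image
        self._matched = False

    def compare(self, expected):
        # Step 1 of COMPARE AND WRITE: read the sector and compare it.
        self._matched = self.image.lock_sector == expected
        return self._matched

    def write(self, new_value):
        # Step 2: write only if this gateway's own compare succeeded.
        if self._matched:
            self.image.lock_sector = new_value
        return self._matched


def interleaved(image):
    """Two ESXi hosts issue ATS through different gateways at the same time."""
    a, b = Gateway(image), Gateway(image)
    a.compare(b"free")
    b.compare(b"free")  # still sees "free": a has not written yet
    won_a = a.write(b"held-by-esx1")
    won_b = b.write(b"held-by-esx2")
    return won_a, won_b, image.lock_sector


def serialized(image):
    """Same requests, but each compare-and-write completes before the next."""
    a, b = Gateway(image), Gateway(image)
    a.compare(b"free")
    won_a = a.write(b"held-by-esx1")
    b.compare(b"free")
    won_b = b.write(b"held-by-esx2")
    return won_a, won_b, image.lock_sector


if __name__ == "__main__":
    print("interleaved:", interleaved(SharedImage()))
    print("serialized: ", serialized(SharedImage()))
```

In the interleaved case both hosts are told they own the lock, which is the
kind of outcome that would corrupt VMFS metadata. Sharing the lock state
between targets, or pushing the compare-and-write down to the RBD layer as
discussed at the summit, is what would make the two steps atomic.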
                     
                    Many thanks,
                  
                  Nick
                   
                  From: Jake Young [mailto:jak3kaj@xxxxxxxxx]
                  Sent: 14 January 2015 16:54
                  To: Nick Fisk
                  Cc: Giuseppe Civitella; ceph-users
                  Subject: Re: Ceph, LIO, VMWARE anyone?
                  
                  
                          Yes, it's active/active
                            and I found that VMWare can switch from path
                            to path with no issues or service impact.
                        
                        
                        I posted some config files
                          here: github.com/jak3kaj/misc
                        
                           
                          One set is from my LIO
                            nodes, both the primary and secondary
                            configs so you can see what I needed to make
                            unique.  The other set (targets.conf) is
                            from my tgt nodes.  They are both 4-LUN
                            configs.
                        
                        
                          Like I said in my
                            previous email, there is no performance
                            difference between LIO and tgt.  The only
                            service I'm running on these nodes is a
                            single iscsi target instance (either LIO or
                            tgt).
                        
                        
                          Jake
                        
                      
                          On Wed, Jan 14, 2015 at
                            8:41 AM, Nick Fisk <nick@xxxxxxxxxx>
                            wrote:
                          
                            
                                Hi Jake,
                                 
                                I
                                    can’t remember the exact details,
                                    but it was something to do with a
                                    potential problem when using the
                                    Pacemaker resource agents. I think
                                    it was to do with a potential
                                    hanging issue when one LUN on a
                                    shared target failed and then it
                                    tried to kill all the other LUNs to
                                    fail the target over to another
                                    host. This then leaves the TCM part
                                    of LIO locking the RBD, which also
                                    can’t fail over.
                                 
                                That said, I did try multiple LUNs on one
                                    target as a test and didn’t
                                    experience any problems.
                                 
                                I’m interested in the way you have your
                                    setup configured though. Are you
                                    saying you effectively have an
                                    active/active configuration with a
                                    path going to either host, or are
                                    you failing the iSCSI IP between
                                    hosts? If it’s the former, have you
                                    had any problems with SCSI
                                    locking/reservations, etc. between
                                    the two targets?
                                 
                                I
                                    can see the advantage to that
                                    configuration as you
                                    reduce/eliminate a lot of the
                                    troubles I have had with resources
                                    failing over.
                                 
                                Nick
                                 
                                From: Jake Young [mailto:jak3kaj@xxxxxxxxx]
                                Sent: 14 January 2015 12:50
                                To: Nick Fisk
                                Cc: Giuseppe Civitella; ceph-users
                                Subject: Re: Ceph, LIO, VMWARE anyone?
                                
                                  
                                    Nick,
                                    
                                       
                                      Where did you
                                        read that having more than 1 LUN
                                        per target causes stability
                                        problems?
                                    
                                    
                                      I am running
                                        4 LUNs per target. 
                                    
                                    
                                      For HA I'm
                                        running two Linux iSCSI target
                                        servers that map the same 4 rbd
                                        images. The two targets have the
                                        same serial numbers, T10
                                        address, etc.  I copy the
                                        primary's config to the backup
                                        and change IPs. This way VMware
                                        thinks they are different target
                                        IPs on the same host. This has
                                        worked very well for me.
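
A rough sketch of the "copy the primary's config and change IPs" step. The
data model, field names, IQN, serials and addresses are all hypothetical and
are not the format of any real LIO or tgt config file. The point it
illustrates is that everything identifying the LUNs stays byte-for-byte
identical on both nodes and only the portal address differs.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Lun:
    rbd_image: str  # "pool/image", hypothetical naming
    serial: str     # must match on both nodes so the initiator sees one device


@dataclass(frozen=True)
class TargetConfig:
    portal_ip: str  # the only field expected to differ between nodes
    iqn: str        # assumed identical on both nodes
    luns: tuple


def make_secondary(primary, secondary_ip):
    """Copy the primary config verbatim, changing only the portal IP."""
    return replace(primary, portal_ip=secondary_ip)


def device_identity(config):
    """Everything the initiator uses to decide two paths lead to one device."""
    return (config.iqn, config.luns)


PRIMARY = TargetConfig(
    portal_ip="192.0.2.11",
    iqn="iqn.2015-01.com.example:rbd-gw",
    luns=tuple(Lun(f"rbd/vmware-{i}", f"serial-{i:04d}") for i in range(4)),
)
SECONDARY = make_secondary(PRIMARY, "192.0.2.12")

if __name__ == "__main__":
    print(SECONDARY.portal_ip, device_identity(SECONDARY) == device_identity(PRIMARY))
```

Generating the second config from the first, rather than editing it by hand,
keeps the serial numbers from drifting apart between the two nodes.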
                                    
                                    
                                      One
                                        suggestion I have is to try
                                        using rbd-enabled tgt. The
                                        performance is equivalent to
                                        LIO, but I found it is much
                                        better at recovering from a
                                        cluster outage. I've had LIO
                                        lock up the kernel or simply not
                                        recognize that the rbd images
                                        are available, whereas tgt will
                                        eventually present the rbd
                                        images again.
                                    
                                    
                                      I have been
                                        slowly adding servers and am
                                        expanding my test setup to a
                                        production setup (nice thing
                                        about Ceph). I now have 6 OSD
                                        hosts with 7 disks on each. I'm
                                        using the LSI Nytro cache RAID
                                        controller, so I don't have a
                                        separate journal, and I have 40Gb
                                        networking. I plan to add
                                        another 6 OSD hosts in another
                                        rack in the next 6 months (and
                                        then another 6 next year). I'm
                                        doing 3x replication, so I want
                                        to end up with 3 racks.
                                    
                                    
                                      Jake

                                        
                                        On Wednesday, January 14, 2015,
                                        Nick Fisk <nick@xxxxxxxxxx> wrote:
                                      
                                        
                                            Hi Giuseppe,
                                             
                                            I
                                                am working on something
                                                very similar at the
                                                moment. I currently have
                                                it working on some test
                                                hardware and it seems to
                                                be working reasonably well.
                                             
                                            I
                                                say reasonably as I have
                                                had a few instabilities,
                                                but these are on the HA
                                                side; the LIO and RBD
                                                side of things have been
                                                rock solid so far. The
                                                main problems I have had
                                                seem to be around
                                                recovering from failure
                                                with resources ending up
                                                in an unmanaged state.
                                                I’m not currently using
                                                fencing so this may be
                                                part of the cause.
                                             
                                            As a brief description of
                                                my configuration:

                                            4
                                                hosts, each having 2
                                                OSDs and also running the
                                                monitor role
                                            3
                                                additional hosts in an HA
                                                cluster which act as
                                                iSCSI proxy nodes.
                                             
                                            I’m using the IP, RBD,
                                                iSCSITarget and iSCSILUN
                                                resource agents to
                                                provide an HA iSCSI LUN
                                                which maps back to an
                                                RBD. All the agents for
                                                each RBD are in a group
                                                so they follow each
                                                other between hosts.
                                             
                                            I’m using 1 LUN per target
                                                as I read somewhere
                                                there are stability
                                                problems using more than
                                                1 LUN per target.
                                             
                                            Performance seems OK. I can
                                                get about 1.2k random IOPS
                                                out of the iSCSI LUN. This
                                                seems to be about right
                                                for the Ceph cluster
                                                size, so I don’t think
                                                the LIO part is causing
                                                any significant
                                                overhead.
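
A back-of-the-envelope check of the "about right for the cluster size" remark,
using only the figures given in this message (4 hosts, 2 OSDs each, roughly
1.2k random IOPS). It assumes load is spread evenly across OSDs and ignores
replication, journaling and caching, none of which are specified here, so
treat the result as an order-of-magnitude figure only.

```python
def per_osd_share(total_iops, hosts, osds_per_host):
    """Client IOPS divided evenly across every OSD in the cluster."""
    osd_count = hosts * osds_per_host
    return total_iops / osd_count


if __name__ == "__main__":
    # 4 hosts x 2 OSDs = 8 OSDs sharing about 1200 client IOPS.
    print(per_osd_share(1200, hosts=4, osds_per_host=2))
```

If the per-OSD figure is in the range a single disk can deliver, the disks
rather than the iSCSI layer are the likely limit, which is consistent with
the conclusion drawn above.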
                                             
                                            We should be getting our
                                                production hardware
                                                shortly, which will have
                                                40 OSDs with journals
                                                and an SSD caching tier,
                                                so within the next month
                                                or so I will have a
                                                better idea of running
                                                it in a production
                                                environment and the
                                                performance of the
                                                system.
                                             
                                            Hope that helps. If you have
                                                any questions, please
                                                let me know.
                                             
                                            Nick
                                             
                                            From: ceph-users
                                                [mailto:ceph-users-bounces@xxxxxxxxxxxxxx]
                                                On Behalf Of Giuseppe Civitella
                                            Sent: 13 January 2015 11:23
                                            To: ceph-users
                                            Subject: Ceph, LIO, VMWARE anyone?
                                             
                                            
                                              Hi
                                                all,
                                              
                                                 
                                                I'm
                                                  working on a lab setup
                                                  with Ceph serving
                                                  rbd images as iSCSI
                                                  datastores to VMware
                                                  via a LIO box. Has
                                                  anyone already done
                                                  something similar and
                                                  is willing to share
                                                  some knowledge?
                                                  Any production
                                                  deployments? What
                                                  about LIO's HA and
                                                  LUN performance?
                                              
                                              
                                                Thanks 
                                              
                                              
                                                Giuseppe
                                              
                                            
                _______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

              