Re: Ceph, LIO, VMWARE anyone?

Jake Young <jak3kaj@xxxxxxxxx> · Fri, 23 Jan 2015 11:46:04 -0500

Thanks for the feedback Nick and Zoltan,
I have been seeing periodic kernel panics while using LIO, caused either by LIO itself or by the kernel rbd mapping.  I have seen this on Ubuntu precise with kernel 3.14.14 and again on Ubuntu trusty with the utopic kernel (currently 3.16.0-28).  Ironically, this is the primary reason I started exploring a redundancy solution for my iSCSI proxy node.  So, yes, these crashes have nothing to do with running the Active/Active setup.

I am moving my entire setup from LIO to rbd-enabled tgt, which I've found to be much more stable and which gives equivalent performance.

I've been testing active/active LIO with VMware since July 2014 and I've never seen any VMFS corruption.  I am now convinced (thanks Nick) that it is possible.  The reason I have not seen any corruption may have to do with how VMware happens to be configured.

Originally, I made a point of using round robin path selection on the VMware hosts, but performance testing showed that it didn't actually help.  When the host switches iSCSI targets there is a short "spin up time" before LIO gets to 100% IO capability.  Since round robin switches targets every 30 seconds (60 seconds? I forget), this seemed to be significant.  A secondary goal for me was to end up with a config that required minimal tuning of VMware and the target software, so the obvious choice was to leave VMware's path selection at the default, which is Fixed and picks the first target in ASCII-betical order.  That means I am actually functioning in Active/Passive mode.
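
To illustrate why this ends up as Active/Passive: if the path is simply the first target name in ASCII order, as described above, the same target is chosen every time while it is up. This is a minimal Python sketch, not VMware's actual implementation. The target names and the failover behaviour are assumptions made for illustration.

```python
def pick_fixed_path(targets, alive):
    """Return the first live target in ASCII (byte-wise) order, or None."""
    for name in sorted(targets, key=lambda s: s.encode("ascii")):
        if name in alive:
            return name
    return None


# Hypothetical target names, not taken from the thread.
targets = ["iqn.2015-01.example:proxy-b", "iqn.2015-01.example:proxy-a"]

# With both targets up, proxy-a is always chosen and proxy-b sits idle.
primary = pick_fixed_path(targets, alive=set(targets))

# Only when proxy-a is unavailable does IO move to proxy-b.
failover = pick_fixed_path(targets, alive={"iqn.2015-01.example:proxy-b"})

print(primary, failover)
```

Because the ordering is deterministic, the second target only ever sees IO during a failure of the first, which is why no two hosts write through different targets in normal operation.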

Jake

On Fri, Jan 23, 2015 at 8:46 AM, Zoltan Arnold Nagy <zoltan@xxxxxxxxxxxxxxxxxx> wrote:

    Just to chime in: it will look fine, feel fine, but underneath it's
    quite easy to get VMFS corruption. Happened in our tests.

    Also if you're running LIO, from time to time expect a kernel panic
    (haven't tried with the latest upstream, as I've been using
    Ubuntu 14.04 on my "export" hosts for the test, so might have
    improved...).

    As of now I would not recommend this setup without being aware of
    the risks involved.

    There have been a few upstream patches getting the LIO code in
    better cluster-aware shape, but no idea if they have been merged
    yet. I know Red Hat has a guy on this.

    On 01/21/2015 02:40 PM, Nick Fisk wrote:

        Hi Jake,

        Thanks for this. I have been going through it and have a pretty
        good idea of what you are doing now. I may be missing something
        in your scripts, but I’m still not quite understanding how you
        make sure locking happens with the ESXi ATS SCSI command.

        From this slide

        http://xo4t.mjt.lu/link/xo4t/gzyhtx3/1/_9gJVMUrSdvzGXYaZfCkVA/aHR0cHM6Ly93aWtpLmNlcGguY29tL0BhcGkvZGVraS9maWxlcy8zOC9oYW1tZXItY2VwaC1kZXZlbC1zdW1taXQtc2NzaS10YXJnZXQtY2x1c3RlcmluZy5wZGY
        (Page 8)

        It seems to indicate that for a true active/active setup the two
        targets need to be aware of each other and exchange locking
        information for it to work reliably. I’ve also watched the video
        from the Ceph developer summit where this is discussed, and it
        seems that Ceph and the kernel need changes to allow this locking
        to be pushed back to the RBD layer so it can be shared. From what
        I can see browsing through the Linux Git repo, these patches
        haven’t made the mainline kernel yet.

        Can you shed any light on this? As tempting as having
        active/active is, I’m wary about using the configuration until I
        understand how the locking is working and whether fringe cases
        involving multiple ESXi hosts writing to the same LUN on
        different targets could spell disaster.

        Many thanks,
        Nick

        From: Jake Young [mailto:jak3kaj@xxxxxxxxx]
        Sent: 14 January 2015 16:54
        To: Nick Fisk
        Cc: Giuseppe Civitella; ceph-users
        Subject: Re: Ceph, LIO, VMWARE anyone?

            Yes, it's active/active and I found that VMware can switch
            from path to path with no issues or service impact.

            I posted some config files here: github.com/jak3kaj/misc

            One set is from my LIO nodes, both the primary and secondary
            configs so you can see what I needed to make unique.  The
            other set (targets.conf) is from my tgt nodes.  They are both
            4 LUN configs.

            Like I said in my previous email, there is no performance
            difference between LIO and tgt.  The only service I'm running
            on these nodes is a single iSCSI target instance (either LIO
            or tgt).

            Jake

            On Wed, Jan 14, 2015 at 8:41 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:

                  Hi Jake,

                  I can’t remember the exact details, but it was
                  something to do with a potential problem when using the
                  pacemaker resource agents. I think it was a potential
                  hang when one LUN on a shared target failed and the
                  agent then tried to kill all the other LUNs to fail the
                  target over to another host. This then leaves the TCM
                  part of LIO locking the RBD, which also can’t fail over.

                  That said, I did try multiple LUNs on one target as a
                  test and didn’t experience any problems.

                  I’m interested in the way you have your setup
                  configured though. Are you saying you effectively have
                  an active/active configuration with a path going to
                  either host, or are you failing the iSCSI IP between
                  hosts? If it’s the former, have you had any problems
                  with SCSI locking/reservations etc. between the two
                  targets?

                  I can see the advantage of that configuration, as you
                  reduce or eliminate a lot of the troubles I have had
                  with resources failing over.

                  Nick

                  From: Jake Young [mailto:jak3kaj@xxxxxxxxx]
                  Sent: 14 January 2015 12:50
                  To: Nick Fisk
                  Cc: Giuseppe Civitella; ceph-users
                  Subject: Re: Ceph, LIO, VMWARE anyone?

                      Nick,

                      Where did you read that having more than 1 LUN per
                      target causes stability problems?

                      I am running 4 LUNs per target.

                      For HA I'm running two Linux iSCSI target servers
                      that map the same 4 rbd images. The two targets
                      have the same serial numbers, T10 address, etc.  I
                      copy the primary's config to the backup and change
                      IPs. This way VMware thinks they are different
                      target IPs on the same host. This has worked very
                      well for me.
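
The copy-the-config-and-change-IPs step described above could be sketched roughly as follows. This is a minimal Python illustration under stated assumptions: the config fragment, its keywords, and the addresses are hypothetical and are not the author's actual files. Only IPv4 addresses are rewritten, so serial numbers and other identifiers stay identical on both nodes.

```python
import re


def make_backup_config(primary_text, ip_map):
    """Return a copy of the primary config with each primary IP swapped
    for the backup's IP; all other text is kept unchanged."""
    def swap(match):
        # Leave addresses that are not in the map untouched.
        return ip_map.get(match.group(0), match.group(0))

    # Match dotted-quad IPv4 addresses only.
    pattern = r"(?<![\d.])\d{1,3}(?:\.\d{1,3}){3}(?![\d.])"
    return re.sub(pattern, swap, primary_text)


# Hypothetical config fragment, not taken from the thread.
primary = "portal 192.0.2.10:3260\nserial 0001\n"
backup = make_backup_config(primary, {"192.0.2.10": "192.0.2.11"})
print(backup)
```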

                      One suggestion I have is to try rbd-enabled tgt.
                      The performance is equivalent to LIO, but I found
                      it is much better at recovering from a cluster
                      outage. I've had LIO lock up the kernel or simply
                      not recognize that the rbd images are available
                      again, whereas tgt will eventually present the rbd
                      images again.

                      I have been slowly adding servers and am expanding
                      my test setup into a production setup (a nice thing
                      about Ceph). I now have 6 OSD hosts with 7 disks
                      each. I'm using the LSI Nytro cache RAID
                      controller, so I don't have a separate journal, and
                      I have 40Gb networking. I plan to add another 6 OSD
                      hosts in another rack in the next 6 months (and
                      then another 6 next year). I'm doing 3x
                      replication, so I want to end up with 3 racks.

                      Jake

                          On Wednesday, January 14, 2015, Nick Fisk <nick@xxxxxxxxxx> wrote:

                              Hi Giuseppe,

                              I am working on something very similar at
                              the moment. I currently have it running on
                              some test hardware and it seems to be
                              working reasonably well.

                              I say reasonably as I have had a few
                              instabilities, but these are on the HA
                              side; the LIO and RBD side of things has
                              been rock solid so far. The main problems
                              I have had seem to be around recovering
                              from failure, with resources ending up in
                              an unmanaged state. I’m not currently
                              using fencing, so this may be part of the
                              cause.

                              As a brief description of my configuration:

                              4 hosts, each having 2 OSDs and also
                              running the monitor role
                              3 additional hosts in an HA cluster, which
                              act as iSCSI proxy nodes

                              I’m using the IP, RBD, iSCSITarget and
                              iSCSILUN resource agents to provide an HA
                              iSCSI LUN which maps back to an RBD. All
                              the agents for each RBD are in a group so
                              they follow each other between hosts.

                              I’m using 1 LUN per target as I read
                              somewhere there are stability problems
                              using more than 1 LUN per target.

                              Performance seems OK; I can get about 1.2k
                              random IOs out of the iSCSI LUN. This
                              seems to be about right for the Ceph
                              cluster size, so I don’t think the LIO
                              part is causing any significant overhead.

                              We should be getting our production
                              hardware shortly, which will have 40 OSDs
                              with journals and an SSD caching tier, so
                              within the next month or so I will have a
                              better idea of running it in a production
                              environment and of the performance of the
                              system.

                              Hope that helps. If you have any
                              questions, please let me know.

                              Nick

                              From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Giuseppe Civitella
                              Sent: 13 January 2015 11:23
                              To: ceph-users
                              Subject: Ceph, LIO, VMWARE anyone?

                                Hi all,

                                I'm working on a lab setup with Ceph
                                serving rbd images as iSCSI datastores
                                to VMware via a LIO box. Has anyone
                                already done something similar and is
                                willing to share some knowledge? Any
                                production deployments? What about LIO's
                                HA and LUN performance?

                                Thanks

                                Giuseppe

      _______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
