This is for us peeps using Ceph with VMware.
My current favoured solution for consuming Ceph in VMware is via RBDs formatted with XFS and exported via NFS to ESXi. This seems to perform better than iSCSI+VMFS, which doesn’t play nicely with Ceph’s PG contention issues, particularly when working with thin-provisioned VMDKs.
I’ve still been noticing some performance issues, however, mainly when doing any form of storage migration. This is largely due to the way vSphere transfers VMs in 64KB IOs at a queue depth (QD) of 32. vSphere does this so arrays with QoS can balance the IO more easily than if larger IOs were submitted. However, Ceph’s PG locking means that only one or two of these IOs can happen at a time, seriously lowering throughput. Typically you won’t be able to push more than 20-25MB/s during a storage migration.
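As a rough sanity check on that figure, here’s a back-of-the-envelope model. The 3ms per-IO write latency is purely an assumed number for illustration; the point is that throughput is dictated by how many IOs can actually make progress in parallel, not by the QD32 that vSphere submits.

```python
# Back-of-the-envelope model of storage migration throughput when Ceph's
# PG locking serialises the 64KB IOs that vSphere submits at QD32.
# The 3ms write latency is an assumption for illustration, not a measurement.

IO_SIZE = 64 * 1024      # bytes per IO during a vSphere storage migration
QUEUE_DEPTH = 32         # what vSphere keeps in flight
WRITE_LATENCY = 0.003    # assumed per-IO write latency in seconds

def throughput_mb_s(effective_concurrency):
    """Throughput if only this many IOs actually progress at once."""
    iops = effective_concurrency / WRITE_LATENCY
    return iops * IO_SIZE / 1e6

print(throughput_mb_s(1))            # ~22 MB/s - one IO at a time
print(throughput_mb_s(2))            # ~44 MB/s - two IOs at a time
print(throughput_mb_s(QUEUE_DEPTH))  # ~700 MB/s if all 32 ran in parallel
```

With only one or two IOs effectively progressing, the model lands in the same ballpark as the 20-25MB/s above.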
There is another issue in that the IO needed for the XFS journal on the RBD can cause contention, and it effectively means every NFS write IO sends two IOs down to Ceph. This can have an impact on latency as well. Due to the possible PG contention caused by the XFS journal updates when multiple IOs are in flight, you normally end up creating more and more RBDs to try and spread the load. This normally means you end up having to do storage migrations... you can see where I’m going with this.
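To put the journal effect into numbers, here’s a tiny sketch of that write amplification; the front-end rate is an assumed figure just to make the arithmetic concrete.

```python
# Sketch of the write amplification from layering XFS on an RBD: as
# described above, each NFS write effectively sends two writes to Ceph
# (the data itself plus the XFS journal update).
# The front-end rate below is an assumed figure, purely for illustration.

frontend_write_iops = 1000            # assumed NFS writes/s arriving from ESXi

data_writes = frontend_write_iops     # VMDK data hitting the RBD
journal_writes = frontend_write_iops  # matching XFS log updates

backend_iops = data_writes + journal_writes
print(backend_iops)                   # ~2000 writes/s landing on RADOS

# The XFS log also lives in a small, fixed region of the RBD, so those
# journal writes keep hitting the same few objects/PGs - the worst case
# for Ceph's per-PG locking when lots of IOs are in flight.
```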
I’ve been thinking for a while that
CephFS works around a lot of these limitations.
1. It supports fancy striping, so there should be less per-object contention (see the striping sketch after this list)
2. There is no FS in the middle to maintain a journal and other associated IO
3. A single large NFS mount should have none of the disadvantages seen with a single RBD
4. No need to migrate VMs about because of #3
5. No need to fstrim after deleting VMs
6. Potential to do away with Pacemaker and use LVS to do active/active NFS, as ESXi does its own locking with files
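To illustrate point 1: CephFS file layouts let you stripe a file across many RADOS objects with a small stripe unit, so consecutive 64KB writes land on different objects (and usually different PGs) instead of queueing behind one object’s lock. Here’s a minimal sketch of the standard Ceph striping arithmetic (stripe_unit / stripe_count / object_size); the layout values are just example numbers, not a recommendation.

```python
# Map a file offset to a RADOS object under Ceph's striping parameters.
# The layout values below are illustrative examples, not tuning advice.

STRIPE_UNIT = 64 * 1024        # bytes written to one object before moving on
STRIPE_COUNT = 8               # objects striped across in each object set
OBJECT_SIZE = 4 * 1024 * 1024  # maximum size of each RADOS object

def object_for_offset(offset):
    """Return the object number a byte offset maps to under this layout."""
    stripes_per_object = OBJECT_SIZE // STRIPE_UNIT
    block = offset // STRIPE_UNIT          # which stripe unit overall
    stripe_no = block // STRIPE_COUNT      # which stripe (row)
    stripe_pos = block % STRIPE_COUNT      # which column in that stripe
    object_set = stripe_no // stripes_per_object
    return object_set * STRIPE_COUNT + stripe_pos

# Eight consecutive 64KB writes from a storage migration:
offsets = [i * 64 * 1024 for i in range(8)]
print([object_for_offset(o) for o in offsets])   # [0, 1, 2, 3, 4, 5, 6, 7]
# With a default-style layout (stripe_count=1) they would all hit object 0
# until 4MB had been written, serialising behind a single PG lock.
```

In CephFS the layout is set per file or directory via the ceph.file.layout / ceph.dir.layout extended attributes.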
With this in mind, I exported a CephFS mount via NFS and then mounted it on an ESXi host as a test. Initial results are looking very good. I’m seeing storage migrations to the NFS mount going at over 200MB/s, which equates to several thousand IOs a second (around 3,000 at 64KB each) and seems to be writing at the intended QD32.
I need to do more testing to make sure
everything works as intended, but like I say, promising
initial results.
Further testing needs to be done to see what sort of MDS performance is required; I would imagine that since we are mainly dealing with large files, it might not be that critical. I also need to consider the stability of CephFS: RBD is relatively simple and is in use by a large proportion of the Ceph community, whereas CephFS is a lot easier to “upset”.
Nick