Re: All SSD Pool - Odd Performance

Zoltan Arnold Nagy <zoltan@xxxxxxxxxxxxxxxxxx> · Sun, 22 Nov 2015 14:29:22 +0100

It would have been more interesting if you had tweaked only one option, as now we can’t be sure which change had what impact… :-)
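
One way to separate the effects is to change a single option per run, starting from the old setup. A minimal Python sketch of such a test plan follows. The values come from Udo's two configurations; the dictionary keys, the helper name `one_factor_runs`, and the assumption that iothread was off on the older host are illustrative only.

```python
# Baseline = the Proxmox VE 3.4 setup from Udo's mail; target = the VE 4 setup.
# The keys are illustrative labels, not real Proxmox or QEMU option names.
# Assumption: iothread was off on the VE 3.4 host (the mail does not say).
baseline = {"qemu": "2.2.1", "cache": "writethrough", "iothread": "off"}
target = {"qemu": "2.4", "cache": "writeback", "iothread": "on"}


def one_factor_runs(baseline, target):
    """Return the baseline plus one run per option that differs from it."""
    runs = [("baseline", dict(baseline))]
    for key, value in target.items():
        if baseline.get(key) != value:
            changed = dict(baseline)
            changed[key] = value  # change exactly one option
            runs.append((f"only {key}={value}", changed))
    return runs


for label, config in one_factor_runs(baseline, target):
    print(label, config)
```

Running the same fio job once per listed configuration would show how much each option contributes on its own, although it would not reveal interactions between options.
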
On 22 Nov 2015, at 04:29, Udo Lembke <ulembke@xxxxxxxxxxxx> wrote:

    Hi Sean,

    Haomai is right that qemu can make a huge performance difference.

    I have run two tests against the same ceph cluster (different pools, but
    this should not make any difference).

    One test with proxmox ve 4 (qemu 2.4, iothread for device, and
    cache=writeback) gives 14856 iops

    Same test with proxmox ve 3.4 (qemu 2.2.1, cache=writethrough) gives
    5070 iops only.

    Here are the results in full:

    ############### proxmox ve 3.x ###############

    kvm --version

    QEMU emulator version 2.2.1, Copyright (c) 2003-2008 Fabrice Bellard

    VM:

    virtio2:
    ceph_file:vm-405-disk-1,cache=writethrough,backup=no,size=4096G

    root@fileserver:/daten/support/test# fio --time_based
    --name=benchmark --size=4G --filename=/mnt/test.bin
    --ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1
    --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4
    --rw=randwrite --blocksize=4k --group_reporting

    fio: time_based requires a runtime/timeout setting

    benchmark: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K,
    ioengine=libaio, iodepth=128

    ...

    fio-2.1.11

    Starting 4 processes

    benchmark: Laying out IO file(s) (1 file(s) / 4096MB)

    Jobs: 1 (f=1): [_(1),w(1),_(2)] [100.0% done] [0KB/40024KB/0KB /s]
    [0/10.6K/0 iops] [eta 00m:00s]

    benchmark: (groupid=0, jobs=4): err= 0: pid=7821: Sun Nov 22
    04:07:47 2015

      write: io=16384MB, bw=20282KB/s, iops=5070, runt=827178msec

        slat (usec): min=0, max=2531.7K, avg=778.68, stdev=12757.26

        clat (usec): min=508, max=2755.2K, avg=99980.14, stdev=146967.17

         lat (msec): min=1, max=2755, avg=100.76, stdev=147.54

        clat percentiles (msec):

         |  1.00th=[   10],  5.00th=[   14], 10.00th=[   19],
    20.00th=[   28],

         | 30.00th=[   36], 40.00th=[   43], 50.00th=[   51],
    60.00th=[   63],

         | 70.00th=[   81], 80.00th=[  128], 90.00th=[  237], 95.00th=[ 
    367],

         | 99.00th=[  717], 99.50th=[  889], 99.90th=[ 1516], 99.95th=[
    1713],

         | 99.99th=[ 2573]

        bw (KB  /s): min=    4, max=30726, per=26.90%, avg=5456.84,
    stdev=3014.45

        lat (usec) : 750=0.01%, 1000=0.01%

        lat (msec) : 2=0.01%, 4=0.01%, 10=1.11%, 20=10.18%, 50=37.74%

        lat (msec) : 100=26.45%, 250=15.22%, 500=6.66%, 750=1.74%,
    1000=0.55%

        lat (msec) : 2000=0.29%, >=2000=0.03%

      cpu          : usr=0.36%, sys=2.31%, ctx=1148702, majf=0, minf=30

      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
    >=64=100.0%

         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
    64=0.0%, >=64=0.0%

         complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
    64=0.0%, >=64=0.1%

         issued    : total=r=0/w=4194304/d=0, short=r=0/w=0/d=0

         latency   : target=0, window=0, percentile=100.00%, depth=128

    Run status group 0 (all jobs):

      WRITE: io=16384MB, aggrb=20282KB/s, minb=20282KB/s,
    maxb=20282KB/s, mint=827178msec, maxt=827178msec

    Disk stats (read/write):

        dm-0: ios=0/4483641, merge=0/0, ticks=0/104928824,
    in_queue=105927128, util=100.00%, aggrios=1/4469640,
    aggrmerge=0/14788, aggrticks=64/103711096, aggrin_queue=104165356,
    aggrutil=100.00%

      vda: ios=1/4469640, merge=0/14788, ticks=64/103711096,
    in_queue=104165356, util=100.00%

    ##############################################

    ############### proxmox ve 4.x ###############

    kvm --version

    QEMU emulator version 2.4.0.1 pve-qemu-kvm_2.4-12, Copyright (c)
    2003-2008 Fabrice Bellard

    grep ceph /etc/pve/qemu-server/102.conf 

    virtio1:
    ceph_test:vm-102-disk-1,cache=writeback,iothread=on,size=100G

    root@fileserver-test:/daten/tv01/test# fio --time_based
    --name=benchmark --size=4G --filename=/mnt/test.bin
    --ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1
    --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4
    --rw=randwrite --blocksize=4k --group_reporting           

    fio: time_based requires a runtime/timeout
    setting                                                                                      

    benchmark: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K,
    ioengine=libaio,
    iodepth=128                                                             

    ...                                                                                                                                                

    fio-2.1.11

    Starting 4 processes

    Jobs: 4 (f=4): [w(4)] [99.6% done] [0KB/56148KB/0KB /s] [0/14.4K/0
    iops] [eta 00m:01s]

    benchmark: (groupid=0, jobs=4): err= 0: pid=26131: Sun Nov 22
    03:51:04 2015

      write: io=0B, bw=59425KB/s, iops=14856, runt=282327msec

        slat (usec): min=6, max=216925, avg=261.78, stdev=1802.78

        clat (msec): min=1, max=330, avg=34.04, stdev=27.78

         lat (msec): min=1, max=330, avg=34.30, stdev=27.87

        clat percentiles (msec):

         |  1.00th=[   10],  5.00th=[   13], 10.00th=[   14],
    20.00th=[   16],

         | 30.00th=[   18], 40.00th=[   19], 50.00th=[   21],
    60.00th=[   24],

         | 70.00th=[   33], 80.00th=[   62], 90.00th=[   81],
    95.00th=[   87],

         | 99.00th=[   95], 99.50th=[  100], 99.90th=[  269], 99.95th=[ 
    277],

         | 99.99th=[  297]

        bw (KB  /s): min=    3, max=42216, per=25.10%, avg=14917.03,
    stdev=2990.50

        lat (msec) : 2=0.01%, 4=0.01%, 10=1.13%, 20=45.52%, 50=28.23%

        lat (msec) : 100=24.61%, 250=0.35%, 500=0.16%

      cpu          : usr=2.20%, sys=14.42%, ctx=2462199, majf=0, minf=40

      IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
    >=64=100.0%

         submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
    64=0.0%, >=64=0.0%

         complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
    64=0.0%, >=64=0.1%

         issued    : total=r=0/w=4194304/d=0, short=r=0/w=0/d=0

         latency   : target=0, window=0, percentile=100.00%, depth=128

    Run status group 0 (all jobs):

      WRITE: io=16384MB, aggrb=59424KB/s, minb=59424KB/s,
    maxb=59424KB/s, mint=282327msec, maxt=282327msec

    Disk stats (read/write):

        dm-0: ios=0/4192044, merge=0/0, ticks=0/35093432,
    in_queue=35116888, util=99.70%, aggrios=0/4194626, aggrmerge=0/14,
    aggrticks=0/34902692, aggrin_queue=34903976, aggrutil=99.65%

      vda: ios=0/4194626, merge=0/14, ticks=0/34902692,
    in_queue=34903976, util=99.65%

    ##############################################
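
As a sanity check on the two runs above: Little's law says throughput is roughly the number of requests in flight divided by the mean latency. With 4 jobs at iodepth 128 there are up to 512 requests outstanding, so the reported mean latencies should predict the reported IOPS. A rough Python sketch, assuming the queues stay full for the whole run (the function name is made up for illustration):

```python
def predicted_iops(numjobs, iodepth, mean_lat_ms):
    """Little's law: throughput = requests in flight / mean latency."""
    in_flight = numjobs * iodepth
    return in_flight / (mean_lat_ms / 1000.0)


# Mean 'lat' values (msec) and reported IOPS copied from the two fio outputs.
runs = {
    "proxmox ve 3.x": (100.76, 5070),
    "proxmox ve 4.x": (34.30, 14856),
}

for name, (lat_ms, reported) in runs.items():
    estimate = predicted_iops(4, 128, lat_ms)
    print(f"{name}: predicted {estimate:.0f} iops, fio reported {reported}")
```

Both estimates land within about 1% of what fio reported, which suggests the guest is latency bound at this queue depth: the higher IOPS figure corresponds directly to the lower per-request latency.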

    regards

    Udo

    On 19.11.2015 11:46, Sean Redmond
      wrote:

      Hi Mike/Warren,

        Thanks for helping out here. I am running the fio command
          below to test this with 4 jobs and an iodepth of 128:

        fio --time_based --name=benchmark --size=4G
          --filename=/mnt/test.bin --ioengine=libaio --randrepeat=0
          --iodepth=128 --direct=1 --invalidate=1 --verify=0
          --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k
          --group_reportin
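
Note that fio answers this command line with "time_based requires a runtime/timeout setting", and the outputs in this thread show each run stopping after 16384MB, i.e. 4 jobs x 4G, rather than after a fixed time. For a time-limited run, a `--runtime` value can be added. A small Python sketch that builds such a command line; the 300 second runtime is an arbitrary example value, not something from the thread:

```python
import shlex

# Options from the fio command used in this thread.
fio_args = [
    "fio", "--time_based", "--name=benchmark", "--size=4G",
    "--filename=/mnt/test.bin", "--ioengine=libaio", "--randrepeat=0",
    "--iodepth=128", "--direct=1", "--invalidate=1", "--verify=0",
    "--verify_fatal=0", "--numjobs=4", "--rw=randwrite", "--blocksize=4k",
    "--group_reporting",
]

# Assumption: 300 seconds is an arbitrary example value.
runtime_s = 300
fio_args.insert(2, f"--runtime={runtime_s}")

# Print a copy-pasteable command line; nothing is executed here.
print(shlex.join(fio_args))
```

With a fixed runtime, runs on fast and slow setups take the same wall-clock time, which makes the results easier to compare side by side.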

        The QEMU instance is created using nova, the settings I can
          see in the config are below:

              <disk type='network' device='disk'>
                <driver name='qemu' type='raw'
            cache='writeback'/>
                <auth username='$$'>
                  <secret type='ceph' uuid='$$'/>
                </auth>
                <source protocol='rbd'
            name='ssd_volume/volume-$$'>
                  <host name='$$' port='6789'/>
                  <host name='$$' port='6789'/>
                  <host name='$$' port='6789'/>
                </source>
                <target dev='vde' bus='virtio'/>
                <serial>$$</serial>
                <address type='pci' domain='0x0000' bus='0x00'
            slot='0x09' function='0x0'/>
              </disk>

        The below shows the output from running Fio:

          # fio --time_based --name=benchmark --size=4G
            --filename=/mnt/test.bin --ioengine=libaio --randrepeat=0
            --iodepth=128 --direct=1 --invalidate=1 --verify=0
            --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k
            --group_reporting
          fio: time_based requires a runtime/timeout setting
          benchmark: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K,
            ioengine=libaio, iodepth=128
          ...
          benchmark: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K,
            ioengine=libaio, iodepth=128
          fio-2.0.13
          Starting 4 processes
          Jobs: 3 (f=3): [_www] [99.7% done] [0K/36351K/0K /s] [0
            /9087 /0  iops] [eta 00m:03s]
          benchmark: (groupid=0, jobs=4): err= 0: pid=8547: Thu Nov
            19 05:16:31 2015
            write: io=16384MB, bw=19103KB/s, iops=4775 ,
            runt=878269msec
              slat (usec): min=4 , max=2339.4K, avg=807.17,
            stdev=12460.02
              clat (usec): min=1 , max=2469.6K, avg=106265.05,
            stdev=138893.39
               lat (usec): min=67 , max=2469.8K, avg=107073.04,
            stdev=139377.68
              clat percentiles (usec):
               |  1.00th=[ 1928],  5.00th=[ 9408], 10.00th=[12352],
            20.00th=[18816],
               | 30.00th=[43776], 40.00th=[64768], 50.00th=[78336],
            60.00th=[89600],
               | 70.00th=[102912], 80.00th=[123392],
            90.00th=[216064], 95.00th=[370688],
               | 99.00th=[733184], 99.50th=[782336],
            99.90th=[1044480], 99.95th=[2088960],
               | 99.99th=[2342912]
              bw (KB/s)  : min=    4, max=14968, per=26.11%,
            avg=4987.39, stdev=1947.67
              lat (usec) : 2=0.01%, 20=0.01%, 50=0.01%, 100=0.05%,
            250=0.30%
              lat (usec) : 500=0.24%, 750=0.11%, 1000=0.08%
              lat (msec) : 2=0.23%, 4=0.46%, 10=4.47%, 20=15.08%,
            50=11.28%
              lat (msec) : 100=35.47%, 250=23.52%, 500=5.92%,
            750=1.96%, 1000=0.70%
              lat (msec) : 2000=0.06%, >=2000=0.06%
            cpu          : usr=0.62%, sys=2.42%, ctx=1602209,
            majf=1, minf=101
            IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%,
            32=0.1%, >=64=100.0%
               submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%,
            32=0.0%, 64=0.0%, >=64=0.0%
               complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%,
            32=0.0%, 64=0.0%, >=64=0.1%
               issued    : total=r=0/w=4194304/d=0,
            short=r=0/w=0/d=0

          Run status group 0 (all jobs):
            WRITE: io=16384MB, aggrb=19102KB/s, minb=19102KB/s,
            maxb=19102KB/s, mint=878269msec, maxt=878269msec

          Disk stats (read/write):
            vde: ios=1119/4330437, merge=0/105599,
            ticks=556/121755054, in_queue=121749666, util=99.86

        The below shows lspci from within the guest:

          # lspci | grep -i scsi
          00:04.0 SCSI storage controller: Red Hat, Inc Virtio
            block devic

        Thanks

        On Wed, Nov 18, 2015 at 7:05 PM, Warren
          Wang - ISD <Warren.Wang@xxxxxxxxxxx>
          wrote:

          What were
            you using for iodepth and numjobs? If you’re getting an
            average of 2ms per operation, and you’re single threaded,
            I’d expect about 500 IOPS / thread, until you hit the limit
            of your QEMU setup, which may be a single IO thread. That’s
            also what I think Mike is alluding to.
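
Warren's estimate is simple arithmetic: a single synchronous thread completing one operation every 2 ms manages 1 / 0.002 = 500 operations per second, and more requests in flight scale that until something else saturates. A minimal Python sketch of that reasoning; the function name and the optional `cap` parameter are illustrative only:

```python
def expected_iops(latency_ms, outstanding=1, cap=None):
    """IOPS for `outstanding` requests in flight at a fixed per-op latency."""
    iops = outstanding * 1000.0 / latency_ms
    # Optional ceiling, e.g. a single QEMU I/O thread (value is hypothetical).
    return min(iops, cap) if cap is not None else iops


print(expected_iops(2))                            # one thread, 2 ms per op
print(expected_iops(2, outstanding=8))             # eight requests in flight
print(expected_iops(2, outstanding=16, cap=4000))  # limited by the ceiling
```

The model assumes latency stays constant as concurrency grows, which real systems only approximate; it is meant as a first-order check, not a prediction.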

            Warren

            From: Sean Redmond <sean.redmond1@xxxxxxxxx>

            Date: Wednesday, November 18, 2015 at 6:39 AM

            To: "ceph-users@xxxxxxxx" <ceph-users@xxxxxxxx>

            Subject:  All SSD Pool - Odd Performance

                Hi,

                I have a performance question for anyone running an SSD
                only pool. Let me detail the setup first.

                12 X Dell PowerEdge R630 (2 X 2620v3, 64GB RAM)

                8 X intel DC 3710 800GB

                Dual port Solarflare 10Gb/s NIC (one front and one back)

                Ceph 0.94.5

                Ubuntu 14.04 (3.13.0-68-generic)

                The above is in one pool that is used for QEMU guests. A
                4k fio test on the SSD directly yields around 55k IOPS;
                the same test inside a QEMU guest seems to hit a limit
                around 4k IOPS. If I deploy multiple guests they can all
                reach 4k IOPS simultaneously.

                I don't see any evidence of a bottleneck on the OSD
                hosts. Is this limit inside the guest expected, or am I
                just not looking deep enough yet?

                Thanks

      _______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
