Re: New cluster performance analysis

Hi Kris,

Indeed, I am seeing some latency spikes, and they seem to be linked to other spikes in throughput and cluster-wide IOPS. I also see some spikes on the OSDs (I guess this is when the journal is flushed), but IO on the journals is quite steady. I have already tuned the OSD filestore and journal parameters a bit, to check that there wasn't a hidden limitation that could explain what I originally didn't understand.

As you said I will need to check the behavior of the cluster under the actual workload and adjust accordingly. That should happen some time next week.

Thanks for your input :)

On Wed, Dec 9, 2015 at 5:10 PM, Kris Gillespie <kgillespie@xxxxxxx> wrote:
One thing I noticed in all my testing: the speed difference between the SSDs and the spinning rust can be quite high, and since your journal needs to flush every X bytes (configurable), the impact of this flush can be hard, as IO to the journal will stop until it’s finished (I believe). Something to try: run a fio test, log the latency stats as well, and then graph them. That should make the issue pretty clear. I predict you’re going to see some spikes.

If so, you may need to

a) decide if it’s a problem with the future defined workload - maybe it’s not so bursty…
b) have a look at http://docs.ceph.com/docs/hammer/rados/configuration/journal-ref/ and maybe tweak “journal max write bytes” or the other journal settings

There won’t be a golden rule here however and it’s one of the reasons some benchmarks can lead to unfounded worrying. 
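As a rough sketch of the "log the latency stats and look for spikes" suggestion above: fio can write per-I/O latency logs (for example with its write_lat_log option), and the snippet below scans such a log for entries above a threshold. It assumes each log line has at least four comma-separated fields, with the time in milliseconds first and the latency value second. The latency unit depends on the fio version (microseconds in older releases, nanoseconds in newer ones), so the threshold and the sample values here are purely illustrative, and the function name is made up for this example.

```python
import csv
from typing import Iterable, List, Tuple


def find_latency_spikes(log_lines: Iterable[str], threshold: int) -> List[Tuple[int, int]]:
    """Return (time_ms, latency) pairs whose latency exceeds the threshold.

    Assumes fio-style latency log lines: "time_ms, latency, direction, blocksize[, ...]".
    The latency unit is whatever the fio version in use writes (assumption).
    """
    spikes = []
    for row in csv.reader(log_lines):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        time_ms = int(row[0].strip())
        latency = int(row[1].strip())
        if latency > threshold:
            spikes.append((time_ms, latency))
    return spikes


# Illustrative sample, not taken from a real run.
sample = [
    "1000, 4200, 1, 4096",
    "2000, 4800, 1, 4096",
    "3000, 322000, 1, 4096",
    "4000, 5100, 1, 4096",
]
print(find_latency_spikes(sample, threshold=100_000))
```

On a real log, the same function can be fed an open file object, and the returned timestamps can be lined up against the OSD and journal graphs to see whether the spikes coincide with journal flushes.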

Cheers

Kris


On 04 Dec 2015, at 15:10, Jan Schermer <jan@xxxxxxxxxxx> wrote:


On 04 Dec 2015, at 14:31, Adrien Gillard <gillard.adrien@xxxxxxxxx> wrote:

After some more tests :

 - The pool being used as cache pool has no impact on performance, I get the same results with a "dedicated" replicated pool.
 - You are right Jan, on raw devices I get better performance on a volume if I fill it first, or at least if I write to a zone that has already been allocated
 - The same seems to apply when the test is run on the mounted filesystem.


Yeah. The first (raw device) is because the objects on the OSDs get "thick" in the process.
The second (filesystem) is because of both the OSD objects getting thick and the guest filesystem getting thick.
Preallocating the space can speed things up considerably (like 100x).
Unfortunately I haven't found a way to convince fallocate() & co. to thick provision files.
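For illustration, here is a minimal sketch of the "fill it first" approach used elsewhere in this thread (the dd from /dev/zero), written as a Python helper. The function name and chunk size are arbitrary choices for this example. It simply writes real zero bytes so that the space is actually allocated, and it overwrites whatever is at the target path, so it should only ever be pointed at a scratch file or an empty volume.

```python
import os
import tempfile


def fill_with_zeros(path: str, total_bytes: int, chunk_bytes: int = 1024 * 1024) -> int:
    """Write total_bytes of zeros to path in chunks and return the bytes written.

    WARNING: this overwrites existing data at path (file or block device).
    """
    zeros = bytes(chunk_bytes)
    written = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        while written < total_bytes:
            want = min(chunk_bytes, total_bytes - written)
            # os.write may write fewer bytes than requested; count what was written.
            written += os.write(fd, zeros[:want])
        os.fsync(fd)  # make sure the data has reached the backing store
    finally:
        os.close(fd)
    return written


# Harmless demo on a scratch file in a temporary directory.
with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, "thick.img")
    written = fill_with_zeros(target, 3 * 1024 * 1024 + 5)
    print(written, os.path.getsize(target))
```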

Jan





On Thu, Dec 3, 2015 at 2:49 PM, Adrien Gillard <gillard.adrien@xxxxxxxxx> wrote:
I did some more tests : 

fio on a raw RBD volume (4K, numjob=32, QD=1) gives me around 3000 IOPS

I also tuned xfs mount options on client (I realized I didn't do that already) and with "largeio,inode64,swalloc,logbufs=8,logbsize=256k,attr2,auto,nodev,noatime,nodiratime" I get better performance :

4k-32-1-randwrite-libaio: (groupid=0, jobs=32): err= 0: pid=26793: Thu Dec  3 10:45:55 2015
  write: io=1685.3MB, bw=5720.1KB/s, iops=1430, runt=301652msec
    slat (usec): min=5, max=1620, avg=41.61, stdev=25.82
    clat (msec): min=1, max=4141, avg=14.61, stdev=112.55
     lat (msec): min=1, max=4141, avg=14.65, stdev=112.55
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
     | 70.00th=[    5], 80.00th=[    6], 90.00th=[    7], 95.00th=[    7],
     | 99.00th=[  227], 99.50th=[  717], 99.90th=[ 1844], 99.95th=[ 2245],
     | 99.99th=[ 3097]

So, more than 50% improvement, but it actually varies quite a lot between tests (sometimes I get only a bit more than 1000). If I run the test for 30 minutes it drops to 900 IOPS.
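To put numbers on that, here is a quick calculation. It assumes the baseline is the 847 IOPS from the first fio run on the XFS mount (quoted further down in the thread) and compares it with the figures mentioned here.

```python
def improvement_pct(baseline: float, new: float) -> float:
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


BASELINE_IOPS = 847  # assumed baseline: first fio run on the XFS mount

for label, iops in (("best run", 1430), ("weaker runs", 1000), ("30-minute run", 900)):
    print(f"{label}: {iops} IOPS, {improvement_pct(BASELINE_IOPS, iops):.1f}% over baseline")
```

Under that assumption, the best run is roughly 69% above the baseline, while the 30-minute run is only about 6% above it.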

As you suggested, I also filled a volume with zeros (dd if=/dev/zero of=/dev/rbd1 bs=1M) and then ran fio on the raw device; I didn't see a lot of improvement.

If I run the fio test directly on block devices I seem to saturate the spinners; [1] is a graph of the IO load on one of the OSD hosts.
[2] is the same OSD graph, but when the test is done on a device formatted with XFS and mounted on the client.
If I get half of the IOPS on the XFS volume because of the journal, shouldn't I get the same amount of IOPS on the backend?
[3] shows what happens if I run the test for 30 minutes.

During the fio tests on the raw device, load average on the OSD servers increases up to 13/14 and I get a bit of iowait (I guess because the OSDs are busy).
During the fio tests on the XFS-formatted device, load average on the OSD servers peaks at the beginning and decreases to 5/6, but goes through the roof on the client.
Scheduler is deadline for all the drives, I didn't try to change it yet.

What I don't understand, even with your explanations, are the rados results. From what I understand it performs at the RADOS level and thus should not be impacted by client filesystem.
Given the results above I guess you are right and this has to do with the client filesystem.

The cluster will be used for backups. Write IO size during backups is around 150/200K (I guess mostly sequential) and I am looking for the highest bandwidth and parallelization.

@Nick, I will try to create a new stand alone replicated pool.



On Thu, Dec 3, 2015 at 1:30 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:

Couple of things to check

1.      Can you create just a normal non cached pool and test performance to rule out any funnies going on there.

2.      Can you also run something like iostat during the benchmarks and see if it looks like all your disks are getting saturated.
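For the second point, a small sketch of how a utilisation figure can be derived without iostat, by sampling /proc/diskstats twice. It relies on the 13th field of each line being the number of milliseconds the device spent doing I/O, which is the counter a %util figure is normally based on. The function names and the snapshot values below are made up for illustration.

```python
from typing import Dict


def parse_io_ticks(diskstats_text: str) -> Dict[str, int]:
    """Map device name -> milliseconds spent doing I/O (13th field of /proc/diskstats)."""
    ticks = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 13:
            continue  # not a full diskstats line
        ticks[fields[2]] = int(fields[12])
    return ticks


def utilization_pct(before: str, after: str, interval_ms: float) -> Dict[str, float]:
    """Percentage of the interval each device was busy, from two snapshots."""
    first, second = parse_io_ticks(before), parse_io_ticks(after)
    return {
        dev: 100.0 * (second[dev] - first[dev]) / interval_ms
        for dev in second
        if dev in first
    }


# Synthetic snapshots taken one second apart (illustrative values only).
BEFORE = "   8       0 sda 1000 0 8000 500 2000 0 16000 1500 0 1000 2000\n"
AFTER = "   8       0 sda 1000 0 8000 500 2900 0 23200 2400 0 1950 2950\n"
print(utilization_pct(BEFORE, AFTER, interval_ms=1000))
```

On an OSD host, the two snapshots would come from reading /proc/diskstats, sleeping for the interval, and reading it again. Values close to 100% on the spinners during the benchmark would suggest they are saturated.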



      _____________________________________________
      From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Adrien Gillard
      Sent: 02 December 2015 21:33
      To: ceph-users@xxxxxxxx
      Subject:  New cluster performance analysis


      Hi everyone, 

       

      I am currently testing our new cluster and I would like some feedback on the numbers I am getting.

       

      For the hardware :

      7 x OSD : 2 x Intel 2640v3 (8x2.6GHz), 64GB RAM, 2x10Gbits LACP for public net., 2x10Gbits LACP for cluster net., MTU 9000

      1 x MON : 2 x Intel 2630L (6x2GHz), 32GB RAM and Intel DC SSD, 2x10Gbits LACP for public net., MTU 9000

      2 x MON : VMs (8 cores, 8GB RAM), backed by SSD

       

      Journals are 20GB partitions on SSD

       

      The system is CentOS 7.1 with stock kernel (3.10.0-229.20.1.el7.x86_64). No particular system optimizations.

       

      Ceph is Infernalis from Ceph repository  : ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)

       

      [cephadm@cph-adm-01  ~/scripts]$ ceph -s
          cluster 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
           health HEALTH_OK
           monmap e1: 3 mons at {clb-cph-frpar1-mon-02=x.x.x.2:6789/0,clb-cph-frpar2-mon-01=x.x.x.1:6789/0,clb-cph-frpar2-mon-03=x.x.x.3:6789/0}
                  election epoch 62, quorum 0,1,2 clb-cph-frpar2-mon-01,clb-cph-frpar1-mon-02,clb-cph-frpar2-mon-03
           osdmap e844: 84 osds: 84 up, 84 in
                  flags sortbitwise
            pgmap v111655: 3136 pgs, 3 pools, 3166 GB data, 19220 kobjects
                  8308 GB used, 297 TB / 305 TB avail
                      3136 active+clean

       

      My ceph.conf :

       

      [global]
      fsid = 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
      mon_initial_members = clb-cph-frpar2-mon-01, clb-cph-frpar1-mon-02, clb-cph-frpar2-mon-03
      mon_host = x.x.x.1,x.x.x.2,x.x.x.3
      auth_cluster_required = cephx
      auth_service_required = cephx
      auth_client_required = cephx
      filestore_xattr_use_omap = true
      public network = 10.25.25.0/24
      cluster network = 10.25.26.0/24
      debug_lockdep = 0/0
      debug_context = 0/0
      debug_crush = 0/0
      debug_buffer = 0/0
      debug_timer = 0/0
      debug_filer = 0/0
      debug_objecter = 0/0
      debug_rados = 0/0
      debug_rbd = 0/0
      debug_journaler = 0/0
      debug_objectcatcher = 0/0
      debug_client = 0/0
      debug_osd = 0/0
      debug_optracker = 0/0
      debug_objclass = 0/0
      debug_filestore = 0/0
      debug_journal = 0/0
      debug_ms = 0/0
      debug_monc = 0/0
      debug_tp = 0/0
      debug_auth = 0/0
      debug_finisher = 0/0
      debug_heartbeatmap = 0/0
      debug_perfcounter = 0/0
      debug_asok = 0/0
      debug_throttle = 0/0
      debug_mon = 0/0
      debug_paxos = 0/0
      debug_rgw = 0/0

      [osd]
      osd journal size = 0
      osd mount options xfs = "rw,noatime,inode64,logbufs=8,logbsize=256k"
      filestore min sync interval = 5
      filestore max sync interval = 15
      filestore queue max ops = 2048
      filestore queue max bytes = 1048576000
      filestore queue committing max ops = 4096
      filestore queue committing max bytes = 1048576000
      filestore op thread = 32
      filestore journal writeahead = true
      filestore merge threshold = 40
      filestore split multiple = 8

      journal max write bytes = 1048576000
      journal max write entries = 4096
      journal queue max ops = 8092
      journal queue max bytes = 1048576000

      osd max write size = 512
      osd op threads = 16
      osd disk threads = 2
      osd op num threads per shard = 3
      osd op num shards = 10
      osd map cache size = 1024
      osd max backfills = 1
      osd recovery max active = 2

       

      I have set up 2 pools : one for cache with 3x replication in front of an EC pool. At the moment I am only interested in the cache pool, so no promotions/flushes/evictions happen. 

      (I know, I am using the same set of OSD for hot and cold data, but in my use case they should not be used at the same time.)

       

      I am accessing the cluster via RBD volumes mapped with the kernel module on CentOS 7.1. These volumes are formatted in XFS on the clients.

       

      The journal SSDs seem to perform quite well according to the results of Sebastien Han’s benchmark suggestion (they are Sandisk) :

      write: io=22336MB, bw=381194KB/s, iops=95298, runt= 60001msec (this is for numjob=10)

       

      Here are the rados bench tests :

       

      rados bench -p rbdcache 120 write -b 4K -t 32 --no-cleanup

      Total time run:         121.410763
      Total writes made:      65357
      Write size:             4096
      Bandwidth (MB/sec):     2.1
      Stddev Bandwidth:       0.597
      Max bandwidth (MB/sec): 3.89
      Min bandwidth (MB/sec): 0.00781
      Average IOPS:           538
      Stddev IOPS:            152
      Max IOPS:               995
      Min IOPS:               2
      Average Latency:        0.0594
      Stddev Latency:         0.18
      Max latency:            2.82
      Min latency:            0.00494
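As a sanity check on the rados bench figures above, the reported IOPS, latency, and bandwidth can be cross-checked against each other. The sketch below uses Little's law (throughput = requests in flight / average latency) and assumes that all 32 concurrent operations (-t 32) stay in flight for the whole run. The helper names are made up for this example.

```python
def iops_from_latency(in_flight: int, avg_latency_s: float) -> float:
    """Little's law: throughput = concurrency / average latency."""
    return in_flight / avg_latency_s


def bandwidth_mb_s(iops: float, block_bytes: int) -> float:
    """Bandwidth in MB/s (2**20 bytes) for a given IOPS rate and block size."""
    return iops * block_bytes / (1024 * 1024)


# 32 concurrent ops and 0.0594 s average latency, as reported by rados bench.
print(round(iops_from_latency(32, 0.0594)))    # 539, close to the reported 538
# 538 IOPS at 4096 bytes per write.
print(round(bandwidth_mb_s(538, 4096), 2))     # 2.1, matching the reported bandwidth
```

The numbers are self-consistent, which suggests that rados bench here is bounded by per-operation latency at this concurrency rather than misreporting anything.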

       

      And the results of the fio test with the following parameters :

       

      [global]
      size=8G
      runtime=300
      ioengine=libaio
      invalidate=1
      direct=1
      sync=1
      fsync=1
      numjobs=32
      rw=randwrite
      name=4k-32-1-randwrite-libaio
      blocksize=4K
      iodepth=1
      directory=/mnt/rbd
      group_reporting=1

      4k-32-1-randwrite-libaio: (groupid=0, jobs=32): err= 0: pid=20442: Wed Dec  2 21:38:30 2015
        write: io=992.11MB, bw=3389.3KB/s, iops=847, runt=300011msec
          slat (usec): min=5, max=4726, avg=40.32, stdev=41.28
          clat (msec): min=2, max=2208, avg=19.35, stdev=74.34
           lat (msec): min=2, max=2208, avg=19.39, stdev=74.34
          clat percentiles (msec):
           |  1.00th=[    3],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
           | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
           | 70.00th=[    6], 80.00th=[    7], 90.00th=[   38], 95.00th=[   63],
           | 99.00th=[  322], 99.50th=[  570], 99.90th=[ 1074], 99.95th=[ 1221],
           | 99.99th=[ 1532]
          bw (KB  /s): min=    1, max=  448, per=3.64%, avg=123.48, stdev=102.09
          lat (msec) : 4=30.30%, 10=51.27%, 20=1.71%, 50=9.91%, 100=4.03%
          lat (msec) : 250=1.55%, 500=0.62%, 750=0.33%, 1000=0.16%
        cpu          : usr=0.09%, sys=0.25%, ctx=963114, majf=0, minf=928
        IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
           submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
           complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
           issued    : total=r=0/w=254206/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
           latency   : target=0, window=0, percentile=100.00%, depth=1

      Run status group 0 (all jobs):
        WRITE: io=992.11MB, aggrb=3389KB/s, minb=3389KB/s, maxb=3389KB/s, mint=300011msec, maxt=300011msec

      Disk stats (read/write):
        rbd0: ios=0/320813, merge=0/10001, ticks=0/5670847, in_queue=5677825, util=100.00%



      And a job closer to what the actual workload would be (blocksize=200K, numjob=16, QD=32)

      200k-16-32-randwrite-libaio: (groupid=0, jobs=16): err= 0: pid=4828: Wed Dec  2 18:58:53 2015
        write: io=47305MB, bw=161367KB/s, iops=806, runt=300189msec
          slat (usec): min=17, max=358430, avg=155.11, stdev=2361.49
          clat (msec): min=9, max=3584, avg=613.88, stdev=168.68
           lat (msec): min=10, max=3584, avg=614.04, stdev=168.66
          clat percentiles (msec):
           |  1.00th=[  375],  5.00th=[  469], 10.00th=[  502], 20.00th=[  537],
           | 30.00th=[  553], 40.00th=[  578], 50.00th=[  594], 60.00th=[  603],
           | 70.00th=[  627], 80.00th=[  652], 90.00th=[  701], 95.00th=[  881],
           | 99.00th=[ 1205], 99.50th=[ 1483], 99.90th=[ 2638], 99.95th=[ 2671],
           | 99.99th=[ 2999]
          bw (KB  /s): min=  260, max=18181, per=6.31%, avg=10189.40, stdev=2009.86
          lat (msec) : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.02%, 250=0.08%
          lat (msec) : 500=9.26%, 750=83.21%, 1000=4.09%
        cpu          : usr=0.22%, sys=0.55%, ctx=719279, majf=0, minf=433
        IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.8%, >=64=0.0%
           submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
           complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
           issued    : total=r=0/w=242203/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
           latency   : target=0, window=0, percentile=100.00%, depth=32

      Run status group 0 (all jobs):
        WRITE: io=47305MB, aggrb=161367KB/s, minb=161367KB/s, maxb=161367KB/s, mint=300189msec, maxt=300189msec

      Disk stats (read/write):
        rbd0: ios=1/287809, merge=0/18393, ticks=50/5887593, in_queue=5887504, util=99.91%


      The 4k block performance does not interest me so much but is given as a reference. I am looking more for throughput, but in any case the numbers seem quite low.

      Let's take IOPS: assuming the spinners can each do 50 synced, sustained 4k IOPS (I hope they can do more ^^), we should be around 50 x 84 / 3 = 1400 IOPS, which is far from what rados bench (538) and fio (847) report. Surprisingly, the fio numbers are also higher than the rados ones.
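The back-of-the-envelope estimate above can be written out as follows. It is only a sketch of the same arithmetic: 50 IOPS per spinner is the assumption stated in the message, 84 is the OSD count, and 3 is the replication factor of the pool. It deliberately ignores journals, network, and any other overhead.

```python
def expected_write_iops(per_disk_iops: float, num_osds: int, replicas: int) -> float:
    """Naive cluster write IOPS: every client write lands on `replicas` disks."""
    return per_disk_iops * num_osds / replicas


estimate = expected_write_iops(per_disk_iops=50, num_osds=84, replicas=3)
print(estimate)  # 1400.0

# Measured values quoted in the thread, as a fraction of the estimate.
for name, measured in (("rados bench", 538), ("fio", 847)):
    print(f"{name}: {measured} IOPS, ratio to estimate = {measured / estimate:.2f}")
```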

      So I don't know whether I am missing something here or if something is going wrong (maybe both!).

      Any input would be very valuable.

      Thank you,

      Adrien





-- 
-----------------------------------------------------------------------------------------
Adrien GILLARD

+33 (0)6 29 06 16 31
gillard.adrien@xxxxxxxxx



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
