Thanks for your suggestion; I have re-tested my cluster and the result is
much better.

Regards,
Yang

------------------ Original ------------------
From: "Christian Balzer" <chibi@xxxxxxx>
Date: Tue, Feb 23, 2016 09:49 PM
To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Cc: "yang" <justyuyang@xxxxxxxxxxx>
Subject: Re: Why my cluster performance is so bad?

Hello,

This is sort of a FAQ; google is your friend.
For example, find the recent thread "Performance Testing of CEPH on ARM
MicroServer" in this ML, which addresses some points pertinent to your
query. Read it; I will reference things from it below.

On Tue, 23 Feb 2016 19:55:22 +0800 yang wrote:

> My ceph cluster config:
Kernel, OS, Ceph version?

> 7 nodes (including 3 mons, 3 mds).
> 9 SATA HDDs in every node, each HDD as an OSD & journal (deployed by
> ceph-deploy).
What replication, the default of 3? That would give you the theoretical
IOPS of 21 HDDs, but your slow (more precisely, high-latency) network and
the lack of SSD journals mean it will be even lower than that.

> CPU: 32 cores
> Mem: 64GB
> public network: 1Gb x2 bond0,
> cluster network: 1Gb x2 bond0.
Latency in that kind of network will slow you down, especially when doing
small I/Os.

As always, atop is a very nice tool to find where the bottlenecks and
hotspots are; you will have to run it (preferably on all storage nodes,
with nice large terminal windows) to get the most out of it, though.

> The read bw is 109910KB/s for 1M-read, and 34329KB/s for 1M-write.
> Why is it so bad?
Because your testing is flawed.

> Can anyone give me some suggestions?
For starters, to get a good baseline, do rados bench tests (see thread)
with the default block size (4MB) and a 4KB size; example commands are
sketched below the job file.

> fio jobfile:
> [global]
> direct=1
> thread
Not sure how this affects things versus the default of fork.

> ioengine=psync
Definitely never used this; either use libaio or the rbd engine in newer
fio versions (examples below).

> size=10G
> runtime=300
> time_based
> iodepth=10
This is your main problem: Ceph/RBD does not do well with a low number of
threads, simply because you're likely to hit just a single OSD for a
prolonged time, thus getting more or less single-disk speeds.
See more about this in the results below.

> group_reporting
> stonewall
> filename=/mnt/rbd/data
Are we to assume that this is mounted via the kernel RBD module?
Where, on a different client node that's not part of the cluster? Which FS?

> [read1M]
> bs=1M
> rw=read
> numjobs=1
> name=read1M
>
> [write1M]
> bs=1M
> rw=write
> numjobs=1
> name=write1M
>
> [read4k-seq]
> bs=4k
> rw=read
> numjobs=8
> name=read4k-seq
>
> [read4k-rand]
> bs=4k
> rw=randread
> numjobs=8
> name=read4k-rand
>
> [write4k-seq]
> bs=4k
> rw=write
> numjobs=8
> name=write4k-seq
>
> [write4k-rand]
> bs=4k
> rw=randwrite
> numjobs=8
> name=write4k-rand
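To make the rados bench suggestion above concrete, a minimal, untested
sketch run from your client node (the pool name "rbd", the 60-second
runtime and the 32 threads are my assumptions, adjust to taste):

  # 60s of writes at the default 4MB object size; keep the objects
  # around so the read test has something to read
  rados bench -p rbd 60 write -t 32 --no-cleanup
  # sequential reads of the objects written above
  rados bench -p rbd 60 seq -t 32
  # the same again with 4KB objects to see small-I/O behaviour
  rados bench -p rbd 60 write -t 32 -b 4096 --no-cleanup
  rados bench -p rbd 60 seq -t 32
  # remove the benchmark objects when done
  rados -p rbd cleanup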
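And if you stick with fio on the kernel-mounted RBD: psync submits
exactly one synchronous I/O per thread, so your iodepth=10 is silently
ignored (the "IO depths : 1=100.0%" lines in your results below confirm
this), whereas libaio with direct=1 honours it. An untested variant of
your [global] section (iodepth=32 is an arbitrary starting point; your
individual job sections can stay as they are):

  [global]
  direct=1
  ioengine=libaio
  iodepth=32
  size=10G
  runtime=300
  time_based
  group_reporting
  filename=/mnt/rbd/data

  # example job, unchanged from your file
  [write4k-rand]
  bs=4k
  rw=randwrite
  numjobs=8
  name=write4k-rand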
>
> and the fio result is as follows:
>
> read1M: (g=0): rw=read, bs=1M-1M/1M-1M/1M-1M, ioengine=psync, iodepth=10
> write1M: (g=1): rw=write, bs=1M-1M/1M-1M/1M-1M, ioengine=psync, iodepth=10
> read4k-seq: (g=2): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=10
> ...
> read4k-rand: (g=3): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=10
> ...
> write4k-seq: (g=4): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=10
> ...
> write4k-rand: (g=5): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=10
> ...
> fio-2.3
> Starting 34 threads
> read1M: Laying out IO file(s) (1 file(s) / 10240MB)
> Jobs: 8 (f=8): [_(26),w(8)] [18.8% done] [0KB/1112KB/0KB /s] [0/278/0 iops] [eta 02h:10m:00s]
> read1M: (groupid=0, jobs=1): err= 0: pid=17606: Tue Feb 23 14:28:45 2016
>   read : io=32201MB, bw=109910KB/s, iops=107, runt=300007msec
>     clat (msec): min=1, max=74, avg= 9.31, stdev= 2.78
>      lat (msec): min=1, max=74, avg= 9.31, stdev= 2.78
>     clat percentiles (usec):
>      |  1.00th=[ 1448],  5.00th=[ 2040], 10.00th=[ 3952], 20.00th=[ 9792],
>      | 30.00th=[ 9920], 40.00th=[ 9920], 50.00th=[ 9920], 60.00th=[10048],
>      | 70.00th=[10176], 80.00th=[10304], 90.00th=[10688], 95.00th=[10944],
>      | 99.00th=[11968], 99.50th=[19072], 99.90th=[27008], 99.95th=[29568],
>      | 99.99th=[38144]
>     bw (KB /s): min=93646, max=139912, per=100.00%, avg=110022.09, stdev=7759.48
>     lat (msec) : 2=4.20%, 4=5.98%, 10=43.37%, 20=46.00%, 50=0.45%
>     lat (msec) : 100=0.01%
>   cpu : usr=0.05%, sys=0.81%, ctx=32209, majf=0, minf=1055
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
According to this output, the IO depth was actually 1, not 10, probably
caused by your choice of engine or the threads option.
And this explains a LOT of your results.
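If you build a recent fio with rbd support, you can also take the kernel
mount and the filesystem out of the picture entirely and drive librbd
directly. A rough, untested sketch; the pool, image and client names are
placeholders, and the image must exist beforehand (e.g. created with
"rbd create fio-test --size 10240"):

  [global]
  ioengine=rbd
  clientname=admin
  pool=rbd
  rbdname=fio-test
  invalidate=0
  iodepth=32
  runtime=300
  time_based

  [rand-write-4k]
  bs=4k
  rw=randwrite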
Regards,

Christian

>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=32201/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=10
> write1M: (groupid=1, jobs=1): err= 0: pid=23779: Tue Feb 23 14:28:45 2016
>   write: io=10058MB, bw=34329KB/s, iops=33, runt=300018msec
>     clat (msec): min=20, max=565, avg=29.80, stdev= 8.84
>      lat (msec): min=20, max=565, avg=29.83, stdev= 8.84
>     clat percentiles (msec):
>      |  1.00th=[   22],  5.00th=[   22], 10.00th=[   23], 20.00th=[   30],
>      | 30.00th=[   31], 40.00th=[   31], 50.00th=[   31], 60.00th=[   31],
>      | 70.00th=[   31], 80.00th=[   32], 90.00th=[   32], 95.00th=[   33],
>      | 99.00th=[   35], 99.50th=[   38], 99.90th=[  118], 99.95th=[  219],
>      | 99.99th=[  322]
>     bw (KB /s): min= 3842, max=40474, per=100.00%, avg=34408.82, stdev=2751.05
>     lat (msec) : 50=99.83%, 100=0.06%, 250=0.06%, 500=0.04%, 750=0.01%
>   cpu : usr=0.11%, sys=0.22%, ctx=10101, majf=0, minf=1050
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=10058/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=10
> read4k-seq: (groupid=2, jobs=8): err= 0: pid=27771: Tue Feb 23 14:28:45 2016
>   read : io=12892MB, bw=44003KB/s, iops=11000, runt=300002msec
>     clat (usec): min=143, max=38808, avg=725.61, stdev=457.02
>      lat (usec): min=143, max=38808, avg=725.75, stdev=457.03
>     clat percentiles (usec):
>      |  1.00th=[  270],  5.00th=[  358], 10.00th=[  398], 20.00th=[  462],
>      | 30.00th=[  510], 40.00th=[  548], 50.00th=[  588], 60.00th=[  652],
>      | 70.00th=[  732], 80.00th=[  876], 90.00th=[ 1176], 95.00th=[ 1576],
>      | 99.00th=[ 2640], 99.50th=[ 3024], 99.90th=[ 4128], 99.95th=[ 4448],
>      | 99.99th=[ 4960]
>     bw (KB /s): min=  958, max=12784, per=12.51%, avg=5505.10, stdev=2094.64
>     lat (usec) : 250=0.27%, 500=27.64%, 750=44.00%, 1000=13.45%
>     lat (msec) : 2=11.65%, 4=2.88%, 10=0.12%, 20=0.01%, 50=0.01%
>   cpu : usr=0.44%, sys=1.64%, ctx=3300370, majf=0, minf=237
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=3300226/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=10
> read4k-rand: (groupid=3, jobs=8): err= 0: pid=29341: Tue Feb 23 14:28:45 2016
>   read : io=3961.9MB, bw=13520KB/s, iops=3380, runt=300061msec
>     clat (usec): min=222, max=1033.9K, avg=2364.21, stdev=7609.07
>      lat (usec): min=222, max=1033.9K, avg=2364.37, stdev=7609.07
>     clat percentiles (usec):
>      |  1.00th=[  402],  5.00th=[  474], 10.00th=[  556], 20.00th=[  684],
>      | 30.00th=[  772], 40.00th=[  828], 50.00th=[  876], 60.00th=[  924],
>      | 70.00th=[  980], 80.00th=[ 1048], 90.00th=[ 1304], 95.00th=[ 9408],
>      | 99.00th=[40704], 99.50th=[55040], 99.90th=[85504], 99.95th=[98816],
>      | 99.99th=[130560]
>     bw (KB /s): min=   16, max= 3096, per=12.53%, avg=1694.57, stdev=375.66
>     lat (usec) : 250=0.01%, 500=6.78%, 750=19.91%, 1000=46.99%
>     lat (msec) : 2=18.68%, 4=0.77%, 10=2.07%, 20=2.00%, 50=2.17%
>     lat (msec) : 100=0.58%, 250=0.05%, 500=0.01%, 2000=0.01%
>   cpu : usr=0.19%, sys=0.56%, ctx=1014562, majf=0, minf=3463
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=1014228/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=10
> write4k-seq: (groupid=4, jobs=8): err= 0: pid=1012: Tue Feb 23 14:28:45 2016
>   write: io=417684KB, bw=1392.2KB/s, iops=348, runt=300025msec
>     clat (msec): min=1, max=961, avg=22.98, stdev=61.67
>      lat (msec): min=1, max=961, avg=22.98, stdev=61.67
>     clat percentiles (msec):
>      |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
>      | 30.00th=[    3], 40.00th=[    3], 50.00th=[    3], 60.00th=[    3],
>      | 70.00th=[    4], 80.00th=[    4], 90.00th=[   76], 95.00th=[  151],
>      | 99.00th=[  310], 99.50th=[  379], 99.90th=[  529], 99.95th=[  586],
>      | 99.99th=[  791]
>     bw (KB /s): min=    4, max= 1568, per=12.88%, avg=179.28, stdev=149.54
>     lat (msec) : 2=0.02%, 4=84.42%, 10=0.15%, 20=0.02%, 50=2.71%
>     lat (msec) : 100=4.71%, 250=6.14%, 500=1.70%, 750=0.11%, 1000=0.01%
>   cpu : usr=0.04%, sys=0.26%, ctx=208926, majf=0, minf=223
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=104421/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=10
> write4k-rand: (groupid=5, jobs=8): err= 0: pid=5210: Tue Feb 23 14:28:45 2016
>   write: io=358724KB, bw=1195.7KB/s, iops=298, runt=300025msec
>     clat (msec): min=2, max=1424, avg=26.76, stdev=45.71
>      lat (msec): min=2, max=1424, avg=26.76, stdev=45.71
>     clat percentiles (msec):
>      |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
>      | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
>      | 70.00th=[    5], 80.00th=[   62], 90.00th=[   92], 95.00th=[  116],
>      | 99.00th=[  165], 99.50th=[  192], 99.90th=[  330], 99.95th=[  506],
>      | 99.99th=[ 1074]
>     bw (KB /s): min=    3, max= 1045, per=12.62%, avg=150.75, stdev=72.67
>     lat (msec) : 4=66.90%, 10=4.74%, 20=0.04%, 50=4.10%, 100=16.32%
>     lat (msec) : 250=7.71%, 500=0.12%, 750=0.03%, 1000=0.02%, 2000=0.01%
>   cpu : usr=0.04%, sys=0.22%, ctx=181061, majf=0, minf=3460
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=89681/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=10
>
> Run status group 0 (all jobs):
>    READ: io=32201MB, aggrb=109910KB/s, minb=109910KB/s, maxb=109910KB/s, mint=300007msec, maxt=300007msec
>
> Run status group 1 (all jobs):
>   WRITE: io=10058MB, aggrb=34329KB/s, minb=34329KB/s, maxb=34329KB/s, mint=300018msec, maxt=300018msec
>
> Run status group 2 (all jobs):
>    READ: io=12892MB, aggrb=44002KB/s, minb=44002KB/s, maxb=44002KB/s, mint=300002msec, maxt=300002msec
>
> Run status group 3 (all jobs):
>    READ: io=3961.9MB, aggrb=13520KB/s, minb=13520KB/s, maxb=13520KB/s, mint=300061msec, maxt=300061msec
>
> Run status group 4 (all jobs):
>   WRITE: io=417684KB, aggrb=1392KB/s, minb=1392KB/s, maxb=1392KB/s, mint=300025msec, maxt=300025msec
>
> Run status group 5 (all jobs):
>   WRITE: io=358724KB, aggrb=1195KB/s, minb=1195KB/s, maxb=1195KB/s, mint=300025msec, maxt=300025msec
>
> Disk stats (read/write):
>   rbd0: ios=4378850/204497, merge=0/10238, ticks=5213536/1162412, in_queue=6374696, util=99.44%

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com