Thanks for your suggestion; I have re-tested my cluster and the result is
much better.

Regards,
Yang

------------------ Original ------------------
From: "Christian Balzer" <chibi@xxxxxxx>
Date: Tue, Feb 23, 2016 09:49 PM
To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Cc: "yang" <justyuyang@xxxxxxxxxxx>
Subject: Re: Why my cluster performance is so bad?

Hello,

This is sort of a FAQ; google is your friend.
For example, find the recent thread "Performance Testing of CEPH on ARM
MicroServer" in this ML, which addresses some points pertinent to your
query. Read it; I will reference things from it below.

On Tue, 23 Feb 2016 19:55:22 +0800 yang wrote:

> My ceph cluster config:
Kernel, OS, Ceph version?

> 7 nodes (including 3 mons, 3 mds).
> 9 SATA HDDs in every node, each HDD as an OSD & journal (deployed by
> ceph-deploy).
What replication, the default of 3? That would give you the theoretical
IOPS of 21 HDDs, but your slow (more precisely, high-latency) network and
the lack of SSD journals mean it will be even lower than that.

> CPU: 32 cores
> Mem: 64GB
> public network: 1Gb x2 bond0,
> cluster network: 1Gb x2 bond0.
Latency in that kind of network will slow you down, especially when doing
small I/Os.

As always, atop is a very nice tool to find where the bottlenecks and
hotspots are; you will have to run it (preferably on all storage nodes,
with nice large terminal windows) to get the most out of it, though.

> The read bw is 109910KB/s for 1M-read, and 34329KB/s for 1M-write.
> Why is it so bad?
Because your testing is flawed.

> Can anyone give me some suggestions?
For starters, to get a good baseline, do rados bench tests (see thread)
with the default block size (4MB) and a 4KB size; example commands are
sketched below the job file.

> fio jobfile:
> [global]
> direct=1
> thread
Not sure how this affects things versus the default of fork.

> ioengine=psync
Definitely never used this; either use libaio or the rbd engine in newer
fio versions (examples below).

> size=10G
> runtime=300
> time_based
> iodepth=10
This is your main problem: Ceph/RBD does not do well with a low number of
threads, simply because you're likely to hit just a single OSD for a
prolonged time, thus getting more or less single-disk speeds.
See more about this in the results below.

> group_reporting
> stonewall
> filename=/mnt/rbd/data
Are we to assume that this is mounted via the kernel RBD module?
Where, on a different client node that's not part of the cluster? Which FS?

> [read1M]
> bs=1M
> rw=read
> numjobs=1
> name=read1M
>
> [write1M]
> bs=1M
> rw=write
> numjobs=1
> name=write1M
>
> [read4k-seq]
> bs=4k
> rw=read
> numjobs=8
> name=read4k-seq
>
> [read4k-rand]
> bs=4k
> rw=randread
> numjobs=8
> name=read4k-rand
>
> [write4k-seq]
> bs=4k
> rw=write
> numjobs=8
> name=write4k-seq
>
> [write4k-rand]
> bs=4k
> rw=randwrite
> numjobs=8
> name=write4k-rand
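To make the rados bench suggestion above concrete, a minimal, untested
sketch run from your client node (the pool name "rbd", the 60-second
runtime and the 32 threads are my assumptions, adjust to taste):

  # 60s of writes at the default 4MB object size; keep the objects
  # around so the read test has something to read
  rados bench -p rbd 60 write -t 32 --no-cleanup
  # sequential reads of the objects written above
  rados bench -p rbd 60 seq -t 32
  # the same again with 4KB objects to see small-I/O behaviour
  rados bench -p rbd 60 write -t 32 -b 4096 --no-cleanup
  rados bench -p rbd 60 seq -t 32
  # remove the benchmark objects when done
  rados -p rbd cleanup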
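And if you stick with fio on the kernel-mounted RBD: psync submits
exactly one synchronous I/O per thread, so your iodepth=10 is silently
ignored (the "IO depths : 1=100.0%" lines in your results below confirm
this), whereas libaio with direct=1 honours it. An untested variant of
your [global] section (iodepth=32 is an arbitrary starting point; your
individual job sections can stay as they are):

  [global]
  direct=1
  ioengine=libaio
  iodepth=32
  size=10G
  runtime=300
  time_based
  group_reporting
  filename=/mnt/rbd/data

  # example job, unchanged from your file
  [write4k-rand]
  bs=4k
  rw=randwrite
  numjobs=8
  name=write4k-rand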
>
> and the fio result is as follows:
>
> read1M: (g=0): rw=read, bs=1M-1M/1M-1M/1M-1M, ioengine=psync, iodepth=10
> write1M: (g=1): rw=write, bs=1M-1M/1M-1M/1M-1M, ioengine=psync, iodepth=10
> read4k-seq: (g=2): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=10
> ...
> read4k-rand: (g=3): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=10
> ...
> write4k-seq: (g=4): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=10
> ...
> write4k-rand: (g=5): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=10
> ...
> fio-2.3
> Starting 34 threads
> read1M: Laying out IO file(s) (1 file(s) / 10240MB)
> Jobs: 8 (f=8): [_(26),w(8)] [18.8% done] [0KB/1112KB/0KB /s] [0/278/0 iops] [eta 02h:10m:00s]
> read1M: (groupid=0, jobs=1): err= 0: pid=17606: Tue Feb 23 14:28:45 2016
>   read : io=32201MB, bw=109910KB/s, iops=107, runt=300007msec
>     clat (msec): min=1, max=74, avg= 9.31, stdev= 2.78
>      lat (msec): min=1, max=74, avg= 9.31, stdev= 2.78
>     clat percentiles (usec):
>      |  1.00th=[ 1448],  5.00th=[ 2040], 10.00th=[ 3952], 20.00th=[ 9792],
>      | 30.00th=[ 9920], 40.00th=[ 9920], 50.00th=[ 9920], 60.00th=[10048],
>      | 70.00th=[10176], 80.00th=[10304], 90.00th=[10688], 95.00th=[10944],
>      | 99.00th=[11968], 99.50th=[19072], 99.90th=[27008], 99.95th=[29568],
>      | 99.99th=[38144]
>     bw (KB /s): min=93646, max=139912, per=100.00%, avg=110022.09, stdev=7759.48
>     lat (msec) : 2=4.20%, 4=5.98%, 10=43.37%, 20=46.00%, 50=0.45%
>     lat (msec) : 100=0.01%
>   cpu : usr=0.05%, sys=0.81%, ctx=32209, majf=0, minf=1055
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
According to this output, the IO depth was actually 1, not 10, probably
caused by your choice of engine or the threads option.
And this explains a LOT of your results.
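If you build a recent fio with rbd support, you can also take the kernel
mount and the filesystem out of the picture entirely and drive librbd
directly. A rough, untested sketch; the pool, image and client names are
placeholders, and the image must exist beforehand (e.g. created with
"rbd create fio-test --size 10240"):

  [global]
  ioengine=rbd
  clientname=admin
  pool=rbd
  rbdname=fio-test
  invalidate=0
  iodepth=32
  runtime=300
  time_based

  [rand-write-4k]
  bs=4k
  rw=randwrite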
Regards,

Christian

>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=32201/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=10
> write1M: (groupid=1, jobs=1): err= 0: pid=23779: Tue Feb 23 14:28:45 2016
>   write: io=10058MB, bw=34329KB/s, iops=33, runt=300018msec
>     clat (msec): min=20, max=565, avg=29.80, stdev= 8.84
>      lat (msec): min=20, max=565, avg=29.83, stdev= 8.84
>     clat percentiles (msec):
>      |  1.00th=[   22],  5.00th=[   22], 10.00th=[   23], 20.00th=[   30],
>      | 30.00th=[   31], 40.00th=[   31], 50.00th=[   31], 60.00th=[   31],
>      | 70.00th=[   31], 80.00th=[   32], 90.00th=[   32], 95.00th=[   33],
>      | 99.00th=[   35], 99.50th=[   38], 99.90th=[  118], 99.95th=[  219],
>      | 99.99th=[  322]
>     bw (KB /s): min= 3842, max=40474, per=100.00%, avg=34408.82, stdev=2751.05
>     lat (msec) : 50=99.83%, 100=0.06%, 250=0.06%, 500=0.04%, 750=0.01%
>   cpu : usr=0.11%, sys=0.22%, ctx=10101, majf=0, minf=1050
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=10058/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=10
> read4k-seq: (groupid=2, jobs=8): err= 0: pid=27771: Tue Feb 23 14:28:45 2016
>   read : io=12892MB, bw=44003KB/s, iops=11000, runt=300002msec
>     clat (usec): min=143, max=38808, avg=725.61, stdev=457.02
>      lat (usec): min=143, max=38808, avg=725.75, stdev=457.03
>     clat percentiles (usec):
>      |  1.00th=[  270],  5.00th=[  358], 10.00th=[  398], 20.00th=[  462],
>      | 30.00th=[  510], 40.00th=[  548], 50.00th=[  588], 60.00th=[  652],
>      | 70.00th=[  732], 80.00th=[  876], 90.00th=[ 1176], 95.00th=[ 1576],
>      | 99.00th=[ 2640], 99.50th=[ 3024], 99.90th=[ 4128], 99.95th=[ 4448],
>      | 99.99th=[ 4960]
>     bw (KB /s): min=  958, max=12784, per=12.51%, avg=5505.10, stdev=2094.64
>     lat (usec) : 250=0.27%, 500=27.64%, 750=44.00%, 1000=13.45%
>     lat (msec) : 2=11.65%, 4=2.88%, 10=0.12%, 20=0.01%, 50=0.01%
>   cpu : usr=0.44%, sys=1.64%, ctx=3300370, majf=0, minf=237
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=3300226/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=10
> read4k-rand: (groupid=3, jobs=8): err= 0: pid=29341: Tue Feb 23 14:28:45 2016
>   read : io=3961.9MB, bw=13520KB/s, iops=3380, runt=300061msec
>     clat (usec): min=222, max=1033.9K, avg=2364.21, stdev=7609.07
>      lat (usec): min=222, max=1033.9K, avg=2364.37, stdev=7609.07
>     clat percentiles (usec):
>      |  1.00th=[  402],  5.00th=[  474], 10.00th=[  556], 20.00th=[  684],
>      | 30.00th=[  772], 40.00th=[  828], 50.00th=[  876], 60.00th=[  924],
>      | 70.00th=[  980], 80.00th=[ 1048], 90.00th=[ 1304], 95.00th=[ 9408],
>      | 99.00th=[40704], 99.50th=[55040], 99.90th=[85504], 99.95th=[98816],
>      | 99.99th=[130560]
>     bw (KB /s): min=   16, max= 3096, per=12.53%, avg=1694.57, stdev=375.66
>     lat (usec) : 250=0.01%, 500=6.78%, 750=19.91%, 1000=46.99%
>     lat (msec) : 2=18.68%, 4=0.77%, 10=2.07%, 20=2.00%, 50=2.17%
>     lat (msec) : 100=0.58%, 250=0.05%, 500=0.01%, 2000=0.01%
>   cpu : usr=0.19%, sys=0.56%, ctx=1014562, majf=0, minf=3463
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=1014228/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=10
> write4k-seq: (groupid=4, jobs=8): err= 0: pid=1012: Tue Feb 23 14:28:45 2016
>   write: io=417684KB, bw=1392.2KB/s, iops=348, runt=300025msec
>     clat (msec): min=1, max=961, avg=22.98, stdev=61.67
>      lat (msec): min=1, max=961, avg=22.98, stdev=61.67
>     clat percentiles (msec):
>      |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
>      | 30.00th=[    3], 40.00th=[    3], 50.00th=[    3], 60.00th=[    3],
>      | 70.00th=[    4], 80.00th=[    4], 90.00th=[   76], 95.00th=[  151],
>      | 99.00th=[  310], 99.50th=[  379], 99.90th=[  529], 99.95th=[  586],
>      | 99.99th=[  791]
>     bw (KB /s): min=    4, max= 1568, per=12.88%, avg=179.28, stdev=149.54
>     lat (msec) : 2=0.02%, 4=84.42%, 10=0.15%, 20=0.02%, 50=2.71%
>     lat (msec) : 100=4.71%, 250=6.14%, 500=1.70%, 750=0.11%, 1000=0.01%
>   cpu : usr=0.04%, sys=0.26%, ctx=208926, majf=0, minf=223
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=104421/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=10
> write4k-rand: (groupid=5, jobs=8): err= 0: pid=5210: Tue Feb 23 14:28:45 2016
>   write: io=358724KB, bw=1195.7KB/s, iops=298, runt=300025msec
>     clat (msec): min=2, max=1424, avg=26.76, stdev=45.71
>      lat (msec): min=2, max=1424, avg=26.76, stdev=45.71
>     clat percentiles (msec):
>      |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
>      | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
>      | 70.00th=[    5], 80.00th=[   62], 90.00th=[   92], 95.00th=[  116],
>      | 99.00th=[  165], 99.50th=[  192], 99.90th=[  330], 99.95th=[  506],
>      | 99.99th=[ 1074]
>     bw (KB /s): min=    3, max= 1045, per=12.62%, avg=150.75, stdev=72.67
>     lat (msec) : 4=66.90%, 10=4.74%, 20=0.04%, 50=4.10%, 100=16.32%
>     lat (msec) : 250=7.71%, 500=0.12%, 750=0.03%, 1000=0.02%, 2000=0.01%
>   cpu : usr=0.04%, sys=0.22%, ctx=181061, majf=0, minf=3460
>   IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=0/w=89681/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=10
>
> Run status group 0 (all jobs):
>    READ: io=32201MB, aggrb=109910KB/s, minb=109910KB/s, maxb=109910KB/s, mint=300007msec, maxt=300007msec
>
> Run status group 1 (all jobs):
>   WRITE: io=10058MB, aggrb=34329KB/s, minb=34329KB/s, maxb=34329KB/s, mint=300018msec, maxt=300018msec
>
> Run status group 2 (all jobs):
>    READ: io=12892MB, aggrb=44002KB/s, minb=44002KB/s, maxb=44002KB/s, mint=300002msec, maxt=300002msec
>
> Run status group 3 (all jobs):
>    READ: io=3961.9MB, aggrb=13520KB/s, minb=13520KB/s, maxb=13520KB/s, mint=300061msec, maxt=300061msec
>
> Run status group 4 (all jobs):
>   WRITE: io=417684KB, aggrb=1392KB/s, minb=1392KB/s, maxb=1392KB/s, mint=300025msec, maxt=300025msec
>
> Run status group 5 (all jobs):
>   WRITE: io=358724KB, aggrb=1195KB/s, minb=1195KB/s, maxb=1195KB/s, mint=300025msec, maxt=300025msec
>
> Disk stats (read/write):
>   rbd0: ios=4378850/204497, merge=0/10238, ticks=5213536/1162412, in_queue=6374696, util=99.44%

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com