Re: New cluster performance analysis

Jan Schermer <jan@xxxxxxxxxxx> · Wed, 2 Dec 2015 23:48:58 +0100

Let's take IOPS, assuming the spinners can do 50 (4k) synced sustained IOPS (I hope they can do more ^^), we should be around 50x84/3 = 1400 IOPS, which is far from rados bench (538) and fio (847). And surprisingly fio numbers are greater than rados.
I think the missing factor here is filesystem journal overhead. That would explain the strange numbers you are seeing and the low performance in rados bench: every filesystem metadata operation has to do at least one (synced) op to the journal, and that's not only file creation but also file growth (or filling holes). And that applies on the OSD as well as on the client filesystem side(!).
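
To put rough numbers on this, here is a minimal Python sketch of the estimate. The function name and the `journal_ops_per_write` parameter are illustrative assumptions, not Ceph settings, and "one extra synced op per write" is only a crude model of the journal overhead described above, not a measurement.

```python
def expected_write_iops(per_disk_iops, num_osds, replicas, journal_ops_per_write=0):
    """Aggregate client write IOPS if each client write costs `replicas`
    disk writes, each followed by `journal_ops_per_write` extra synced ops.
    (Hypothetical model for illustration only.)"""
    disk_ops_per_client_write = replicas * (1 + journal_ops_per_write)
    return per_disk_iops * num_osds / disk_ops_per_client_write


# Figures from the thread: 50 IOPS per spinner, 84 OSDs, 3x replication.
baseline = expected_write_iops(50, 84, 3)
# Assumption: one additional synced journal op per data write.
with_journal = expected_write_iops(50, 84, 3, journal_ops_per_write=1)

print(f"baseline estimate:      {baseline:.0f} IOPS")
print(f"with one journal op:    {with_journal:.0f} IOPS")
for name, measured in (("rados bench", 538), ("fio", 847)):
    print(f"{name}: {measured} IOPS measured")
```

Under that assumption the estimate drops from 1400 to 700 IOPS, which is in the same range as the measured 538 and 847, so the journal explanation is at least plausible.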

To do a proper benchmark, first fill the RBD-mounted filesystem completely with data and then try again with fio on a preallocated file (and don't enable discard if that's supported).
Better yet, run fio on the block device itself, but overwrite it with dd if=/dev/zero first.
I think you'll get somewhat different numbers then.
Of course whether that's representative of what your usage pattern might be is another story.
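
A sketch of the raw-device variant, assuming the image is mapped as /dev/rbd0 (the name that appears in the fio disk stats) and that everything on it can be destroyed. The helper names are made up for illustration. The script only prints the dd command and a fio job file modelled on the original 4K job; it does not run anything.

```python
import shlex


def prefill_command(device, block_size="4M"):
    """dd invocation that overwrites the whole device with zeros.
    WARNING: running it destroys all data on `device`."""
    return ["dd", "if=/dev/zero", f"of={device}", f"bs={block_size}", "oflag=direct"]


def raw_device_fio_job(device, runtime_s=300):
    """fio job text targeting the block device directly (no filesystem),
    otherwise keeping the 4K random-write settings of the original job."""
    lines = [
        "[global]",
        "ioengine=libaio",
        "direct=1",
        "sync=1",
        "rw=randwrite",
        "blocksize=4K",
        "iodepth=1",
        "numjobs=32",
        f"runtime={runtime_s}",
        "group_reporting=1",
        "",
        "[4k-32-1-randwrite-raw]",
        f"filename={device}",  # raw device instead of directory=/mnt/rbd
    ]
    return "\n".join(lines) + "\n"


device = "/dev/rbd0"  # assumption: adjust to the actual mapped device
print(shlex.join(prefill_command(device)))
print(raw_device_fio_job(device))
```

Writing the job file out and passing it to fio after the dd pass has finished gives a run in which no filesystem allocation or journal activity on the client is involved.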

Can you tell us what workload should be running on this and what the expectations were?
Can you see something maxed out while the benchmark is running (CPU or drives)? Have you tried switching schedulers on the drives?

Jan

On 02 Dec 2015, at 22:33, Adrien Gillard <gillard.adrien@xxxxxxxxx> wrote:

Hi everyone,

I am currently testing our new cluster and I would like some feedback on the numbers I am getting.

For the hardware :
7 x OSD : 2 x Intel 2640v3 (8x2.6GHz), 64GB RAM, 2x10Gbits LACP for public net., 2x10Gbits LACP for cluster net., MTU 9000
1 x MON : 2 x Intel 2630L (6x2GHz), 32GB RAM and Intel DC SSD, 2x10Gbits LACP for public net., MTU 9000
2 x MON : VMs (8 cores, 8GB RAM), backed by SSD

Journals are 20GB partitions on SSD

The system is CentOS 7.1 with stock kernel (3.10.0-229.20.1.el7.x86_64). No particular system optimizations.

Ceph is Infernalis from Ceph repository : ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)

[cephadm@cph-adm-01  ~/scripts]$ ceph -s
    cluster 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
     health HEALTH_OK
     monmap e1: 3 mons at {clb-cph-frpar1-mon-02=x.x.x.2:6789/0,clb-cph-frpar2-mon-01=x.x.x.1:6789/0,clb-cph-frpar2-mon-03=x.x.x.3:6789/0}
            election epoch 62, quorum 0,1,2 clb-cph-frpar2-mon-01,clb-cph-frpar1-mon-02,clb-cph-frpar2-mon-03
     osdmap e844: 84 osds: 84 up, 84 in
            flags sortbitwise
      pgmap v111655: 3136 pgs, 3 pools, 3166 GB data, 19220 kobjects
            8308 GB used, 297 TB / 305 TB avail
                3136 active+clean

My ceph.conf :

[global]
fsid = 259f65a3-d6c8-4c90-a9c2-71d4c3c55cce
mon_initial_members = clb-cph-frpar2-mon-01, clb-cph-frpar1-mon-02, clb-cph-frpar2-mon-03
mon_host = x.x.x.1,x.x.x.2,x.x.x.3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public network = 10.25.25.0/24
cluster network = 10.25.26.0/24
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

[osd]
osd journal size = 0
osd mount options xfs = "rw,noatime,inode64,logbufs=8,logbsize=256k"
filestore min sync interval = 5
filestore max sync interval = 15
filestore queue max ops = 2048
filestore queue max bytes = 1048576000
filestore queue committing max ops = 4096
filestore queue committing max bytes = 1048576000
filestore op thread = 32
filestore journal writeahead = true
filestore merge threshold = 40
filestore split multiple = 8

journal max write bytes = 1048576000
journal max write entries = 4096
journal queue max ops = 8092
journal queue max bytes = 1048576000

osd max write size = 512
osd op threads = 16
osd disk threads = 2
osd op num threads per shard = 3
osd op num shards = 10
osd map cache size = 1024
osd max backfills = 1
osd recovery max active = 2

I have set up 2 pools : one for cache with 3x replication in front of an EC pool. At the moment I am only interested in the cache pool, so no promotions/flushes/evictions happen.
(I know, I am using the same set of OSD for hot and cold data, but in my use case they should not be used at the same time.)

I am accessing the cluster via RBD volumes mapped with the kernel module on CentOS 7.1. These volumes are formatted in XFS on the clients.

The journal SSDs seem to perform quite well according to the results of Sebastien Han’s benchmark suggestion (they are Sandisk) :
write: io=22336MB, bw=381194KB/s, iops=95298, runt= 60001msec (this is for numjobs=10)

Here are the rados bench tests :

rados bench -p rbdcache 120 write -b 4K -t 32 --no-cleanup

Total time run:         121.410763
Total writes made:      65357
Write size:             4096
Bandwidth (MB/sec):     2.1
Stddev Bandwidth:       0.597
Max bandwidth (MB/sec): 3.89
Min bandwidth (MB/sec): 0.00781
Average IOPS:           538
Stddev IOPS:            152
Max IOPS:               995
Min IOPS:               2
Average Latency:        0.0594
Stddev Latency:         0.18
Max latency:            2.82
Min latency:            0.00494
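
As a sanity check on the output above, the reported figures can be cross-checked against each other. The sketch below is plain arithmetic in Python with illustrative helper names; the only inputs are numbers from the rados bench run, and Little's law (ops in flight = IOPS x average latency) is used as an approximation, taking the 32 from the `-t 32` option.

```python
def bandwidth_mb_s(iops, write_size_bytes):
    """Bandwidth implied by an IOPS figure and a fixed write size."""
    return iops * write_size_bytes / (1024 * 1024)


def littles_law_iops(in_flight, avg_latency_s):
    """IOPS implied by a constant number of in-flight ops and their average latency."""
    return in_flight / avg_latency_s


# Values copied from the rados bench output.
avg_iops = 538
write_size = 4096
concurrency = 32          # -t 32
avg_latency_s = 0.0594
total_writes = 65357
total_time_s = 121.410763

print(f"bandwidth from IOPS:   {bandwidth_mb_s(avg_iops, write_size):.2f} MB/s")
print(f"IOPS from latency:     {littles_law_iops(concurrency, avg_latency_s):.0f}")
print(f"IOPS from total count: {total_writes / total_time_s:.0f}")
```

All three views agree on roughly 538 IOPS and 2.1 MB/s, so the result is internally consistent: with 32 ops in flight, the limit comes from the ~59 ms average latency per write.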

And the results of the fio test with the following parameters :

[global]
size=8G
runtime=300
ioengine=libaio
invalidate=1
direct=1
sync=1
fsync=1
numjobs=32
rw=randwrite
name=4k-32-1-randwrite-libaio
blocksize=4K
iodepth=1
directory=/mnt/rbd
group_reporting=1

4k-32-1-randwrite-libaio: (groupid=0, jobs=32): err= 0: pid=20442: Wed Dec  2 21:38:30 2015
  write: io=992.11MB, bw=3389.3KB/s, iops=847, runt=300011msec
    slat (usec): min=5, max=4726, avg=40.32, stdev=41.28
    clat (msec): min=2, max=2208, avg=19.35, stdev=74.34
     lat (msec): min=2, max=2208, avg=19.39, stdev=74.34
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
     | 70.00th=[    6], 80.00th=[    7], 90.00th=[   38], 95.00th=[   63],
     | 99.00th=[  322], 99.50th=[  570], 99.90th=[ 1074], 99.95th=[ 1221],
     | 99.99th=[ 1532]
    bw (KB  /s): min=    1, max=  448, per=3.64%, avg=123.48, stdev=102.09
    lat (msec) : 4=30.30%, 10=51.27%, 20=1.71%, 50=9.91%, 100=4.03%
    lat (msec) : 250=1.55%, 500=0.62%, 750=0.33%, 1000=0.16%
  cpu          : usr=0.09%, sys=0.25%, ctx=963114, majf=0, minf=928
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=254206/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=992.11MB, aggrb=3389KB/s, minb=3389KB/s, maxb=3389KB/s, mint=300011msec, maxt=300011msec

Disk stats (read/write):
  rbd0: ios=0/320813, merge=0/10001, ticks=0/5670847, in_queue=5677825, util=100.00%

And a job closer to what the actual workload would be (blocksize=200K, numjobs=16, iodepth=32) :

200k-16-32-randwrite-libaio: (groupid=0, jobs=16): err= 0: pid=4828: Wed Dec  2 18:58:53 2015
  write: io=47305MB, bw=161367KB/s, iops=806, runt=300189msec
    slat (usec): min=17, max=358430, avg=155.11, stdev=2361.49
    clat (msec): min=9, max=3584, avg=613.88, stdev=168.68
     lat (msec): min=10, max=3584, avg=614.04, stdev=168.66
    clat percentiles (msec):
     |  1.00th=[  375],  5.00th=[  469], 10.00th=[  502], 20.00th=[  537],
     | 30.00th=[  553], 40.00th=[  578], 50.00th=[  594], 60.00th=[  603],
     | 70.00th=[  627], 80.00th=[  652], 90.00th=[  701], 95.00th=[  881],
     | 99.00th=[ 1205], 99.50th=[ 1483], 99.90th=[ 2638], 99.95th=[ 2671],
     | 99.99th=[ 2999]
    bw (KB  /s): min=  260, max=18181, per=6.31%, avg=10189.40, stdev=2009.86
    lat (msec) : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.02%, 250=0.08%
    lat (msec) : 500=9.26%, 750=83.21%, 1000=4.09%
  cpu          : usr=0.22%, sys=0.55%, ctx=719279, majf=0, minf=433
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.8%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=242203/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: io=47305MB, aggrb=161367KB/s, minb=161367KB/s, maxb=161367KB/s, mint=300189msec, maxt=300189msec

Disk stats (read/write):
  rbd0: ios=1/287809, merge=0/18393, ticks=50/5887593, in_queue=5887504, util=99.91%
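
The same kind of cross-check can be applied to the 200K job. This is a rough sketch with illustrative helper names, using only numbers from the fio output above and assuming all 16 jobs keep their queue of 32 full (512 writes in flight), which the "IO depths" line suggests.

```python
def iops_from_bandwidth(bw_kb_s, block_kb):
    """IOPS implied by a bandwidth figure and a fixed block size."""
    return bw_kb_s / block_kb


def littles_law_latency_s(in_flight, iops):
    """Average latency implied by a constant number of in-flight ops at a given IOPS."""
    return in_flight / iops


# Values copied from the fio output of the 200K job.
bw_kb_s = 161367
block_kb = 200
numjobs = 16
iodepth = 32
measured_avg_lat_s = 0.61404

iops = iops_from_bandwidth(bw_kb_s, block_kb)
expected_lat_s = littles_law_latency_s(numjobs * iodepth, iops)

print(f"IOPS from bandwidth: {iops:.0f}")
print(f"throughput:          {bw_kb_s / 1024:.1f} MB/s")
print(f"expected latency:    {expected_lat_s * 1000:.0f} ms")
print(f"measured latency:    {measured_avg_lat_s * 1000:.0f} ms")
```

The expected and measured latencies are close, which suggests the ~600 ms latency is mostly queueing: 512 outstanding writes are waiting on a cluster that completes about 800 of them per second.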

The 4k block performance does not interest me so much, but is given as a reference. I am more interested in throughput; either way, the numbers seem quite low.

Let's take IOPS, assuming the spinners can do 50 (4k) synced sustained IOPS (I hope they can do more ^^), we should be around 50x84/3 = 1400 IOPS, which is far from rados bench (538) and fio (847). And surprisingly fio numbers are greater than rados.

So I don't know whether I am missing something here or if something is going wrong (maybe both!).

Any input would be very valuable.

Thank you,

Adrien

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
