Re: cephfs, low performances

On 20 December 2015 at 19:23, Francois Lafont <flafdivers@xxxxxxx> wrote:
On 20/12/2015 22:51, Don Waterloo wrote:

> All nodes have 10Gbps to each other

Even the link between the client node and the cluster nodes?

> OSD:
> $ ceph osd tree
> ID WEIGHT  TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 5.48996 root default
> -2 0.89999     host nubo-1
>  0 0.89999         osd.0         up  1.00000          1.00000
> -3 0.89999     host nubo-2
>  1 0.89999         osd.1         up  1.00000          1.00000
> -4 0.89999     host nubo-3
>  2 0.89999         osd.2         up  1.00000          1.00000
> -5 0.92999     host nubo-19
>  3 0.92999         osd.3         up  1.00000          1.00000
> -6 0.92999     host nubo-20
>  4 0.92999         osd.4         up  1.00000          1.00000
> -7 0.92999     host nubo-21
>  5 0.92999         osd.5         up  1.00000          1.00000
>
> Each contains 1 x Samsung 850 Pro 1TB SSD (on sata)
>
> Each are Ubuntu 15.10 running 4.3.0-040300-generic kernel.
> Each are running ceph 0.94.5-0ubuntu0.15.10.1
>
> nubo-1/nubo-2/nubo-3 are 2x X5650 @ 2.67GHz w/ 96GB ram.
> nubo-19/nubo-20/nubo-21 are 2x E5-2699 v3 @ 2.30GHz, w/ 576GB ram.
>
> the connections are to the chipset sata in each case.
> The fio test to the underlying xfs disk
> (e.g. cd /var/lib/ceph/osd/ceph-1; fio --randrepeat=1 --ioengine=libaio
> --direct=1 --gtod_reduce=1 --name=readwrite --filename=rw.data --bs=4k
> --iodepth=64 --size=5000MB --readwrite=randrw --rwmixread=50)
> shows ~22K IOPS on each disk.
>
> nubo-1/2/3 are also the mon and the mds:
> $ ceph status
>     cluster b23abffc-71c4-4464-9449-3f2c9fbe1ded
>      health HEALTH_OK
>      monmap e1: 3 mons at {nubo-1=
> 10.100.10.60:6789/0,nubo-2=10.100.10.61:6789/0,nubo-3=10.100.10.62:6789/0}
>             election epoch 1104, quorum 0,1,2 nubo-1,nubo-2,nubo-3
>      mdsmap e621: 1/1/1 up {0=nubo-3=up:active}, 2 up:standby
>      osdmap e2459: 6 osds: 6 up, 6 in
>       pgmap v127331: 840 pgs, 6 pools, 144 GB data, 107 kobjects
>             289 GB used, 5332 GB / 5622 GB avail
>                  840 active+clean
>   client io 0 B/s rd, 183 kB/s wr, 54 op/s

And you have "replica size == 3" in your cluster, correct?
Do you have any specific mount options, or any ceph-fuse-related options in ceph.conf?
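
For example, something like this should show the replica size of each pool:

$ ceph osd dump | grep 'replicated size'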

So the hardware configuration of your cluster seems to me globally much better
than mine (config given in my first message), because you have 10Gb links
(between the client and the cluster I have just 1Gb) and you have all-SSD OSDs.

I have tried to put _all_ of cephfs on my SSD: ie the pools "cephfsdata" _and_
"cephfsmetadata" are both on the SSD. Performance is slightly improved, because
I now get ~670 IOPS (with the fio command of my first message again), but it
still seems bad to me.
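
In case it helps to see the details, a pool can be pointed at an SSD-only CRUSH
rule with something like this (assuming ruleset 1 is the SSD rule; the actual
id may differ):

$ ceph osd pool set cephfsdata crush_ruleset 1
$ ceph osd pool set cephfsmetadata crush_ruleset 1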

In fact, I'm curious to hear the opinion of "cephfs" experts about what IOPS
we can expect. Maybe ~700 IOPS is actually correct for our hardware
configuration and we are searching for a problem which doesn't exist...

All nodes are interconnected on 10G (actually 8x10G, so 80Gbps, but I have 7 of the 8 links disabled for this test). I have done an iperf test over TCP and verified I can achieve ~9Gbps between each pair of nodes. I have jumbo frames enabled (so 9000 MTU, 8982 route MTU).
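
Roughly, the checks were along these lines (the peer host and interface names here are just examples):

$ iperf -s                        # on one node
$ iperf -c nubo-2                 # from another node: ~9 Gbit/s
$ ip link show eth0 | grep mtu    # confirms mtu 9000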

I have replica 2.

My 2 cephfs pools are:

pool 12 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 2239 flags hashpspool stripe_width 0
pool 13 'cephfs_data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 2243 flags hashpspool crash_replay_interval 45 stripe_width 0
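
(That is from the osd dump; something like "ceph osd dump | grep cephfs" shows the same lines.)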

With ceph-fuse, I used the defaults except that I added noatime.
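
So the mount is essentially something like this (the mount point is just an example):

$ ceph-fuse -m 10.100.10.60:6789 /mnt/cephfs -o noatime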

My ceph.conf is:

[global]
fsid = XXXX
mon_initial_members = nubo-2, nubo-3, nubo-1
mon_host = 10.100.10.61,10.100.10.62,10.100.10.60
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_pool_default_size = 2
public_network = 10.100.10.0/24
osd op threads = 6
osd disk threads = 6

[mon]
    mon clock drift allowed = .600
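
If it helps, the running values can be confirmed on an OSD host with something like:

$ ceph daemon osd.0 config show | grep 'osd_op_threads\|osd_disk_threads'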
 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
