Hi all,
I'm using Ceph as a filestore for my nginx web server, in order to have
shared storage, and redundancy with automatic failover.
The cluster is not high spec, but given my use case (serving lots of images) I
am very disappointed with the throughput I'm getting, and was hoping for
some advice.
I'm using CephFS with the latest Dumpling release on Ubuntu Server 12.04.
Server specs:
CephFS1, CephFS2:
Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz
12GB Ram
1x 2TB SATA, XFS (OSD data)
1x 2TB SATA (OSD journal)
Each server runs 1x OSD, 1x MON and 1x MDS.
A third server runs 1x MON for Paxos to work correctly.
All machines are connected via a gigabit switch.
The Ceph config is as follows:
[global]
fsid = 58b87152-5ce8-491e-ae9c-07caeea3fefb
mon_initial_members = lb1, cephfs1, cephfs2
mon_host = 192.168.1.58,192.168.1.70,192.168.1.72
auth_supported = cephx
osd_journal_size = 1024
filestore_xattr_use_omap = true
OSD dump:
epoch 750
fsid 58b87152-5ce8-491e-ae9c-07caeea3fefb
created 2013-09-12 13:13:02.695411
modified 2013-10-21 14:28:31.780838
flags
pool 0 'data' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins
pg_num 64 pgp_num 64 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash
rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0
pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins
pg_num 64 pgp_num 64 last_change 1 owner 0
max_osd 4
osd.0 up in weight 1 up_from 741 up_thru 748 down_at 739
last_clean_interval [614,738) 192.168.1.70:6802/12325
192.168.1.70:6803/12325 192.168.1.70:6804/12325 192.168.1.70:6805/12325
exists,up d59119d5-bccb-43ea-be64-9d2272605617
osd.1 up in weight 1 up_from 748 up_thru 748 down_at 745
last_clean_interval [20,744) 192.168.1.72:6800/4271
192.168.1.72:6801/4271 192.168.1.72:6802/4271 192.168.1.72:6803/4271
exists,up 930c097a-f68b-4f9c-a6a1-6787a1382a41
pg_temp 0.12 [1,0,3]
pg_temp 0.16 [1,0,3]
pg_temp 0.18 [1,0,3]
pg_temp 1.11 [1,0,3]
pg_temp 1.15 [1,0,3]
pg_temp 1.17 [1,0,3]
During these slowdowns the load on my nginx servers climbs to around 40, and
access to the CephFS mount becomes incredibly slow. They happen about once a
week, and I typically clear them by restarting the MDS.
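For reference, this is roughly what I run to restart it (assuming the upstart
jobs from ceph-deploy, with the MDS id matching the hostname):

sudo stop ceph-mds id=cephfs1
sudo start ceph-mds id=cephfs1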
When the cluster gets slow I see the following in my logs:
2013-10-21 14:33:54.079200 7f6301e10700 0 log [WRN] : slow request
30.281651 seconds old, received at 2013-10-21 14:33:23.797488:
osd_op(mds.0.8:16266 100004094c4.00000000 [tmapup 0~0] 1.91102783 e750)
v4 currently commit sent
2013-10-21 14:33:54.079191 7f6301e10700 0 log [WRN] : 6 slow requests, 6
included below; oldest blocked for > 30.281651 secs
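If it would help, I can capture the in-flight ops from the OSD admin socket the
next time this happens, with something along these lines (assuming the default
socket path on my nodes):

sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight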
Any advice would be appreciated. Would increasing pg_num for the data and
metadata pools help? Would moving the MDS to a host that does not also run an
OSD make a noticeable difference?
Please let me know if you need more info.
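If raw numbers would help, I can also run a quick benchmark against the data
pool and report back; I assume something like this would give a usable
baseline:

rados -p data bench 30 write -t 16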
Thank you,
Pieter