Intermittent poor performance on 3-node cluster

Hi all,

I'm using Ceph as the file store for my nginx web servers, in order to have shared storage and redundancy with automatic failover.

The cluster is not high-spec, but given my use case (lots of images) I am very disappointed with the throughput I'm currently getting, and was hoping for some advice.

I'm using CephFS with the latest Dumpling release on Ubuntu Server 12.04.

Server specs:

CephFS1, CephFS2:

Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz
12GB Ram
1x 2TB SATA (XFS, for the OSD data)
1x 2TB SATA (for the journal)

Each server runs 1x OSD, 1x MON and 1x MDS.
A third server runs 1x MON so that the monitors can maintain a quorum (Paxos).
All machines are connected via a gigabit switch.

The Ceph config is as follows:

[global]
fsid = 58b87152-5ce8-491e-ae9c-07caeea3fefb
mon_initial_members = lb1, cephfs1, cephfs2
mon_host = 192.168.1.58,192.168.1.70,192.168.1.72
auth_supported = cephx
osd_journal_size = 1024
filestore_xattr_use_omap = true
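
If it helps, I can also dump the running config straight from an OSD's admin socket next time; something like this (the socket path below assumes the default location):

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep journal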

OSD dump (output of 'ceph osd dump'):

epoch 750
fsid 58b87152-5ce8-491e-ae9c-07caeea3fefb
created 2013-09-12 13:13:02.695411
modified 2013-10-21 14:28:31.780838
flags

pool 0 'data' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0
pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0

max_osd 4
osd.0 up in weight 1 up_from 741 up_thru 748 down_at 739 last_clean_interval [614,738) 192.168.1.70:6802/12325 192.168.1.70:6803/12325 192.168.1.70:6804/12325 192.168.1.70:6805/12325 exists,up d59119d5-bccb-43ea-be64-9d2272605617
osd.1 up in weight 1 up_from 748 up_thru 748 down_at 745 last_clean_interval [20,744) 192.168.1.72:6800/4271 192.168.1.72:6801/4271 192.168.1.72:6802/4271 192.168.1.72:6803/4271 exists,up 930c097a-f68b-4f9c-a6a1-6787a1382a41

pg_temp 0.12 [1,0,3]
pg_temp 0.16 [1,0,3]
pg_temp 0.18 [1,0,3]
pg_temp 1.11 [1,0,3]
pg_temp 1.15 [1,0,3]
pg_temp 1.17 [1,0,3]

During a slowdown the load average on my nginx servers climbs to around 40, and access to the CephFS mount becomes extremely slow. These slowdowns happen about once a week, and I typically resolve them by restarting the MDS.
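
For reference, "restarting the MDS" just means bouncing the ceph-mds daemon; depending on how the daemons were deployed it is one of the following (the ID here is just my hostname):

service ceph restart mds.cephfs1     # sysvinit-style install
restart ceph-mds id=cephfs1          # upstart-style install (e.g. ceph-deploy on Ubuntu)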

When the cluster gets slow I see the following in my logs:

2013-10-21 14:33:54.079191 7f6301e10700 0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for > 30.281651 secs
2013-10-21 14:33:54.079200 7f6301e10700 0 log [WRN] : slow request 30.281651 seconds old, received at 2013-10-21 14:33:23.797488: osd_op(mds.0.8:16266 100004094c4.00000000 [tmapup 0~0] 1.91102783 e750) v4 currently commit sent
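
Next time it slows down I can also grab more detail, e.g. the cluster health and the in-flight ops from the OSD admin socket (again assuming the default socket path):

ceph health detail
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight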

Any advice? Would increasing pg_num for the data and metadata pools help (rough commands below)? Would moving the MDS to a host that does not also run an OSD be greatly beneficial?
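
If increasing the PG count is the recommended route, I assume it would be something along these lines for each pool, where the target of 128 is just a guess from the ~100 PGs per OSD rule of thumb ((2 OSDs x 100) / 2 replicas = 100, rounded up to the next power of two):

ceph osd pool set data pg_num 128
ceph osd pool set data pgp_num 128
ceph osd pool set metadata pg_num 128
ceph osd pool set metadata pgp_num 128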

Please let me know if you need more info.

Thank you,
Pieter



