Monitors repeatedly calling for new elections

I've just stood up a Ceph cluster for some experimentation.  Unfortunately, we're hitting performance and stability problems that I'm trying to pin down, and since I'm new to Ceph I'm not sure where to start looking.

Under load, the monitors repeatedly go into election cycles, OSDs get "wrongly marked down", and we see slow request warnings along with failure reports like "osd.11 39.7.48.6:6833/21938 failed (3 reports from 1 peers after 52.914693 >= grace 20.000000)".  During this, ceph -w shows the cluster essentially idle.  None of the network, disks, or CPUs ever appear to max out, and it isn't consistently the same OSDs, MONs, or node causing the problem.  top reports all 128 GB of RAM in use on the storage nodes (negligible swap), and only Ceph is running on them.
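In case it helps, these are the sorts of commands I can run and post output from (mon.tvsaq1 is the mon id from our config; adjust as needed):

    ceph -s                # overall cluster and PG state
    ceph health detail     # expands the slow request / down OSD warnings
    ceph quorum_status     # current quorum members and election epoch
    # Query a mon directly over its admin socket; works even while quorum is flapping
    ceph --admin-daemon /var/run/ceph/ceph-mon.tvsaq1.asok mon_status
    ceph osd tree          # which OSDs are up/down and where they live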

We've configured 4 nodes for storage and connected 2 identical nodes to the cluster, which access the storage over the kernel RBD driver.  MONs are configured on the first three storage nodes.
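For context, the clients map images with the standard kernel RBD flow, along these lines (the image and pool names here are placeholders, not our real ones):

    rbd create rbd/test --size 102400    # 100 GB image in the default 'rbd' pool
    rbd map rbd/test                     # kernel driver exposes it as /dev/rbd0
    mkfs.xfs /dev/rbd0
    mount /dev/rbd0 /mnt/test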

The nodes we're using are Dell R720xd:

2x1TB spinners configured in RAID for the OS
12x4TB spinners for OSDs (3.5 TB XFS + 10 GB journal partition on each disk; see the sketch after this list)
2x Xeon E5-2620 CPU (/proc/cpuinfo reports 24 cores)
128GB RAM
Two networks (public+cluster), both over infiniband
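Each OSD disk is laid out roughly as below (device names are illustrative; I can post the real partition tables if useful):

    parted -s /dev/sdb mklabel gpt
    parted -s /dev/sdb mkpart journal 1MiB 10GiB   # ~10 GB raw journal partition
    parted -s /dev/sdb mkpart data 10GiB 100%      # remaining ~3.5 TB for the OSD
    mkfs.xfs -f /dev/sdb2                          # XFS on the data partition only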

Software:
SLES 11 SP3, with some in-house patching (3.0.1 kernel, "ceph-client" backported from 3.10)
Ceph version: ceph-0.80.5-0.9.2, packaged by SUSE

Our ceph.conf is pretty simple (as is our configuration, I think):
fsid = c216d502-5179-49b8-9b6c-ffc2cdd29374
mon initial members = tvsaq1
mon host = 39.7.48.6

cluster network = 39.64.0.0/12
public network = 39.0.0.0/12
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 9000
filestore xattr use omap = true
osd crush update on start = false
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 4096
osd pool default pgp num = 4096
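
For reference, the placement group sizing guideline from the docs is (OSDs x 100) / replicas, rounded up to a power of two.  With our 48 OSDs and size 3:

    (48 * 100) / 3 = 1600  ->  next power of two is 2048

so our per-pool default of 4096 is above that guideline, which may matter if we end up creating several pools.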


What sort of performance should we be getting out of a setup like this?
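If numbers would help, we can run rados bench from one of the clients and post the results, something like (the pool name here is a placeholder):

    rados bench -p testpool 30 write --no-cleanup   # 30s of 4 MB object writes
    rados bench -p testpool 30 seq                  # sequential reads of those objects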

Any help would be appreciated, and I'd be happy to provide whatever logs, config files, etc. are needed.  I'm sure we're doing something wrong, but I don't know what it is.

Bill
