I've just stood up a Ceph cluster for some experimentation. Unfortunately, we're having some performance and stability problems I'm trying to pin down. More unfortunately, I'm new to Ceph, so I'm not sure where to start looking for the problem.
Under activity, we'll get monitors going into election cycles repeatedly, OSDs being "wrongly marked down", as well as slow requests ("osd.11 39.7.48.6:6833/21938 failed (3 reports from 1 peers after 52.914693 >= grace 20.000000)"). During this, ceph -w shows the cluster essentially idle, and none of the network, disks, or CPUs ever appear to max out. It also doesn't appear to be the same OSDs, MONs, or node causing the problem each time. top reports all 128 GB of RAM in use (negligible swap) on the storage nodes, and only Ceph is running on them.

We've configured 4 nodes for storage and have connected 2 identical nodes to this cluster that access the cluster storage over the kernel RBD driver. MONs are configured on the first three storage nodes.

The nodes we're using are Dell R720xd:
- 2x 1TB spinners configured in RAID for the OS
- 12x 4TB spinners for OSDs (3.5 TB XFS + 10 GB journal partition on each disk)
- 2x Xeon E5-2620 CPUs (/proc/cpuinfo reports 24 cores)
- 128 GB RAM
- Two networks (public + cluster), both over InfiniBand

Software: SLES 11 SP3, with some in-house patching (3.0.1 kernel, "ceph-client" backported from 3.10)
Ceph version: ceph-0.80.5-0.9.2, packaged by SUSE

Our ceph.conf is pretty simple (as is our configuration, I think):

fsid = c216d502-5179-49b8-9b6c-ffc2cdd29374
mon initial members = tvsaq1
mon host = 39.7.48.6
cluster network = 39.64.0.0/12
public network = 39.0.0.0/12
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 9000
filestore xattr use omap = true
osd crush update on start = false
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 4096
osd pool default pgp num = 4096

What sort of performance should we be getting out of a setup like this? Any help would be appreciated, and I'd be happy to provide whatever logs, config files, etc. are needed. I'm sure we're doing something wrong, but I don't know what it is.

Bill
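P.S. If it helps narrow things down, I can capture and post output from something like the commands below. The osd.11 and tvsaq1 names are just taken from the error message and config above, the admin socket paths are the defaults, and I'm assuming our RBD images live in the default "rbd" pool.

# overall health, including which requests are currently slow or stuck
ceph health detail

# per-OSD commit/apply latency, to see whether a few disks are lagging behind the rest
ceph osd perf

# recent slow ops on one OSD, via its admin socket on the node that hosts it
# (osd.11 is just the OSD from the error above)
ceph --admin-daemon /var/run/ceph/ceph-osd.11.asok dump_historic_ops

# monitor quorum/election state during one of the election storms
ceph --admin-daemon /var/run/ceph/ceph-mon.tvsaq1.asok mon_status

# raw RADOS write/read throughput from one of the client nodes
# (assumes a pool named "rbd"; 30-second runs, objects kept so the seq read has data)
rados bench -p rbd 30 write --no-cleanup
rados bench -p rbd 30 seq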