Re: Ceph Cluster Failures

"Robin H. Johnson" <robbat2@xxxxxxxxxx> · Thu, 16 Mar 2017 02:44:29 +0000

On Thu, Mar 16, 2017 at 02:22:08AM +0000, Rich Rocque wrote:
> Has anyone else run into this or have any suggestions on how to remedy it?
We need a LOT more info.

> After a couple months of almost no issues, our Ceph cluster has
> started to have frequent failures. Just this week it's failed about
> three times.
>
> The issue appears to be that an MDS or Monitor will fail and then all
> clients hang. After that, all clients need to be forcibly restarted.
- Can you define monitor 'failing' in this case? 
- What do the logs contain? 
- Is it running out of memory?
- Can you turn up the debug level?
- Has your cluster experienced continual growth and now might be
  undersized in some regard?
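
On the out-of-memory question, one quick check is whether the kernel OOM
killer has been hitting the Ceph daemons. A minimal sketch in Python follows.
It assumes the kernel log uses the usual "Kill process"/"Killed process"
wording and that the daemons appear under names starting with "ceph-"; the
function name find_oom_kills is made up for illustration.

```python
import re

# Matches the "Kill process <pid> (<name>)" and "Killed process <pid> (<name>)"
# wording the kernel OOM killer is assumed to log.
OOM_RE = re.compile(r"Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(lines, prefix="ceph-"):
    """Return (pid, name) for every OOM-killer line naming a matching process.

    `prefix` is an assumption: ceph-mon, ceph-mds and ceph-osd all start
    with "ceph-".
    """
    hits = []
    for line in lines:
        match = OOM_RE.search(line)
        if match and match.group(2).startswith(prefix):
            hits.append((int(match.group(1)), match.group(2)))
    return hits
```

Feed it the lines from `dmesg` or the kernel log on each MON/MDS/OSD host. An
empty result does not rule out memory pressure, it only means the OOM killer
did not log a kill for those names.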

> The architecture for our setup is:
Are these virtual machines? The overall specs look more like VM
instances than physical hardware.

> 3 ea MON, MDS instances (co-located) on 2cpu, 4GB RAM servers
What sort of SSD are the monitor datastores on? ('mon data' in the
config)

> 12 ea OSDs (ssd), on 1cpu, 1GB RAM servers
12 SSDs to a single server, with 1cpu/1GB RAM? That's absurdly low-spec.
How many OSD servers, what SSDs?
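
To put a rough number on "low-spec", here is a back-of-the-envelope sketch.
The two constants are assumptions taken from the rule-of-thumb figures in
Ceph's hardware recommendations (about 500MB per OSD daemon in normal
operation, about 1GB per TB of storage per daemon during recovery), not
anything measured on this cluster, and tb_per_osd is a placeholder since the
SSD sizes have not been given.

```python
# ASSUMED rule-of-thumb figures, not values from this thread.
STEADY_GB_PER_OSD = 0.5   # ~500MB per OSD daemon in normal operation
RECOVERY_GB_PER_TB = 1.0  # ~1GB per TB of storage per daemon during recovery


def osd_host_ram_gb(osds_per_host, tb_per_osd):
    """Return (steady_state_gb, recovery_gb) RAM estimates for one OSD host."""
    steady = osds_per_host * STEADY_GB_PER_OSD
    recovery = osds_per_host * tb_per_osd * RECOVERY_GB_PER_TB
    # Recovery should never be estimated below the steady-state figure.
    return steady, max(steady, recovery)
```

If all 12 OSDs really sit on one 1GB host, osd_host_ram_gb(12, 1.0) gives a
steady-state estimate of 6GB before recovery is even considered. If it is one
OSD per host, the same rule still puts a 1TB OSD at the 1GB mark during
recovery, leaving nothing for the OS.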

What is the network setup & connectivity between them (hopefully
10Gbit)?

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Trustee & Treasurer
E-Mail   : robbat2@xxxxxxxxxx
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com