On Thu, Mar 16, 2017 at 02:22:08AM +0000, Rich Rocque wrote:
> Has anyone else run into this or have any suggestions on how to remedy it?
We need a LOT more info.

> After a couple months of almost no issues, our Ceph cluster has
> started to have frequent failures. Just this week it's failed about
> three times.
>
> The issue appears to be than an MDS or Monitor will fail and then all
> clients hang. After that, all clients need to be forcibly restarted.
- Can you define monitor 'failing' in this case?
- What do the logs contain?
- Is it running out of memory?
- Can you turn up the debug level?
- Has your cluster experienced continual growth and now might be
  undersized in some regard?

> The architecture for our setup is:
Are these virtual machines? The overall specs look like VM instances
rather than hardware.

> 3 ea MON, MDS instances (co-located) on 2cpu, 4GB RAM servers
What sort of SSD are the monitor datastores on? ('mon data' in the config)

> 12 ea OSDs (ssd), on 1cpu, 1GB RAM servers
12 SSDs to a single server, with 1cpu/1GB RAM? That's absurdly low-spec.
How many OSD servers, what SSDs?
What is the network setup & connectivity between them (hopefully 10Gbit)?

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Trustee & Treasurer
E-Mail   : robbat2@xxxxxxxxxx
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136