On Thu, 13 Oct 2016, Henrik Korkuc wrote:
> from status page it seems that Ceph didn't like networking problems. May we
> find out some details what happened? Underprovisioned servers (RAM upgrades
> were in there too)? Too much load on disks? Something else?
>
> This situation may be not pleasant but I feel that others can learn from it to
> prevent such situations in the future.

Yep. These VMs were backed by an old ceph cluster and the cluster fell
over after a switch failed. Because it's a beta cluster that's due to be
decommissioned shortly it wasn't upgraded from firefly. And because it's
old the PGs were mistuned (way too many) and machines were underprovisioned
on RAM (32GB for 12 OSDs; normally probably enough but not on a very large
cluster with 1000+ OSDs and too many PGs). It fell into the somewhat
familiar pattern of OSDs OOMing because of large OSDMaps due to a degraded
cluster.

The recovery was a bit tedious (tune osdmap caches way down, get all OSDs
to catch up on maps and rejoin cluster) but it's a procedure that's been
described on this list before. Once the core issue was identified it came
back pretty quickly.

Had the nodes had more RAM or had the PG counts been better tuned it would
have been avoided, and had the cluster been upgraded it *might* have been
avoided (hammer+ is more memory efficient, and newer versions have lower
default map cache sizes).

This was one of the very first large-scale clusters we ever built, so
we've learned quite a bit since then. :)

sage

>
> On 16-10-13 06:55, Dan Mick wrote:
> > Everything should have been back some time ago (0000 UTC or thereabouts)
> >
> > On 10/11/2016 10:41 PM, Brian :: wrote:
> > > Looks like they are having major challenges getting that ceph cluster
> > > running again.. Still down.
> > >
> > > On Tuesday, October 11, 2016, Ken Dreyer <kdreyer@xxxxxxxxxx> wrote:
> > > > I think this may be related:
> > > >
> > > > http://www.dreamhoststatus.com/2016/10/11/dreamcompute-us-east-1-cluster-service-disruption/
> > > >
> > > > On Tue, Oct 11, 2016 at 5:57 AM, Sean Redmond <sean.redmond1@xxxxxxxxx> wrote:
> > > > > Hi,
> > > > >
> > > > > Looks like the ceph website and related sub domains are giving errors
> > > > > for the last few hours.
> > > > >
> > > > > I noticed the below that I use are in scope.
> > > > >
> > > > > http://ceph.com/
> > > > > http://docs.ceph.com/
> > > > > http://download.ceph.com/
> > > > > http://tracker.ceph.com/
> > > > >
> > > > > Thanks
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
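
For anyone who wants to put rough numbers on the memory pressure described in the reply at the top of this thread (32GB for 12 OSDs, OSDs OOMing on large OSDMaps), here is a minimal back-of-the-envelope sketch. Only the 32GB and 12-OSD figures come from the thread. The function names, the 4 MiB size per cached map, and the cache depths of 500 and 50 epochs are hypothetical values chosen for illustration, not measurements from this cluster and not quoted Ceph defaults.

```python
def per_osd_ram_gib(node_ram_gib: float, osds_per_node: int) -> float:
    """RAM available to each OSD if node memory is split evenly."""
    return node_ram_gib / osds_per_node


def map_cache_gib(cached_epochs: int, map_size_mib: float) -> float:
    """Rough memory held by one OSD's cache of full OSDMaps.

    Assumes every cached epoch costs about map_size_mib; this ignores
    incremental maps, PG state, and everything else an OSD keeps in RAM.
    """
    return cached_epochs * map_size_mib / 1024


if __name__ == "__main__":
    budget = per_osd_ram_gib(32, 12)  # figures from the thread

    # Hypothetical: 4 MiB per map on a large cluster, two cache depths.
    for epochs in (500, 50):
        cache = map_cache_gib(epochs, 4)
        share = cache / budget
        print(f"{epochs} cached maps: {cache:.2f} GiB, "
              f"{share:.0%} of a {budget:.2f} GiB per-OSD budget")
```

Under these assumed numbers the map cache alone can take most of an OSD's share of RAM, which is consistent with why tuning the osdmap caches down was the first recovery step. Real map sizes vary with OSD and PG counts, so measure on your own cluster before drawing conclusions.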