Re: ceph website problems?

Sage Weil <sage@xxxxxxxxxxxx> · Thu, 13 Oct 2016 16:34:17 +0000 (UTC)

On Thu, 13 Oct 2016, Henrik Korkuc wrote:
> From the status page it seems that Ceph didn't like the networking problems.
> May we find out some details about what happened? Underprovisioned servers
> (RAM upgrades were in there too)? Too much load on disks? Something else?
> 
> This situation may not be pleasant, but I feel that others can learn from it
> to prevent such situations in the future.

Yep.

These VMs were backed by an old ceph cluster, and the cluster fell over
after a switch failed.  Because it's a beta cluster that's due to be
decommissioned shortly, it wasn't upgraded from firefly.  And because it's
old, the PGs were mistuned (way too many) and the machines were
underprovisioned on RAM (32GB for 12 OSDs; normally that is probably enough,
but not on a very large cluster with 1000+ OSDs and too many PGs).  It fell
into the somewhat familiar pattern of OSDs OOMing because of large OSDMaps
due to a degraded cluster.
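
A rough sketch of the arithmetic behind "too many PGs" and "underprovisioned
on RAM".  It assumes the commonly cited guideline of roughly 100 PG replicas
per OSD and rounds up to a power of two; the function names and the pool
figures in the usage note are hypothetical.  Only the 32GB / 12 OSDs / 1000
OSDs numbers come from the description above.

```python
def ram_per_osd_gb(host_ram_gb, osds_per_host):
    """RAM available to each OSD if a host's memory is split evenly."""
    return host_ram_gb / osds_per_host


def pg_replicas_per_osd(pool_pg_counts, pool_sizes, num_osds):
    """Average number of PG replicas each OSD carries across all pools."""
    total = sum(pg * size for pg, size in zip(pool_pg_counts, pool_sizes))
    return total / num_osds


def suggested_total_pgs(num_osds, replica_size, target_per_osd=100):
    """Guideline PG total: OSDs * target / replicas, rounded up to 2**n.

    target_per_osd=100 is an assumed rule of thumb, not a hard limit.
    """
    raw = num_osds * target_per_osd / replica_size
    power = 1
    while power < raw:
        power *= 2
    return power
```

For example, `ram_per_osd_gb(32, 12)` gives a little under 2.7 GB per OSD, and
a hypothetical 1000-OSD cluster with 3x replication would land on 65536 PGs
under this guideline.  Comparing `pg_replicas_per_osd(...)` for the real pools
against the target shows how far over a cluster is.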

The recovery was a bit tedious (tune osdmap caches way down, get all OSDs 
to catch up on maps and rejoin cluster) but it's a procedure that's been 
described on this list before.  Once the core issue was identified it came 
back pretty quickly.
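
To illustrate why shrinking the map cache helps, here is a deliberately crude
model: each OSD is assumed to hold up to `cache_size` full maps of `map_mib`
MiB each.  The map size and cache values are made-up numbers, and the model
ignores incremental maps and any sharing between daemons.  `osd map cache
size` is the ceph.conf option being tuned; the helper names are hypothetical.

```python
def cached_map_memory_gib(cache_size, map_mib, osds_per_host):
    """Rough upper bound on host RAM held by cached full OSDMaps (GiB)."""
    return cache_size * map_mib * osds_per_host / 1024


def render_conf(cache_size):
    """Render a ceph.conf fragment that lowers the OSD map cache size."""
    return "[osd]\nosd map cache size = {}\n".format(cache_size)


if __name__ == "__main__":
    # Illustrative values only: 4 MiB maps, 12 OSDs per host.
    for size in (500, 50):
        gib = cached_map_memory_gib(size, map_mib=4, osds_per_host=12)
        print("cache size {:>3}: up to {:.1f} GiB per host".format(size, gib))
    print(render_conf(50))
```

Under these assumed numbers, a cache of 500 maps could pin most of a 32GB
host, while a cache of 50 leaves room for the OSDs to catch up on maps.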

Had the nodes had more RAM or had the PG counts been better tuned it would 
have been avoided, and had the cluster been upgraded it *might* have been 
avoided (hammer+ is more memory efficient, and newer versions have lower 
default map cache sizes).

This was one of the very first large-scale clusters we ever built, so 
we've learned quite a bit since then.  :)

sage

> 
> On 16-10-13 06:55, Dan Mick wrote:
> > Everything should have been back some time ago (0000 UTC or thereabouts)
> > 
> > On 10/11/2016 10:41 PM, Brian :: wrote:
> > > Looks like they are having major challenges getting that ceph cluster
> > > running again. Still down.
> > > 
> > > On Tuesday, October 11, 2016, Ken Dreyer <kdreyer@xxxxxxxxxx> wrote:
> > > > I think this may be related:
> > > > 
> > > > http://www.dreamhoststatus.com/2016/10/11/dreamcompute-us-east-1-cluster-service-disruption/
> > > > On Tue, Oct 11, 2016 at 5:57 AM, Sean Redmond <sean.redmond1@xxxxxxxxx> wrote:
> > > > > Hi,
> > > > > 
> > > > > Looks like the ceph website and related subdomains have been giving
> > > > > errors for the last few hours.
> > > > > 
> > > > > I noticed that the sites below, which I use, are affected:
> > > > > 
> > > > > http://ceph.com/
> > > > > http://docs.ceph.com/
> > > > > http://download.ceph.com/
> > > > > http://tracker.ceph.com/
> > > > > 
> > > > > Thanks
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com