Re: osdmap several thousand epochs behind latest

> Op 10 juli 2017 om 2:06 schreef Chris Apsey <bitskrieg@xxxxxxxxxxxxx>:
> 
> 
> All,
> 
> Had a fairly substantial network interruption that knocked out about
> 270 OSDs:
> 
>       health HEALTH_ERR
>              [...]
>              273/384 in osds are down
>              noup,nodown,noout flag(s) set
>       monmap e2: 3 mons at {cephmon-0=10.10.6.0:6789/0,cephmon-1=10.10.6.1:6789/0,cephmon-2=10.10.6.2:6789/0}
>              election epoch 138, quorum 0,1,2 cephmon-0,cephmon-1,cephmon-2
>          mgr no daemons active
>       osdmap e37718: 384 osds: 111 up, 384 in; 16764 remapped pgs
>              flags noup,nodown,noout,sortbitwise,require_jewel_osds,require_kraken_osds
> 
> We've had network interruptions before, and normally OSDs come back on
> their own, or do so after a service restart.  This time, no such luck
> (I'm guessing the scale was just too much).  After a few hours of trying
> to figure out why OSD services were running on the hosts (according to
> systemd) but were marked 'down' in ceph osd tree, I found this thread:
> http://ceph-devel.vger.kernel.narkive.com/ftEN7TOU/70-osd-are-down-and-not-coming-up
> which appears to describe the scenario perfectly (high CPU usage, osdmap
> way out of sync, etc.).
> 
> I've taken the steps outlined there, set the appropriate flags, and am
> monitoring the catch-up progress of the OSDs.  The OSD farthest behind
> is about 5000 epochs out of sync, so I assume it will be a few hours
> before CPU usage levels out.
> 
> Once the OSDs are caught up, are there any other steps I should take 
> before 'ceph osd unset noup' (or anything to do after)?
> 

Probably not. Once they catch up they should be fine and join the cluster again.
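As a minimal sketch (assuming the same noup/nodown/noout flags shown in the status output above), the unwind once all OSDs have caught up could look like this:

    # allow OSDs to be marked up again
    ceph osd unset noup
    # once everything is up and peering has settled, drop the remaining flags
    ceph osd unset nodown
    ceph osd unset noout

Unsetting noup first and watching 'ceph -s' before removing nodown/noout keeps the safety net in place while the OSDs rejoin.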

If you query their status over the admin socket you should see them go from 'booting' to 'active'.
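For example (osd.12 and the socket path are just placeholders; run this on the host where that OSD lives and adjust for your install), something like this shows the per-OSD state and how far its map is behind:

    # per-OSD view via the admin socket: 'state' plus oldest_map/newest_map epochs
    ceph daemon osd.12 status
    # or point at the socket directly
    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok status
    # compare newest_map against the cluster's current osdmap epoch
    ceph osd stat

While an OSD is still replaying old maps its newest_map will lag the epoch reported by 'ceph osd stat'; once the two match and 'state' reads 'active', that OSD has caught up.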

Wido

> Thanks in advance,
> 
> -- 
> v/r
> 
> Chris Apsey
> bitskrieg@xxxxxxxxxxxxx
> https://www.bitskrieg.net