Re: Monitor Restart triggers half of our OSDs marked down

On Thu, 5 Feb 2015, Dan van der Ster wrote:
> On Thu, Feb 5, 2015 at 9:54 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > On Thu, 5 Feb 2015, Dan van der Ster wrote:
> >> Hi,
> >> We have also seen this once, after upgrading to 0.80.8 (from dumpling).
> >> Last week we had a network outage which caused around a third of our
> >> OSDs to be marked out. The outage lasted less than a minute -- all the
> >> OSDs were brought back up once the network was restored.
> >>
> >> Then 30 minutes later I restarted one monitor to roll out a small
> >> config change (changing the leveldb log path). Surprisingly, that
> >> resulted in many OSDs (though seemingly fewer than before) being
> >> marked out again and then quickly marked back in.
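
For anyone following along, the change being rolled out would look roughly
like the sketch below. The option name, log path, daemon id, and sysvinit
restart syntax are illustrative assumptions, not Dan's actual configuration.

    # ceph.conf on the monitor host -- option name and path are assumed,
    # shown only to illustrate the kind of change being rolled out
    [mon]
        mon leveldb log = /var/log/ceph/ceph-mon.leveldb.log

    # restart only the monitor that received the new config
    # (daemon id and sysvinit-style service command are assumptions)
    service ceph restart mon.2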
> >
> > Did the 'wrongly marked down' messages appear in ceph.log?
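
A quick way to check for those is a grep over the cluster log, roughly as
below; the exact message wording can differ slightly between releases.

    # look for OSDs reporting to the cluster log that they were
    # wrongly marked down
    grep -i "wrongly marked" /var/log/ceph/ceph.log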
> >
> >> I only have the lowest level logs from this incident -- but I think
> >> it's easily reproducible.
> >
> > Logs with debug ms = 1 and debug mon = 20 would be best if someone is able
> > to reproduce this.
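
For reference, a minimal sketch of how those debug levels can be set --
either in ceph.conf before restarting the mons, or injected into a running
monitor (the mon id used here is illustrative):

    # ceph.conf on the monitor hosts, followed by a mon restart
    [mon]
        debug ms = 1
        debug mon = 20

    # or inject at runtime on a running monitor
    ceph tell mon.0 injectargs '--debug-ms 1 --debug-mon 20'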
> 
> I can reproduce this using iptables to cut the network for 60s on one of
> our OSD hosts. Here are the logs with ms=1 mon=20:
>   https://www.dropbox.com/s/vdzl005n2qiwlee/ceph.log.gz?dl=0
>   https://www.dropbox.com/s/to26i8k11vp9t8k/ceph-mon.0.log.gz?dl=0
>   https://www.dropbox.com/s/j5e3rujs7qjouzh/ceph-mon.2.log.gz?dl=0
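
The reproduction step amounts to something like the following on one OSD
host. The port range and the use of DROP rules are assumptions about how
the outage was simulated, not Dan's exact commands:

    # block Ceph traffic on this OSD host for ~60s, then restore it
    # (6789 = mon, 6800-7300 = typical OSD port range; adjust as needed)
    iptables -I INPUT  -p tcp -m multiport --dports 6789,6800:7300 -j DROP
    iptables -I OUTPUT -p tcp -m multiport --dports 6789,6800:7300 -j DROP
    sleep 60
    iptables -D INPUT  -p tcp -m multiport --dports 6789,6800:7300 -j DROP
    iptables -D OUTPUT -p tcp -m multiport --dports 6789,6800:7300 -j DROP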
> 
> The badness happens after mon.2 is restarted:
> 
> 2015-02-05 10:54:31.456887 mon.0 128.142.35.220:6789/0 602775 : [INF]
> osd.20 128.142.23.53:6850/57083 failed (3 reports from 3 peers after
> 41.616656 >= grace 38.742061)
> 2015-02-05 10:54:31.457036 mon.0 128.142.35.220:6789/0 602776 : [INF]
> osd.21 128.142.23.53:6870/50055 failed (5 reports from 4 peers after
> 39.614710 >= grace 39.553689)
> 2015-02-05 10:54:31.457092 mon.0 128.142.35.220:6789/0 602777 : [INF]
> osd.22 128.142.23.53:6831/45065 failed (5 reports from 4 peers after
> 45.615582 >= grace 42.927456)

Yep, it's a silly bug and I'm surprised we haven't noticed it until now!

	http://tracker.ceph.com/issues/10762
	https://github.com/ceph/ceph/pull/3631

Thanks!
sage
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



