On Thu, Feb 5, 2015 at 9:54 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Thu, 5 Feb 2015, Dan van der Ster wrote:
>> Hi,
>> We also have seen this once after upgrading to 0.80.8 (from dumpling).
>> Last week we had a network outage which marked out around 1/3rd of our
>> OSDs. The outage lasted less than a minute -- all the OSDs were
>> brought up once the network was restored.
>>
>> Then 30 minutes later I restarted one monitor to roll out a small
>> config change (changing the leveldb log path). Surprisingly, that
>> resulted in many OSDs (but seemingly fewer than before) being marked
>> out again, then quickly marked in again.
>
> Did the 'wrongly marked down' messages appear in ceph.log?
>
>> I only have the lowest-level logs from this incident -- but I think
>> it's easily reproducible.
>
> Logs with debug ms = 1 and debug mon = 20 would be best if someone is
> able to reproduce this.

I can reproduce this by using iptables to kill the network for 60s on one
of our OSD hosts. Here are the logs with ms=1, mon=20:

https://www.dropbox.com/s/vdzl005n2qiwlee/ceph.log.gz?dl=0
https://www.dropbox.com/s/to26i8k11vp9t8k/ceph-mon.0.log.gz?dl=0
https://www.dropbox.com/s/j5e3rujs7qjouzh/ceph-mon.2.log.gz?dl=0

The badness happens after mon.2 is restarted:

2015-02-05 10:54:31.456887 mon.0 128.142.35.220:6789/0 602775 : [INF] osd.20 128.142.23.53:6850/57083 failed (3 reports from 3 peers after 41.616656 >= grace 38.742061)
2015-02-05 10:54:31.457036 mon.0 128.142.35.220:6789/0 602776 : [INF] osd.21 128.142.23.53:6870/50055 failed (5 reports from 4 peers after 39.614710 >= grace 39.553689)
2015-02-05 10:54:31.457092 mon.0 128.142.35.220:6789/0 602777 : [INF] osd.22 128.142.23.53:6831/45065 failed (5 reports from 4 peers after 45.615582 >= grace 42.927456)

Cheers, Dan

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
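
A minimal sketch of the reproduction recipe Dan describes above. The port
range, the injectargs form, and the service command are assumptions about a
default firefly (0.80.x) sysvinit deployment, not details taken from the
thread:

    # Raise the logging Sage asked for on one monitor (debug ms = 1,
    # debug mon = 20); setting these under [mon] in ceph.conf and
    # restarting the monitor works as well.
    ceph tell mon.0 injectargs '--debug-mon 20 --debug-ms 1'

    # On one OSD host, drop Ceph OSD traffic in both directions for 60s
    # (6800:7300 is the default OSD port range), then remove the rules.
    iptables -A INPUT  -p tcp -m multiport --ports 6800:7300 -j DROP
    iptables -A OUTPUT -p tcp -m multiport --ports 6800:7300 -j DROP
    sleep 60
    iptables -D INPUT  -p tcp -m multiport --ports 6800:7300 -j DROP
    iptables -D OUTPUT -p tcp -m multiport --ports 6800:7300 -j DROP

    # Wait for the failed OSDs to be marked back in, then restart one
    # monitor on its host (sysvinit command is an assumption):
    service ceph restart mon.2

If the behaviour in the thread holds, the spurious "failed" / wrongly marked
down messages should appear in ceph.log shortly after the restarted monitor
rejoins, as in the excerpt above.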