On Thu, Feb 5, 2015 at 9:54 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Thu, 5 Feb 2015, Dan van der Ster wrote:
>> Hi,
>> We also have seen this once after upgrading to 0.80.8 (from dumpling).
>> Last week we had a network outage which marked out around 1/3rd of our
>> OSDs. The outage lasted less than a minute -- all the OSDs were
>> brought up once the network was restored.
>>
>> Then 30 minutes later I restarted one monitor to roll out a small
>> config change (changing the leveldb log path). Surprisingly, that
>> resulted in many OSDs (but seemingly fewer than before) being marked
>> out again, then quickly marked in again.
>
> Did the 'wrongly marked down' messages appear in ceph.log?
>
>> I only have the lowest-level logs from this incident -- but I think
>> it's easily reproducible.
>
> Logs with debug ms = 1 and debug mon = 20 would be best if someone is
> able to reproduce this.

I can reproduce this by using iptables to kill the network for 60s on one
of our OSD hosts. Here are the logs with ms=1, mon=20:

https://www.dropbox.com/s/vdzl005n2qiwlee/ceph.log.gz?dl=0
https://www.dropbox.com/s/to26i8k11vp9t8k/ceph-mon.0.log.gz?dl=0
https://www.dropbox.com/s/j5e3rujs7qjouzh/ceph-mon.2.log.gz?dl=0

The badness happens after mon.2 is restarted:

2015-02-05 10:54:31.456887 mon.0 128.142.35.220:6789/0 602775 : [INF] osd.20 128.142.23.53:6850/57083 failed (3 reports from 3 peers after 41.616656 >= grace 38.742061)
2015-02-05 10:54:31.457036 mon.0 128.142.35.220:6789/0 602776 : [INF] osd.21 128.142.23.53:6870/50055 failed (5 reports from 4 peers after 39.614710 >= grace 39.553689)
2015-02-05 10:54:31.457092 mon.0 128.142.35.220:6789/0 602777 : [INF] osd.22 128.142.23.53:6831/45065 failed (5 reports from 4 peers after 45.615582 >= grace 42.927456)

Cheers, Dan

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
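
A minimal sketch of the reproduction recipe Dan describes above. The port
range, the injectargs form, and the service command are assumptions about a
default firefly (0.80.x) sysvinit deployment, not details taken from the
thread:

    # Raise the logging Sage asked for on one monitor (debug ms = 1,
    # debug mon = 20); setting these under [mon] in ceph.conf and
    # restarting the monitor works as well.
    ceph tell mon.0 injectargs '--debug-mon 20 --debug-ms 1'

    # On one OSD host, drop Ceph OSD traffic in both directions for 60s
    # (6800:7300 is the default OSD port range), then remove the rules.
    iptables -A INPUT  -p tcp -m multiport --ports 6800:7300 -j DROP
    iptables -A OUTPUT -p tcp -m multiport --ports 6800:7300 -j DROP
    sleep 60
    iptables -D INPUT  -p tcp -m multiport --ports 6800:7300 -j DROP
    iptables -D OUTPUT -p tcp -m multiport --ports 6800:7300 -j DROP

    # Wait for the failed OSDs to be marked back in, then restart one
    # monitor on its host (sysvinit command is an assumption):
    service ceph restart mon.2

If the behaviour in the thread holds, the spurious "failed" / wrongly marked
down messages should appear in ceph.log shortly after the restarted monitor
rejoins, as in the excerpt above.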