Hi,

We have also seen this once after upgrading to 0.80.8 (from dumpling).
Last week we had a network outage which marked out around 1/3rd of our
OSDs. The outage lasted less than a minute -- all the OSDs were brought
up once the network was restored.

Then 30 minutes later I restarted one monitor to roll out a small config
change (changing the leveldb log path). Surprisingly, that resulted in
many OSDs (but seemingly fewer than before) being marked out again, then
quickly marked in again.

I only have the lowest level logs from this incident -- but I think it's
easily reproducible.

Cheers, Dan

On Wed, Feb 4, 2015 at 12:06 PM, Christian Eichelmann
<christian.eichelmann@xxxxxxxx> wrote:
> Hi Greg,
>
> the behaviour is indeed strange. Today I was trying to reproduce the
> problem, but no matter which monitor I restarted, no matter how many
> times, the behaviour was as expected: a new monitor election was called
> and everything continued normally.
>
> Then I continued my failover tests and simulated the failure of two
> racks with iptables (for us: 2 MONs and 6 OSD servers with 360 OSDs in
> total).
>
> Afterwards I tried again to restart one monitor, and again about 240 OSDs
> got marked as down.
>
> There was no load on our monitor servers in that period.
> On one of the OSDs which got marked down I found lots of these messages:
>
> 2015-02-04 11:55:22.788245 7fc48fa48700 0 -- 10.76.70.4:6997/17094790 >> 10.76.70.8:6806/3303244 pipe(0x7a1b600 sd=198 :59766 s=2 pgs=1353 cs=1 l=0 c=0x4e562c0).fault with nothing to send, going to standby
> 2015-02-04 11:55:22.788371 7fc48be0c700 0 -- 10.76.70.4:6997/17094790 >> 10.76.70.8:6842/12012876 pipe(0x895e840 sd=188 :49283 s=2 pgs=36873 cs=1 l=0 c=0x13226f20).fault with nothing to send, going to standby
> 2015-02-04 11:55:22.788458 7fc494e9c700 0 -- 10.76.70.4:6997/17094790 >> 10.76.70.13:6870/13021609 pipe(0x13ace2c0 sd=117 :64130 s=2 pgs=38011 cs=1 l=0 c=0x52b4840).fault with nothing to send, going to standby
> 2015-02-04 11:55:22.797107 7fc46459d700 0 -- 10.76.70.4:0/94790 >> 10.76.70.11:6980/37144571 pipe(0xba0c580 sd=30 :0 s=1 pgs=0 cs=0 l=1 c=0x4e51600).fault
> 2015-02-04 11:55:22.799350 7fc482d7d700 0 -- 10.76.70.4:6997/17094790 >> 10.76.70.10:6887/30410592 pipe(0x6a0cb00 sd=271 :53090 s=2 pgs=15372 cs=1 l=0 c=0xf3a6f20).fault with nothing to send, going to standby
> 2015-02-04 11:55:22.800018 7fc46429a700 0 -- 10.76.70.4:0/94790 >> 10.76.28.41:7076/37144571 pipe(0xba0c840 sd=59 :0 s=1 pgs=0 cs=0 l=1 c=0xf339760).fault
> 2015-02-04 11:55:22.803086 7fc482272700 0 -- 10.76.70.4:6997/17094790 >> 10.76.70.5:6867/17011547 pipe(0x12f998c0 sd=294 :6997 s=2 pgs=46095 cs=1 l=0 c=0x8382000).fault with nothing to send, going to standby
> 2015-02-04 11:55:22.804736 7fc4892e1700 0 -- 10.76.70.4:6997/17094790 >> 10.76.70.13:6852/9142109 pipe(0x12fa5b80 sd=163 :57056 s=2 pgs=45269 cs=1 l=0 c=0x189d1600).fault with nothing to send, going to standby
>
> The IPs mentioned there are all OSD servers.
>
> To me it feels like the monitors still have some "memory" of the
> failed OSDs, and something happens when one of them goes down. If I
> can provide any more information to clarify the issue, just tell me
> what you need.
>
> Regards,
> Christian
>
> On 03.02.2015 18:10, Gregory Farnum wrote:
>> On Tue, Feb 3, 2015 at 3:38 AM, Christian Eichelmann
>> <christian.eichelmann@xxxxxxxx> wrote:
>>> Hi all,
>>>
>>> during some failover tests and some configuration tests, we have
>>> discovered a strange phenomenon:
>>>
>>> Restarting one of our monitors (5 in total) triggers about 300 of the
>>> following events:
>>>
>>> osd.669 10.76.28.58:6935/149172 failed (20 reports from 20 peers after 22.005858 >= grace 20.000000)
>>>
>>> The OSDs come back up shortly after they have been marked down. What I
>>> don't understand is: how can a restart of one monitor prevent the OSDs
>>> from talking to each other and cause them to be marked down?
>>>
>>> FYI:
>>> We are currently using the following settings:
>>> mon osd adjust heartbeat grace = false
>>> mon osd min down reporters = 20
>>> mon osd adjust down out interval = false
>>
>> That's really strange. I think maybe you're seeing some kind of
>> secondary effect; what kind of CPU usage are you seeing on the
>> monitors during this time? Have you checked the log on any OSDs which
>> have been marked down?
>>
>> I have a suspicion that maybe the OSDs are detecting their failed
>> monitor connection and not being able to reconnect to another monitor
>> quickly enough, but I'm not certain what the overlaps are there.
>> -Greg
>>
>
>
> --
> Christian Eichelmann
> System Administrator
>
> 1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
> Brauerstraße 48 · DE-76135 Karlsruhe
> Phone: +49 721 91374-8026
> christian.eichelmann@xxxxxxxx
>
> District Court Montabaur / HRB 6484
> Management Board: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
> Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr.
> Oliver Mauss, Jan Oetjen
> Chairman of the Supervisory Board: Michael Scheeren
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
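
For anyone trying to reproduce this, it may help to count how many of the "failed (N reports from M peers ...)" events a single monitor restart triggers, and which OSD hosts they hit. Below is a minimal sketch in Python that does this. It assumes the cluster log lines look exactly like the osd.669 message quoted in this thread. The function names and the log path in the usage note are hypothetical and not part of Ceph.

```python
import re
from collections import Counter

# Pattern for the failure report as quoted in this thread, e.g.
# "osd.669 10.76.28.58:6935/149172 failed (20 reports from 20 peers
#  after 22.005858 >= grace 20.000000)".
# Assumption: other Ceph releases may word this message differently.
FAILED_RE = re.compile(
    r"(?P<osd>osd\.\d+) (?P<addr>\S+) failed "
    r"\((?P<reports>\d+) reports from (?P<peers>\d+) peers "
    r"after (?P<elapsed>[\d.]+) >= grace (?P<grace>[\d.]+)\)"
)


def parse_failures(lines):
    """Yield one dict per 'failed' event found in the given log lines."""
    for line in lines:
        m = FAILED_RE.search(line)
        if m is None:
            continue  # not a failure report, skip it
        yield {
            "osd": m["osd"],
            # The address has the form ip:port/nonce; keep only the IP.
            "host": m["addr"].split(":")[0],
            "reports": int(m["reports"]),
            "peers": int(m["peers"]),
            "elapsed": float(m["elapsed"]),
            "grace": float(m["grace"]),
        }


def failures_per_host(lines):
    """Count failure events per OSD host IP."""
    return Counter(event["host"] for event in parse_failures(lines))
```

To use it, open the cluster log on a monitor host (the path depends on your setup, so treat any path as an assumption) and pass the file object to failures_per_host(). Comparing the counts from a restart before and after a simulated rack failure should show whether only the previously failed hosts are affected.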