Hi Greg,

the behaviour is indeed strange. Today I tried to reproduce the problem, but no matter which monitor I restarted, and no matter how many times, the behaviour was as expected: a new monitor election was called and everything continued normally.

Then I continued my failover tests and simulated the failure of two racks with iptables (for us: 2 MONs and 6 OSD servers with 360 OSDs in total).

Afterwards I tried again to restart one monitor, and again about 240 OSDs got marked as down.

There was no load on our monitor servers during that period. On one of the OSDs which got marked down I found lots of these messages:

2015-02-04 11:55:22.788245 7fc48fa48700 0 -- 10.76.70.4:6997/17094790 >> 10.76.70.8:6806/3303244 pipe(0x7a1b600 sd=198 :59766 s=2 pgs=1353 cs=1 l=0 c=0x4e562c0).fault with nothing to send, going to standby
2015-02-04 11:55:22.788371 7fc48be0c700 0 -- 10.76.70.4:6997/17094790 >> 10.76.70.8:6842/12012876 pipe(0x895e840 sd=188 :49283 s=2 pgs=36873 cs=1 l=0 c=0x13226f20).fault with nothing to send, going to standby
2015-02-04 11:55:22.788458 7fc494e9c700 0 -- 10.76.70.4:6997/17094790 >> 10.76.70.13:6870/13021609 pipe(0x13ace2c0 sd=117 :64130 s=2 pgs=38011 cs=1 l=0 c=0x52b4840).fault with nothing to send, going to standby
2015-02-04 11:55:22.797107 7fc46459d700 0 -- 10.76.70.4:0/94790 >> 10.76.70.11:6980/37144571 pipe(0xba0c580 sd=30 :0 s=1 pgs=0 cs=0 l=1 c=0x4e51600).fault
2015-02-04 11:55:22.799350 7fc482d7d700 0 -- 10.76.70.4:6997/17094790 >> 10.76.70.10:6887/30410592 pipe(0x6a0cb00 sd=271 :53090 s=2 pgs=15372 cs=1 l=0 c=0xf3a6f20).fault with nothing to send, going to standby
2015-02-04 11:55:22.800018 7fc46429a700 0 -- 10.76.70.4:0/94790 >> 10.76.28.41:7076/37144571 pipe(0xba0c840 sd=59 :0 s=1 pgs=0 cs=0 l=1 c=0xf339760).fault
2015-02-04 11:55:22.803086 7fc482272700 0 -- 10.76.70.4:6997/17094790 >> 10.76.70.5:6867/17011547 pipe(0x12f998c0 sd=294 :6997 s=2 pgs=46095 cs=1 l=0 c=0x8382000).fault with nothing to send, going to standby
2015-02-04 11:55:22.804736 7fc4892e1700 0 -- 10.76.70.4:6997/17094790 >> 10.76.70.13:6852/9142109 pipe(0x12fa5b80 sd=163 :57056 s=2 pgs=45269 cs=1 l=0 c=0x189d1600).fault with nothing to send, going to standby

The IPs mentioned there all belong to OSD servers.

To me it feels like the monitors still have some "memory" of the failed OSDs, and something happens when one of them goes down.

If I can provide any more information to clarify the issue, just tell me what you need.

Regards,
Christian

On 03.02.2015 18:10, Gregory Farnum wrote:
> On Tue, Feb 3, 2015 at 3:38 AM, Christian Eichelmann
> <christian.eichelmann@xxxxxxxx> wrote:
>> Hi all,
>>
>> during some failover tests and some configuration tests, we currently
>> discover a strange phenomenon:
>>
>> Restarting one of our monitors (5 in sum) triggers about 300 of the
>> following events:
>>
>> osd.669 10.76.28.58:6935/149172 failed (20 reports from 20 peers after
>> 22.005858 >= grace 20.000000)
>>
>> The osds come back up shortly after they have been marked down. What I
>> don't understand is: How can a restart of one monitor prevent the osds
>> from talking to each other and marking them down?
>>
>> FYI:
>> We are currently using the following settings:
>> mon osd adjust hearbeat grace = false
>> mon osd min down reporters = 20
>> mon osd adjust down out interval = false
>
> That's really strange. I think maybe you're seeing some kind of
> secondary effect; what kind of CPU usage are you seeing on the
> monitors during this time? Have you checked the log on any OSDs which
> have been marked down?
>
> I have a suspicion that maybe the OSDs are detecting their failed
> monitor connection and not being able to reconnect to another monitor
> quickly enough, but I'm not certain what the overlaps are there.
> -Greg
>

--
Christian Eichelmann
System Administrator

1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
Brauerstraße 48 · DE-76135 Karlsruhe
Phone: +49 721 91374-8026
christian.eichelmann@xxxxxxxx

District Court (Amtsgericht) Montabaur / HRB 6484
Management Board: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
Chairman of the Supervisory Board: Michael Scheeren

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
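
To check which peers appear in the ".fault" messages of an OSD log, a tally per peer IP can help. Below is a minimal Python sketch, assuming the log lines have exactly the layout of the excerpt in this message. The regular expression is derived only from those lines, and the function name count_fault_peers is hypothetical, not part of any Ceph tool.

```python
import re
from collections import Counter

# Assumption: a fault line contains ">> <peer_ip>:<port>/<nonce> pipe(...).fault",
# as in the log excerpt quoted in this message.
PEER_RE = re.compile(r">> (\d+\.\d+\.\d+\.\d+):\d+/\d+ pipe\(.*\)\.fault")


def count_fault_peers(lines):
    """Return a Counter mapping peer IP -> number of fault lines seen."""
    counts = Counter()
    for line in lines:
        match = PEER_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Three lines copied from the excerpt, used here as sample input.
sample = [
    "2015-02-04 11:55:22.788245 7fc48fa48700 0 -- 10.76.70.4:6997/17094790 >> "
    "10.76.70.8:6806/3303244 pipe(0x7a1b600 sd=198 :59766 s=2 pgs=1353 cs=1 "
    "l=0 c=0x4e562c0).fault with nothing to send, going to standby",
    "2015-02-04 11:55:22.788371 7fc48be0c700 0 -- 10.76.70.4:6997/17094790 >> "
    "10.76.70.8:6842/12012876 pipe(0x895e840 sd=188 :49283 s=2 pgs=36873 cs=1 "
    "l=0 c=0x13226f20).fault with nothing to send, going to standby",
    "2015-02-04 11:55:22.797107 7fc46459d700 0 -- 10.76.70.4:0/94790 >> "
    "10.76.70.11:6980/37144571 pipe(0xba0c580 sd=30 :0 s=1 pgs=0 cs=0 l=1 "
    "c=0x4e51600).fault",
]

if __name__ == "__main__":
    for ip, n in count_fault_peers(sample).most_common():
        print(ip, n)
```

Feeding it the lines of a full OSD log instead of the sample list would show whether the faulting peers are concentrated on the hosts in the two racks that were blocked with iptables.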