On Thu, Feb 5, 2015 at 9:54 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Thu, 5 Feb 2015, Dan van der Ster wrote:
>> Hi,
>> We have also seen this once, after upgrading to 0.80.8 (from dumpling).
>> Last week we had a network outage which marked out around 1/3rd of our
>> OSDs. The outage lasted less than a minute -- all the OSDs were
>> brought up once the network was restored.
>>
>> Then 30 minutes later I restarted one monitor to roll out a small
>> config change (changing the leveldb log path). Surprisingly, that
>> resulted in many OSDs (but seemingly fewer than before) being marked
>> out again, then quickly marked in again.
>
> Did the 'wrongly marked down' messages appear in ceph.log?

Yes. Here's the start of the real outage:

2015-01-29 10:59:52.452367 mon.0 128.142.35.220:6789/0 417 : [INF] pgmap v35845354: 24608 pgs: 10 active+clean+scrubbing+deep, 24596 active+clean, 2 active+clean+scrubbing; 125 TB data, 377 TB used, 2021 TB / 2399 TB avail; 5137 kB/s rd, 31239 kB/s wr, 641 op/s
2015-01-29 10:59:52.618591 mon.0 128.142.35.220:6789/0 431 : [INF] osd.1129 128.142.23.100:6912/90010 failed (3 reports from 3 peers after 20.000159 >= grace 20.000000)
2015-01-29 10:59:52.939018 mon.0 128.142.35.220:6789/0 479 : [INF] osd.1080 128.142.23.111:6869/124575 failed (3 reports from 3 peers after 20.000181 >= grace 20.000000)
2015-01-29 10:59:53.147616 mon.0 128.142.35.220:6789/0 510 : [INF] osd.1113 128.142.23.107:6810/25247 failed (3 reports from 3 peers after 20.525957 >= grace 20.000000)
2015-01-29 10:59:53.342428 mon.0 128.142.35.220:6789/0 538 : [INF] osd.1136 128.142.23.100:6864/86061 failed (3 reports from 3 peers after 20.403032 >= grace 20.000000)
2015-01-29 10:59:53.342557 mon.0 128.142.35.220:6789/0 540 : [INF] osd.1144 128.142.23.100:6883/91968 failed (3 reports from 3 peers after 20.403104 >= grace 20.000000)

Then many "wrongly marked me down" messages around a minute later, when
the network came back.
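(Aside: these events are easy to pull out of a rotated log in one pass --
a sketch, assuming the compressed logrotate naming used for the file
linked below:

    zgrep -E 'failed \(.* reports from|wrongly marked me down' ceph.log-20150130.gz

which lists each failure the mon acted on next to the OSDs' rebuttals.)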
But then when I restarted the (peon) monitor:

2015-01-29 11:29:18.250750 mon.0 128.142.35.220:6789/0 10570 : [INF] pgmap v35847068: 24608 pgs: 1 active+clean+scrubbing+deep, 24602 active+clean, 5 active+clean+scrubbing; 125 TB data, 377 TB used, 2021 TB / 2399 TB avail; 193 MB/s rd, 238 MB/s wr, 7410 op/s
2015-01-29 11:29:28.844678 mon.3 128.142.39.77:6789/0 1 : [INF] mon.2 calling new monitor election
2015-01-29 11:29:33.846946 mon.2 128.142.36.229:6789/0 9 : [INF] mon.4 calling new monitor election
2015-01-29 11:29:33.847022 mon.4 128.142.39.144:6789/0 7 : [INF] mon.3 calling new monitor election
2015-01-29 11:29:33.847085 mon.1 128.142.36.227:6789/0 24 : [INF] mon.1 calling new monitor election
2015-01-29 11:29:33.853498 mon.3 128.142.39.77:6789/0 2 : [INF] mon.2 calling new monitor election
2015-01-29 11:29:33.895660 mon.0 128.142.35.220:6789/0 10860 : [INF] mon.0 calling new monitor election
2015-01-29 11:29:33.901335 mon.0 128.142.35.220:6789/0 10861 : [INF] mon.0@0 won leader election with quorum 0,1,2,3,4
2015-01-29 11:29:34.004028 mon.0 128.142.35.220:6789/0 10862 : [INF] monmap e5: 5 mons at {0=128.142.35.220:6789/0,1=128.142.36.227:6789/0,2=128.142.39.77:6789/0,3=128.142.39.144:6789/0,4=128.142.36.229:6789/0}
2015-01-29 11:29:34.005808 mon.0 128.142.35.220:6789/0 10863 : [INF] pgmap v35847069: 24608 pgs: 1 active+clean+scrubbing+deep, 24602 active+clean, 5 active+clean+scrubbing; 125 TB data, 377 TB used, 2021 TB / 2399 TB avail; 54507 kB/s rd, 85412 kB/s wr, 1967 op/s
2015-01-29 11:29:34.006111 mon.0 128.142.35.220:6789/0 10864 : [INF] mdsmap e157: 1/1/1 up {0=0=up:active}
2015-01-29 11:29:34.007165 mon.0 128.142.35.220:6789/0 10865 : [INF] osdmap e132055: 880 osds: 880 up, 880 in
2015-01-29 11:29:34.037367 mon.0 128.142.35.220:6789/0 11055 : [INF] osd.1202 128.142.23.104:6801/98353 failed (4 reports from 3 peers after 29.673699 >= grace 28.948726)
2015-01-29 11:29:34.050478 mon.0 128.142.35.220:6789/0 11139 : [INF] osd.1164 128.142.23.102:6850/22486 failed (3 reports from 2 peers after 30.685537 >= grace 28.946983)

and then just after:

2015-01-29 11:29:35.210184 osd.1202 128.142.23.104:6801/98353 59 : [WRN] map e132056 wrongly marked me down
2015-01-29 11:29:35.441922 osd.1164 128.142.23.102:6850/22486 25 : [WRN] map e132056 wrongly marked me down

plus many other OSDs like that.

Here is the full ceph.log from that day:
https://www.dropbox.com/s/kzb2e3h3uirhejm/ceph.log-20150130.gz?dl=0

And here is ceph-mon.1.log:
https://www.dropbox.com/s/w1mx8rcflffp5e8/ceph-mon.0.log-20150130.gz?dl=0

It has a bit more info (prepare_failure, 'ignoring stats from non-active
osd'), but alas I had only debug mon = 1.

Cheers, Dan

>> I only have the lowest level logs from this incident -- but I think
>> it's easily reproducible.
>
> Logs with debug ms = 1 and debug mon = 20 would be best if someone is
> able to reproduce this. I think our QA wouldn't capture this case
> because we are either thrashing monitors and not OSDs (so the OSD is
> never marked down) or we are thrashing OSDs and not mons (and ignoring
> wrongly-marked-down events).
>
> sage
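(For anyone trying to reproduce this with the logging Sage asks for: the
levels can be raised on a running cluster -- a sketch, repeated for each
mon id, assuming a client with an admin keyring:

    ceph tell mon.0 injectargs '--debug-mon 20 --debug-ms 1'

or persistently, in ceph.conf before restarting the mons:

    [mon]
        debug mon = 20
        debug ms = 1

debug ms = 1 is chatty, so remember to turn both back down afterwards.)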
>> Cheers, Dan
>>
>> On Wed, Feb 4, 2015 at 12:06 PM, Christian Eichelmann
>> <christian.eichelmann@xxxxxxxx> wrote:
>> > Hi Greg,
>> >
>> > The behaviour is indeed strange. Today I was trying to reproduce the
>> > problem, but no matter which monitor I restarted, no matter how many
>> > times, the behaviour was as expected: a new monitor election was
>> > called and everything continued normally.
>> >
>> > Then I continued my failover tests and simulated the failure of two
>> > racks with iptables (for us: 2 MONs and 6 OSD servers with 360 OSDs
>> > in total).
>> >
>> > Afterwards I tried again to restart one monitor, and again about 240
>> > OSDs got marked down.
>> >
>> > There was no load on our monitor servers in that period. On one of
>> > the OSDs which got marked down I found lots of these messages:
>> >
>> > 2015-02-04 11:55:22.788245 7fc48fa48700 0 -- 10.76.70.4:6997/17094790 >> 10.76.70.8:6806/3303244 pipe(0x7a1b600 sd=198 :59766 s=2 pgs=1353 cs=1 l=0 c=0x4e562c0).fault with nothing to send, going to standby
>> > 2015-02-04 11:55:22.788371 7fc48be0c700 0 -- 10.76.70.4:6997/17094790 >> 10.76.70.8:6842/12012876 pipe(0x895e840 sd=188 :49283 s=2 pgs=36873 cs=1 l=0 c=0x13226f20).fault with nothing to send, going to standby
>> > 2015-02-04 11:55:22.788458 7fc494e9c700 0 -- 10.76.70.4:6997/17094790 >> 10.76.70.13:6870/13021609 pipe(0x13ace2c0 sd=117 :64130 s=2 pgs=38011 cs=1 l=0 c=0x52b4840).fault with nothing to send, going to standby
>> > 2015-02-04 11:55:22.797107 7fc46459d700 0 -- 10.76.70.4:0/94790 >> 10.76.70.11:6980/37144571 pipe(0xba0c580 sd=30 :0 s=1 pgs=0 cs=0 l=1 c=0x4e51600).fault
>> > 2015-02-04 11:55:22.799350 7fc482d7d700 0 -- 10.76.70.4:6997/17094790 >> 10.76.70.10:6887/30410592 pipe(0x6a0cb00 sd=271 :53090 s=2 pgs=15372 cs=1 l=0 c=0xf3a6f20).fault with nothing to send, going to standby
>> > 2015-02-04 11:55:22.800018 7fc46429a700 0 -- 10.76.70.4:0/94790 >> 10.76.28.41:7076/37144571 pipe(0xba0c840 sd=59 :0 s=1 pgs=0 cs=0 l=1 c=0xf339760).fault
>> > 2015-02-04 11:55:22.803086 7fc482272700 0 -- 10.76.70.4:6997/17094790 >> 10.76.70.5:6867/17011547 pipe(0x12f998c0 sd=294 :6997 s=2 pgs=46095 cs=1 l=0 c=0x8382000).fault with nothing to send, going to standby
>> > 2015-02-04 11:55:22.804736 7fc4892e1700 0 -- 10.76.70.4:6997/17094790 >> 10.76.70.13:6852/9142109 pipe(0x12fa5b80 sd=163 :57056 s=2 pgs=45269 cs=1 l=0 c=0x189d1600).fault with nothing to send, going to standby
>> >
>> > The IPs mentioned there are all OSD servers.
>> >
>> > For me it feels like the monitors still have some "memory" of the
>> > failed OSDs, and something happens when one of them goes down. If I
>> > can provide you any more information to clarify the issue, just tell
>> > me what you need.
>> >
>> > Regards,
>> > Christian
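(For reference, a rack failure like the one Christian describes is
usually simulated with something along these lines -- a sketch, with a
placeholder subnet standing in for the failed racks; run it on the hosts
that should lose sight of them:

    # drop all traffic to and from the "failed" racks (placeholder subnet)
    iptables -A INPUT -s 10.76.70.0/24 -j DROP
    iptables -A OUTPUT -d 10.76.70.0/24 -j DROP

Deleting the rules again with -D and the same arguments "heals" the
network.)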
>> >
>> > On 03.02.2015 18:10, Gregory Farnum wrote:
>> >> On Tue, Feb 3, 2015 at 3:38 AM, Christian Eichelmann
>> >> <christian.eichelmann@xxxxxxxx> wrote:
>> >>> Hi all,
>> >>>
>> >>> During some failover and configuration tests, we discovered a
>> >>> strange phenomenon:
>> >>>
>> >>> Restarting one of our monitors (5 in total) triggers about 300 of
>> >>> the following events:
>> >>>
>> >>> osd.669 10.76.28.58:6935/149172 failed (20 reports from 20 peers after 22.005858 >= grace 20.000000)
>> >>>
>> >>> The OSDs come back up shortly after they have been marked down.
>> >>> What I don't understand is: how can a restart of one monitor keep
>> >>> the OSDs from talking to each other, so that they mark each other
>> >>> down?
>> >>>
>> >>> FYI, we are currently using the following settings:
>> >>>
>> >>> mon osd adjust heartbeat grace = false
>> >>> mon osd min down reporters = 20
>> >>> mon osd adjust down out interval = false
>> >>
>> >> That's really strange. I think maybe you're seeing some kind of
>> >> secondary effect; what kind of CPU usage are you seeing on the
>> >> monitors during this time? Have you checked the logs on any OSDs
>> >> which have been marked down?
>> >>
>> >> I have a suspicion that maybe the OSDs are detecting their failed
>> >> monitor connection and not being able to reconnect to another
>> >> monitor quickly enough, but I'm not certain what the overlaps are
>> >> there.
>> >> -Greg
>> >
>> > --
>> > Christian Eichelmann
>> > Systemadministrator
>> >
>> > 1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
>> > Brauerstraße 48 · DE-76135 Karlsruhe
>> > Phone: +49 721 91374-8026
>> > christian.eichelmann@xxxxxxxx
>> >
>> > Amtsgericht Montabaur / HRB 6484
>> > Executive Board: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich,
>> > Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss,
>> > Jan Oetjen
>> > Chairman of the Supervisory Board: Michael Scheeren
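(Two follow-ups worth checking if Greg's reconnect theory holds. First,
confirm the running mons actually applied the settings Christian lists --
a sketch via the admin socket, assuming the default socket path and a mon
id of "0":

    ceph --admin-daemon /var/run/ceph/ceph-mon.0.asok config get mon_osd_min_down_reporters

Second, the knob that controls how quickly a daemon hunts for a new
monitor after losing its session is mon_client_hunt_interval; reading it
on an affected OSD (osd id "669" used here only as an example) shows
whether there is room to tighten it:

    ceph --admin-daemon /var/run/ceph/ceph-osd.669.asok config get mon_client_hunt_interval

Both commands have to run on the host where the daemon lives.)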