Re: Monitor Restart triggers half of our OSDs marked down

Hi,

I have also seen the same logs on the OSDs, but the steps I followed were:

Setup: 4 OSD nodes (node1, node2, node3, node4).
Each node contains 8 OSDs.

Node1 got rebooted.

But an OSD on node2 went down.

Logs from monitor:

2015-02-02 20:28:28.766087 7fbaabea4700  1 mon.rack1-ram-6@0(leader).osd e248 prepare_failure osd.2 10.242.42.114:6871/25606 from osd.13 10.242.42.110:6835/25881 is reporting failure:1
2015-02-02 20:28:28.766099 7fbaabea4700  0 log_channel(cluster) log [DBG] : osd.2 10.242.42.114:6871/25606 reported failed by osd.13 10.242.42.110:6835/25881
2015-02-02 20:28:32.977235 7fbaabea4700  1 mon.rack1-ram-6@0(leader).osd e248 prepare_failure osd.2 10.242.42.114:6871/25606 from osd.23 10.242.42.112:6810/23707 is reporting failure:1
2015-02-02 20:28:32.977244 7fbaabea4700  0 log_channel(cluster) log [DBG] : osd.2 10.242.42.114:6871/25606 reported failed by osd.23 10.242.42.112:6810/23707
2015-02-02 20:28:33.012054 7fbaabea4700  1 mon.rack1-ram-6@0(leader).osd e248 prepare_failure osd.2 10.242.42.114:6871/25606 from osd.22 10.242.42.112:6822/23764 is reporting failure:1
2015-02-02 20:28:33.012075 7fbaabea4700  0 log_channel(cluster) log [DBG] : osd.2 10.242.42.114:6871/25606 reported failed by osd.22 10.242.42.112:6822/23764
2015-02-02 20:28:33.012087 7fbaabea4700  1 mon.rack1-ram-6@0(leader).osd e248  we have enough reports/reporters to mark osd.2 down
2015-02-02 20:28:33.012098 7fbaabea4700  0 log_channel(cluster) log [INF] : osd.2 10.242.42.114:6871/25606 failed (3 reports from 3 peers after 21.000102 >= grace 20.000000)
2015-02-02 20:28:33.071370 7fbaae0ea700  1 mon.rack1-ram-6@0(leader).osd e249 e249: 32 osds: 23 up, 32 in
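
The last "failed" line shows the mechanism: the monitor only marks an OSD down once enough distinct peers have reported it and the first report is older than the grace period. As a rough illustration only (a simplified Python sketch using the numbers from the log above, not the actual Ceph code):

    from dataclasses import dataclass, field

    @dataclass
    class FailureReports:
        first_report: float                           # time of the first failure report
        reporters: set = field(default_factory=set)   # distinct peers reporting the OSD down

    def should_mark_down(r: FailureReports, now: float,
                         min_reporters: int = 3, grace: float = 20.0) -> bool:
        # Mark the OSD down only when enough distinct peers have reported it
        # and the failure has persisted for at least the grace period.
        return len(r.reporters) >= min_reporters and (now - r.first_report) >= grace

    # Matching the log: osd.2 reported by osd.13, osd.23 and osd.22,
    # 21.000102 seconds after the first report, with grace 20.000000.
    r = FailureReports(first_report=0.0, reporters={"osd.13", "osd.23", "osd.22"})
    print(should_mark_down(r, now=21.000102))         # True -> osd.2 is marked down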

Thanks
Sahana


-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Dan van der Ster
Sent: Thursday, February 05, 2015 2:41 PM
To: Sage Weil
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  Monitor Restart triggers half of our OSDs marked down

On Thu, Feb 5, 2015 at 9:54 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Thu, 5 Feb 2015, Dan van der Ster wrote:
>> Hi,
>> We also have seen this once after upgrading to 0.80.8 (from dumpling).
>> Last week we had a network outage which marked out around 1/3rd of
>> our OSDs. The outage lasted less than a minute -- all the OSDs were
>> brought up once the network was restored.
>>
>> Then 30 minutes later I restarted one monitor to roll out a small
>> config change (changing leveldb log path). Surprisingly that resulted
>> in many OSDs (but seemingly fewer than before) being marked out again
>> then quickly marked in again.
>
> Did the 'wrongly marked down' messages appear in ceph.log?
>

Yes. Here's the start of the real outage:

2015-01-29 10:59:52.452367 mon.0 128.142.35.220:6789/0 417 : [INF] pgmap v35845354: 24608 pgs: 10 active+clean+scrubbing+deep, 24596 active+clean, 2 active+clean+scrubbing; 125 TB data, 377 TB used, 2021 TB / 2399 TB avail; 5137 kB/s rd, 31239 kB/s wr, 641 op/s
2015-01-29 10:59:52.618591 mon.0 128.142.35.220:6789/0 431 : [INF] osd.1129 128.142.23.100:6912/90010 failed (3 reports from 3 peers after 20.000159 >= grace 20.000000)
2015-01-29 10:59:52.939018 mon.0 128.142.35.220:6789/0 479 : [INF] osd.1080 128.142.23.111:6869/124575 failed (3 reports from 3 peers after 20.000181 >= grace 20.000000)
2015-01-29 10:59:53.147616 mon.0 128.142.35.220:6789/0 510 : [INF] osd.1113 128.142.23.107:6810/25247 failed (3 reports from 3 peers after 20.525957 >= grace 20.000000)
2015-01-29 10:59:53.342428 mon.0 128.142.35.220:6789/0 538 : [INF] osd.1136 128.142.23.100:6864/86061 failed (3 reports from 3 peers after 20.403032 >= grace 20.000000)
2015-01-29 10:59:53.342557 mon.0 128.142.35.220:6789/0 540 : [INF] osd.1144 128.142.23.100:6883/91968 failed (3 reports from 3 peers after 20.403104 >= grace 20.000000)

Then many "wrongly marked me down" messages around a minute later when the network came back.

But then when I restarted the (peon) monitor:

2015-01-29 11:29:18.250750 mon.0 128.142.35.220:6789/0 10570 : [INF] pgmap v35847068: 24608 pgs: 1 active+clean+scrubbing+deep, 24602 active+clean, 5 active+clean+scrubbing; 125 TB data, 377 TB used, 2021 TB / 2399 TB avail; 193 MB/s rd, 238 MB/s wr, 7410 op/s
2015-01-29 11:29:28.844678 mon.3 128.142.39.77:6789/0 1 : [INF] mon.2 calling new monitor election
2015-01-29 11:29:33.846946 mon.2 128.142.36.229:6789/0 9 : [INF] mon.4 calling new monitor election
2015-01-29 11:29:33.847022 mon.4 128.142.39.144:6789/0 7 : [INF] mon.3 calling new monitor election
2015-01-29 11:29:33.847085 mon.1 128.142.36.227:6789/0 24 : [INF] mon.1 calling new monitor election
2015-01-29 11:29:33.853498 mon.3 128.142.39.77:6789/0 2 : [INF] mon.2 calling new monitor election
2015-01-29 11:29:33.895660 mon.0 128.142.35.220:6789/0 10860 : [INF] mon.0 calling new monitor election
2015-01-29 11:29:33.901335 mon.0 128.142.35.220:6789/0 10861 : [INF] mon.0@0 won leader election with quorum 0,1,2,3,4
2015-01-29 11:29:34.004028 mon.0 128.142.35.220:6789/0 10862 : [INF] monmap e5: 5 mons at {0=128.142.35.220:6789/0,1=128.142.36.227:6789/0,2=128.142.39.77:6789/0,3=128.142.39.144:6789/0,4=128.142.36.229:6789/0}
2015-01-29 11:29:34.005808 mon.0 128.142.35.220:6789/0 10863 : [INF] pgmap v35847069: 24608 pgs: 1 active+clean+scrubbing+deep, 24602 active+clean, 5 active+clean+scrubbing; 125 TB data, 377 TB used, 2021 TB / 2399 TB avail; 54507 kB/s rd, 85412 kB/s wr, 1967 op/s
2015-01-29 11:29:34.006111 mon.0 128.142.35.220:6789/0 10864 : [INF] mdsmap e157: 1/1/1 up {0=0=up:active}
2015-01-29 11:29:34.007165 mon.0 128.142.35.220:6789/0 10865 : [INF] osdmap e132055: 880 osds: 880 up, 880 in
2015-01-29 11:29:34.037367 mon.0 128.142.35.220:6789/0 11055 : [INF] osd.1202 128.142.23.104:6801/98353 failed (4 reports from 3 peers after 29.673699 >= grace 28.948726)
2015-01-29 11:29:34.050478 mon.0 128.142.35.220:6789/0 11139 : [INF] osd.1164 128.142.23.102:6850/22486 failed (3 reports from 2 peers after 30.685537 >= grace 28.946983)


and then just after:

2015-01-29 11:29:35.210184 osd.1202 128.142.23.104:6801/98353 59 : [WRN] map e132056 wrongly marked me down
2015-01-29 11:29:35.441922 osd.1164 128.142.23.102:6850/22486 25 : [WRN] map e132056 wrongly marked me down

plus many other OSDs like that.

Here is the full ceph.log from that day:
https://www.dropbox.com/s/kzb2e3h3uirhejm/ceph.log-20150130.gz?dl=0
And here is ceph-mon.1.log:
https://www.dropbox.com/s/w1mx8rcflffp5e8/ceph-mon.0.log-20150130.gz?dl=0
It has a bit more info (prepare_failure, 'ignoring stats from non-active osd'), but alas I had only debug mon = 1.
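
If I manage to reproduce this I'll turn up the mon logging beforehand, along the lines of what Sage asks for below -- e.g. something like this in the [mon] section of ceph.conf (just a sketch of the intended settings):

    [mon]
        debug mon = 20
        debug ms = 1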

Cheers, Dan

>> I only have the lowest level logs from this incident -- but I think
>> it's easily reproducible.
>
> Logs with debug ms = 1 and debug mon = 20 would be best if someone is
> able to reproduce this.  I think our QA wouldn't capture this case
> because we are either thrashing monitors and not OSDs (so OSDs are
> never marked down) or thrashing OSDs and not mons (and ignoring
> wrongly-marked-down events).
>
> sage
>
>>
>> Cheers, Dan
>>
>>
>> On Wed, Feb 4, 2015 at 12:06 PM, Christian Eichelmann
>> <christian.eichelmann@xxxxxxxx> wrote:
>> > Hi Greg,
>> >
>> > the behaviour is indeed strange. Today I was trying to reproduce
>> > the problem, but no matter which monitor I restarted, and no matter
>> > how many times, the behaviour was as expected: a new monitor
>> > election was called and everything continued normally.
>> >
>> > Then I continued my failover tests and simulated the failure of two
>> > racks with iptables (for us: 2 MON and 6 OSD servers, with 360 OSDs
>> > in total).
>> >
>> > Afterwards I tried again to restart one monitor and again about 240
>> > OSDs got marked as down.
>> >
>> > There was no load on our monitor servers in that period. On one of
>> > the OSDs which got marked down I found lots of these messages:
>> >
>> > 2015-02-04 11:55:22.788245 7fc48fa48700  0 -- 10.76.70.4:6997/17094790 >> 10.76.70.8:6806/3303244 pipe(0x7a1b600 sd=198 :59766 s=2 pgs=1353 cs=1 l=0 c=0x4e562c0).fault with nothing to send, going to standby
>> > 2015-02-04 11:55:22.788371 7fc48be0c700  0 -- 10.76.70.4:6997/17094790 >> 10.76.70.8:6842/12012876 pipe(0x895e840 sd=188 :49283 s=2 pgs=36873 cs=1 l=0 c=0x13226f20).fault with nothing to send, going to standby
>> > 2015-02-04 11:55:22.788458 7fc494e9c700  0 -- 10.76.70.4:6997/17094790 >> 10.76.70.13:6870/13021609 pipe(0x13ace2c0 sd=117 :64130 s=2 pgs=38011 cs=1 l=0 c=0x52b4840).fault with nothing to send, going to standby
>> > 2015-02-04 11:55:22.797107 7fc46459d700  0 -- 10.76.70.4:0/94790 >> 10.76.70.11:6980/37144571 pipe(0xba0c580 sd=30 :0 s=1 pgs=0 cs=0 l=1 c=0x4e51600).fault
>> > 2015-02-04 11:55:22.799350 7fc482d7d700  0 -- 10.76.70.4:6997/17094790 >> 10.76.70.10:6887/30410592 pipe(0x6a0cb00 sd=271 :53090 s=2 pgs=15372 cs=1 l=0 c=0xf3a6f20).fault with nothing to send, going to standby
>> > 2015-02-04 11:55:22.800018 7fc46429a700  0 -- 10.76.70.4:0/94790 >> 10.76.28.41:7076/37144571 pipe(0xba0c840 sd=59 :0 s=1 pgs=0 cs=0 l=1 c=0xf339760).fault
>> > 2015-02-04 11:55:22.803086 7fc482272700  0 -- 10.76.70.4:6997/17094790 >> 10.76.70.5:6867/17011547 pipe(0x12f998c0 sd=294 :6997 s=2 pgs=46095 cs=1 l=0 c=0x8382000).fault with nothing to send, going to standby
>> > 2015-02-04 11:55:22.804736 7fc4892e1700  0 -- 10.76.70.4:6997/17094790 >> 10.76.70.13:6852/9142109 pipe(0x12fa5b80 sd=163 :57056 s=2 pgs=45269 cs=1 l=0 c=0x189d1600).fault with nothing to send, going to standby
>> >
>> > The IPs mentioned there are all OSD servers.
>> >
>> > For me it feels like the monitors still have some "memory" of
>> > the failed OSDs and something happens when one of them goes
>> > down. If I can provide you with any more information to clarify the
>> > issue, just tell me what you need.
>> >
>> > Regards,
>> > Christian
>> >
>> > Am 03.02.2015 18:10, schrieb Gregory Farnum:
>> >> On Tue, Feb 3, 2015 at 3:38 AM, Christian Eichelmann
>> >> <christian.eichelmann@xxxxxxxx> wrote:
>> >>> Hi all,
>> >>>
>> >>> during some failover tests and some configuration tests, we are
>> >>> currently observing a strange phenomenon:
>> >>>
>> >>> Restarting one of our monitors (5 in total) triggers about 300 of
>> >>> the following events:
>> >>>
>> >>> osd.669 10.76.28.58:6935/149172 failed (20 reports from 20 peers after 22.005858 >= grace 20.000000)
>> >>>
>> >>> The OSDs come back up shortly after they have been marked down.
>> >>> What I don't understand is: how can a restart of one monitor
>> >>> prevent the OSDs from talking to each other, causing them to be marked down?
>> >>>
>> >>> FYI:
>> >>> We are currently using the following settings:
>> >>> mon osd adjust heartbeat grace = false
>> >>> mon osd min down reporters = 20
>> >>> mon osd adjust down out interval = false
>> >>
>> >> That's really strange. I think maybe you're seeing some kind of
>> >> secondary effect; what kind of CPU usage are you seeing on the
>> >> monitors during this time? Have you checked the log on any OSDs
>> >> which have been marked down?
>> >>
>> >> I have a suspicion that maybe the OSDs are detecting their failed
>> >> monitor connection and are not able to reconnect to another
>> >> monitor quickly enough, but I'm not certain what the overlaps are there.
>> >> -Greg
>> >>
>> >
>> >
>> > --
>> > Christian Eichelmann
>> > Systemadministrator
>> >
>> > 1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
>> > Brauerstraße 48 · DE-76135 Karlsruhe
>> > Telefon: +49 721 91374-8026
>> > christian.eichelmann@xxxxxxxx
>> >
>> > Amtsgericht Montabaur / HRB 6484
>> > Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich,
>> > Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver
>> > Mauss, Jan Oetjen
>> > Aufsichtsratsvorsitzender: Michael Scheeren
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@xxxxxxxxxxxxxx
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




