Re: Monitor Restart triggers half of our OSDs marked down

On Thu, Feb 5, 2015 at 9:54 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Thu, 5 Feb 2015, Dan van der Ster wrote:
>> Hi,
>> We also have seen this once after upgrading to 0.80.8 (from dumpling).
>> Last week we had a network outage which marked out around 1/3rd of our
>> OSDs. The outage lasted less than a minute -- all the OSDs were
>> brought up once the network was restored.
>>
>> Then 30 minutes later I restarted one monitor to roll out a small
>> config change (changing leveldb log path). Surprisingly that resulted
>> in many OSDs (but seemingly fewer than before) being marked out again
>> then quickly marked in again.
>
> Did the 'wrongly marked down' messages appear in ceph.log?
>

Yes. Here's the start of the real outage:

2015-01-29 10:59:52.452367 mon.0 128.142.35.220:6789/0 417 : [INF] pgmap v35845354: 24608 pgs: 10 active+clean+scrubbing+deep, 24596 active+clean, 2 active+clean+scrubbing; 125 TB data, 377 TB used, 2021 TB / 2399 TB avail; 5137 kB/s rd, 31239 kB/s wr, 641 op/s
2015-01-29 10:59:52.618591 mon.0 128.142.35.220:6789/0 431 : [INF]
osd.1129 128.142.23.100:6912/90010 failed (3 reports from 3 peers
after 20.000159 >= grace 20.000000)
2015-01-29 10:59:52.939018 mon.0 128.142.35.220:6789/0 479 : [INF]
osd.1080 128.142.23.111:6869/124575 failed (3 reports from 3 peers
after 20.000181 >= grace 20.000000)
2015-01-29 10:59:53.147616 mon.0 128.142.35.220:6789/0 510 : [INF]
osd.1113 128.142.23.107:6810/25247 failed (3 reports from 3 peers
after 20.525957 >= grace 20.000000)
2015-01-29 10:59:53.342428 mon.0 128.142.35.220:6789/0 538 : [INF]
osd.1136 128.142.23.100:6864/86061 failed (3 reports from 3 peers
after 20.403032 >= grace 20.000000)
2015-01-29 10:59:53.342557 mon.0 128.142.35.220:6789/0 540 : [INF]
osd.1144 128.142.23.100:6883/91968 failed (3 reports from 3 peers
after 20.403104 >= grace 20.000000)

Then many "wrongly marked me down" messages around a minute later when
the network came back.
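For reference, the condition printed in those failure lines -- "N reports from M peers after T >= grace G" -- combines a total report count, a distinct-reporter count, and an elapsed-time grace (the latter two tunable via the mon osd min down reports/reporters options). A minimal sketch of that check, in illustrative Python rather than Ceph's actual code:

```python
from dataclasses import dataclass, field

@dataclass
class PendingFailure:
    """Bookkeeping for one OSD that peers have reported as failed."""
    first_report: float                           # time the first report arrived
    reporters: set = field(default_factory=set)   # distinct reporting peers
    num_reports: int = 0                          # total reports received

def should_mark_down(p: PendingFailure, now: float, grace: float = 20.0,
                     min_reports: int = 1, min_reporters: int = 1) -> bool:
    """Mark the OSD down once enough reports from enough distinct peers
    have accumulated and the elapsed time has reached the grace period
    (the 'N reports from M peers after T >= grace G' in the log lines)."""
    elapsed = now - p.first_report
    return (p.num_reports >= min_reports
            and len(p.reporters) >= min_reporters
            and elapsed >= grace)

# Matches "3 reports from 3 peers after 20.000159 >= grace 20.000000"
p = PendingFailure(first_report=0.0,
                   reporters={"osd.a", "osd.b", "osd.c"}, num_reports=3)
print(should_mark_down(p, now=20.000159))   # True
```

With Christian's "mon osd min down reporters = 20" setting, the same check would require 20 distinct peers before marking an OSD down.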

But then when I restarted the (peon) monitor:

2015-01-29 11:29:18.250750 mon.0 128.142.35.220:6789/0 10570 : [INF] pgmap v35847068: 24608 pgs: 1 active+clean+scrubbing+deep, 24602 active+clean, 5 active+clean+scrubbing; 125 TB data, 377 TB used, 2021 TB / 2399 TB avail; 193 MB/s rd, 238 MB/s wr, 7410 op/s
2015-01-29 11:29:28.844678 mon.3 128.142.39.77:6789/0 1 : [INF] mon.2
calling new monitor election
2015-01-29 11:29:33.846946 mon.2 128.142.36.229:6789/0 9 : [INF] mon.4
calling new monitor election
2015-01-29 11:29:33.847022 mon.4 128.142.39.144:6789/0 7 : [INF] mon.3
calling new monitor election
2015-01-29 11:29:33.847085 mon.1 128.142.36.227:6789/0 24 : [INF]
mon.1 calling new monitor election
2015-01-29 11:29:33.853498 mon.3 128.142.39.77:6789/0 2 : [INF] mon.2
calling new monitor election
2015-01-29 11:29:33.895660 mon.0 128.142.35.220:6789/0 10860 : [INF]
mon.0 calling new monitor election
2015-01-29 11:29:33.901335 mon.0 128.142.35.220:6789/0 10861 : [INF]
mon.0@0 won leader election with quorum 0,1,2,3,4
2015-01-29 11:29:34.004028 mon.0 128.142.35.220:6789/0 10862 : [INF]
monmap e5: 5 mons at
{0=128.142.35.220:6789/0,1=128.142.36.227:6789/0,2=128.142.39.77:6789/0,3=128.142.39.144:6789/0,4=128.142.36.229:6789/0}
2015-01-29 11:29:34.005808 mon.0 128.142.35.220:6789/0 10863 : [INF] pgmap v35847069: 24608 pgs: 1 active+clean+scrubbing+deep, 24602 active+clean, 5 active+clean+scrubbing; 125 TB data, 377 TB used, 2021 TB / 2399 TB avail; 54507 kB/s rd, 85412 kB/s wr, 1967 op/s
2015-01-29 11:29:34.006111 mon.0 128.142.35.220:6789/0 10864 : [INF]
mdsmap e157: 1/1/1 up {0=0=up:active}
2015-01-29 11:29:34.007165 mon.0 128.142.35.220:6789/0 10865 : [INF]
osdmap e132055: 880 osds: 880 up, 880 in
2015-01-29 11:29:34.037367 mon.0 128.142.35.220:6789/0 11055 : [INF]
osd.1202 128.142.23.104:6801/98353 failed (4 reports from 3 peers
after 29.673699 >= grace 28.948726)
2015-01-29 11:29:34.050478 mon.0 128.142.35.220:6789/0 11139 : [INF]
osd.1164 128.142.23.102:6850/22486 failed (3 reports from 2 peers
after 30.685537 >= grace 28.946983)


and then just after:

2015-01-29 11:29:35.210184 osd.1202 128.142.23.104:6801/98353 59 :
[WRN] map e132056 wrongly marked me down
2015-01-29 11:29:35.441922 osd.1164 128.142.23.102:6850/22486 25 :
[WRN] map e132056 wrongly marked me down

plus many other OSDs like that.
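As an aside, note that the grace in those two post-election failure lines is ~28.9 s rather than the flat 20 s. With heartbeat-grace adjustment enabled, the monitor pads the configured grace according to each OSD's history of being wrongly marked down ("laggy"). Very roughly, as an illustrative sketch (not Ceph's exact formula; the probability/interval inputs here are invented):

```python
def effective_grace(base_grace: float, laggy_probability: float,
                    laggy_interval: float) -> float:
    """Pad the base grace by an estimate of how likely this OSD is to be
    slow rather than dead, scaled by how long it tends to stay laggy."""
    return base_grace + laggy_probability * laggy_interval

# e.g. an OSD that historically was laggy ~30% of the time for ~30 s:
print(round(effective_grace(20.0, 0.3, 29.8), 1))   # 28.9
```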

Here is the full ceph.log from that day:
https://www.dropbox.com/s/kzb2e3h3uirhejm/ceph.log-20150130.gz?dl=0
And here is ceph-mon.1.log:
https://www.dropbox.com/s/w1mx8rcflffp5e8/ceph-mon.0.log-20150130.gz?dl=0
It has a bit more info (prepare_failure, 'ignoring stats from
non-active osd'), but alas I had only debug mon = 1.

Cheers, Dan

>> I only have the lowest level logs from this incident -- but I think
>> it's easily reproducible.
>
> Logs with debug ms = 1 and debug mon = 20 would be best if someone is able
> to reproduce this.  I think our QA wouldn't capture this case because we
> are either thrashing monitors and not OSDs (so the OSDs are never marked
> down) or thrashing OSDs and not mons (and ignoring wrongly-marked-down
> events).
>
> sage
>
>>
>> Cheers, Dan
>>
>>
>> On Wed, Feb 4, 2015 at 12:06 PM, Christian Eichelmann
>> <christian.eichelmann@xxxxxxxx> wrote:
>> > Hi Greg,
>> >
>> > the behaviour is indeed strange. Today I tried to reproduce the
>> > problem, but no matter which monitor I restarted, and no matter how
>> > many times, the behaviour was as expected: a new monitor election was
>> > called and everything continued normally.
>> >
>> > Then I continued my failover tests and simulated the failure of two
>> > racks with iptables (for us: 2 MONs and 6 OSD servers with 360 OSDs in total).
>> >
>> > Afterwards I tried again to restart one monitor and again about 240 OSDs
>> > got marked as down.
>> >
>> > There was no load on our monitor servers in that period. On one of the
>> > OSDs which got marked down I found lots of these messages:
>> >
>> > 2015-02-04 11:55:22.788245 7fc48fa48700  0 -- 10.76.70.4:6997/17094790
>> >>> 10.76.70.8:6806/3303244 pipe(0x7a1b600 sd=198 :59766 s=2 pgs=1353
>> > cs=1 l=0 c=0x4e562c0).fault with nothing to send, going to standby
>> > 2015-02-04 11:55:22.788371 7fc48be0c700  0 -- 10.76.70.4:6997/17094790
>> >>> 10.76.70.8:6842/12012876 pipe(0x895e840 sd=188 :49283 s=2 pgs=36873
>> > cs=1 l=0 c=0x13226f20).fault with nothing to send, going to standby
>> > 2015-02-04 11:55:22.788458 7fc494e9c700  0 -- 10.76.70.4:6997/17094790
>> >>> 10.76.70.13:6870/13021609 pipe(0x13ace2c0 sd=117 :64130 s=2 pgs=38011
>> > cs=1 l=0 c=0x52b4840).fault with nothing to send, going to standby
>> > 2015-02-04 11:55:22.797107 7fc46459d700  0 -- 10.76.70.4:0/94790 >>
>> > 10.76.70.11:6980/37144571 pipe(0xba0c580 sd=30 :0 s=1 pgs=0 cs=0 l=1
>> > c=0x4e51600).fault
>> > 2015-02-04 11:55:22.799350 7fc482d7d700  0 -- 10.76.70.4:6997/17094790
>> >>> 10.76.70.10:6887/30410592 pipe(0x6a0cb00 sd=271 :53090 s=2 pgs=15372
>> > cs=1 l=0 c=0xf3a6f20).fault with nothing to send, going to standby
>> > 2015-02-04 11:55:22.800018 7fc46429a700  0 -- 10.76.70.4:0/94790 >>
>> > 10.76.28.41:7076/37144571 pipe(0xba0c840 sd=59 :0 s=1 pgs=0 cs=0 l=1
>> > c=0xf339760).fault
>> > 2015-02-04 11:55:22.803086 7fc482272700  0 -- 10.76.70.4:6997/17094790
>> >>> 10.76.70.5:6867/17011547 pipe(0x12f998c0 sd=294 :6997 s=2 pgs=46095
>> > cs=1 l=0 c=0x8382000).fault with nothing to send, going to standby
>> > 2015-02-04 11:55:22.804736 7fc4892e1700  0 -- 10.76.70.4:6997/17094790
>> >>> 10.76.70.13:6852/9142109 pipe(0x12fa5b80 sd=163 :57056 s=2 pgs=45269
>> > cs=1 l=0 c=0x189d1600).fault with nothing to send, going to standby
>> >
>> > The IPs mentioned there are all OSD Server.
>> >
>> > For me it feels like the monitors still have some "memory" of the
>> > failed OSDs, and something happens when one of them goes down. If I
>> > can provide any more information to clarify the issue, just tell me
>> > what you need.
>> >
>> > Regards,
>> > Christian
>> >
>> > Am 03.02.2015 18:10, schrieb Gregory Farnum:
>> >> On Tue, Feb 3, 2015 at 3:38 AM, Christian Eichelmann
>> >> <christian.eichelmann@xxxxxxxx> wrote:
>> >>> Hi all,
>> >>>
>> >>> during some failover tests and some configuration tests, we currently
>> >>> discover a strange phenomenon:
>> >>>
>> >>> Restarting one of our monitors (5 in total) triggers about 300 of the
>> >>> following events:
>> >>>
>> >>> osd.669 10.76.28.58:6935/149172 failed (20 reports from 20 peers after
>> >>> 22.005858 >= grace 20.000000)
>> >>>
>> >>> The OSDs come back up shortly after they have been marked down. What I
>> >>> don't understand is: how can restarting one monitor stop the OSDs
>> >>> from talking to each other and get them marked down?
>> >>>
>> >>> FYI:
>> >>> We are currently using the following settings:
>> >>> mon osd adjust heartbeat grace = false
>> >>> mon osd min down reporters = 20
>> >>> mon osd adjust down out interval = false
>> >>
>> >> That's really strange. I think maybe you're seeing some kind of
>> >> secondary effect; what kind of CPU usage are you seeing on the
>> >> monitors during this time? Have you checked the log on any OSDs which
>> >> have been marked down?
>> >>
>> >> I have a suspicion that maybe the OSDs are detecting their failed
>> >> monitor connection and not being able to reconnect to another monitor
>> >> quickly enough, but I'm not certain what the overlaps are there.
>> >> -Greg
>> >>
>> >
>> >
>> > --
>> > Christian Eichelmann
>> > Systemadministrator
>> >
>> > 1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
>> > Brauerstraße 48 · DE-76135 Karlsruhe
>> > Telefon: +49 721 91374-8026
>> > christian.eichelmann@xxxxxxxx
>> >
>> > Amtsgericht Montabaur / HRB 6484
>> > Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
>> > Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
>> > Aufsichtsratsvorsitzender: Michael Scheeren
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@xxxxxxxxxxxxxx
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
