Re: Monitor Restart triggers half of our OSDs marked down

Christian Eichelmann <christian.eichelmann@xxxxxxxx> · Wed, 04 Feb 2015 12:06:02 +0100

Hi Greg,

the behaviour is indeed strange. Today I tried to reproduce the
problem, but no matter which monitor I restarted, and no matter how many
times, the behaviour was as expected: a new monitor election was called
and everything continued normally.

Then I continued my failover tests and simulated the failure of two
racks with iptables (for us: 2 MONs and 6 OSD servers with 360 OSDs in total).

Afterwards I restarted one monitor again, and again about 240 OSDs
got marked down.

There was no load on our monitor servers during that period. On one of the
OSDs that got marked down I found lots of these messages:

2015-02-04 11:55:22.788245 7fc48fa48700  0 -- 10.76.70.4:6997/17094790
>> 10.76.70.8:6806/3303244 pipe(0x7a1b600 sd=198 :59766 s=2 pgs=1353
cs=1 l=0 c=0x4e562c0).fault with nothing to send, going to standby
2015-02-04 11:55:22.788371 7fc48be0c700  0 -- 10.76.70.4:6997/17094790
>> 10.76.70.8:6842/12012876 pipe(0x895e840 sd=188 :49283 s=2 pgs=36873
cs=1 l=0 c=0x13226f20).fault with nothing to send, going to standby
2015-02-04 11:55:22.788458 7fc494e9c700  0 -- 10.76.70.4:6997/17094790
>> 10.76.70.13:6870/13021609 pipe(0x13ace2c0 sd=117 :64130 s=2 pgs=38011
cs=1 l=0 c=0x52b4840).fault with nothing to send, going to standby
2015-02-04 11:55:22.797107 7fc46459d700  0 -- 10.76.70.4:0/94790 >>
10.76.70.11:6980/37144571 pipe(0xba0c580 sd=30 :0 s=1 pgs=0 cs=0 l=1
c=0x4e51600).fault
2015-02-04 11:55:22.799350 7fc482d7d700  0 -- 10.76.70.4:6997/17094790
>> 10.76.70.10:6887/30410592 pipe(0x6a0cb00 sd=271 :53090 s=2 pgs=15372
cs=1 l=0 c=0xf3a6f20).fault with nothing to send, going to standby
2015-02-04 11:55:22.800018 7fc46429a700  0 -- 10.76.70.4:0/94790 >>
10.76.28.41:7076/37144571 pipe(0xba0c840 sd=59 :0 s=1 pgs=0 cs=0 l=1
c=0xf339760).fault
2015-02-04 11:55:22.803086 7fc482272700  0 -- 10.76.70.4:6997/17094790
>> 10.76.70.5:6867/17011547 pipe(0x12f998c0 sd=294 :6997 s=2 pgs=46095
cs=1 l=0 c=0x8382000).fault with nothing to send, going to standby
2015-02-04 11:55:22.804736 7fc4892e1700  0 -- 10.76.70.4:6997/17094790
>> 10.76.70.13:6852/9142109 pipe(0x12fa5b80 sd=163 :57056 s=2 pgs=45269
cs=1 l=0 c=0x189d1600).fault with nothing to send, going to standby

The IPs mentioned there all belong to OSD servers.
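
In case it helps with comparing logs across OSDs, here is a rough Python
sketch that tallies the ".fault" entries above by peer IP. The regular
expression is only inferred from the excerpt in this mail, not taken from
any Ceph documentation, and the function name count_faults_by_peer is made
up for this example.

```python
import re
from collections import Counter

# Pattern inferred from the log excerpt in this mail (an assumption, not an
# official format): "-- <local addr> >> <peer ip>:<port>/<nonce> pipe(...).fault"
FAULT_RE = re.compile(
    r"-- \S+ >> (?P<peer_ip>\d+\.\d+\.\d+\.\d+):\d+/\d+ pipe\(.*?\)\.fault"
)


def count_faults_by_peer(log_text):
    """Return a Counter mapping peer IP to the number of .fault entries."""
    # Mail clients wrap long log lines, so collapse all whitespace first to
    # rejoin each entry before matching.
    flat = " ".join(log_text.split())
    return Counter(m.group("peer_ip") for m in FAULT_RE.finditer(flat))


if __name__ == "__main__":
    import sys

    # Usage: python count_faults.py < ceph-osd.N.log
    for ip, n in count_faults_by_peer(sys.stdin.read()).most_common():
        print(ip, n)
```

Sorting by count should show quickly whether the faults cluster on the
servers in the two racks that were cut off.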

To me it feels like the monitors still have some "memory" of the
failed OSDs, and something happens when one of them goes down. If I
can provide any more information to clarify the issue, just tell me
what you need.

Regards,
Christian

On 03.02.2015 18:10, Gregory Farnum wrote:
> On Tue, Feb 3, 2015 at 3:38 AM, Christian Eichelmann
> <christian.eichelmann@xxxxxxxx> wrote:
>> Hi all,
>>
>> during some failover tests and some configuration tests, we have
>> discovered a strange phenomenon:
>>
>> Restarting one of our monitors (5 in total) triggers about 300 of the
>> following events:
>>
>> osd.669 10.76.28.58:6935/149172 failed (20 reports from 20 peers after
>> 22.005858 >= grace 20.000000)
>>
>> The OSDs come back up shortly after they have been marked down. What I
>> don't understand is: how can a restart of one monitor prevent the OSDs
>> from talking to each other and cause them to be marked down?
>>
>> FYI:
>> We are currently using the following settings:
>> mon osd adjust heartbeat grace = false
>> mon osd min down reporters = 20
>> mon osd adjust down out interval = false
> 
> That's really strange. I think maybe you're seeing some kind of
> secondary effect; what kind of CPU usage are you seeing on the
> monitors during this time? Have you checked the log on any OSDs which
> have been marked down?
> 
> I have a suspicion that maybe the OSDs are detecting their failed
> monitor connection and not being able to reconnect to another monitor
> quickly enough, but I'm not certain what the overlaps are there.
> -Greg
> 

-- 
Christian Eichelmann
System Administrator