Re: monitor takes long time to join quorum: STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH got BADAUTHORIZER

On Thu, Sep 21, 2017 at 3:02 AM, Sean Purdy <s.purdy@xxxxxxxxxxxxxxxx> wrote:
> On Wed, 20 Sep 2017, Gregory Farnum said:
>> That definitely sounds like a time sync issue. Are you *sure* they matched
>> each other?
>
> NTP looked OK at the time.  But see below.
>
>
>> Is it reproducible on restart?
>
> Today I did a straight reboot - and it was fine, no issues.
>
>
> The issue occurs after the machine is off for a number of hours, or has been worked on in the BIOS for a number of hours and then booted.  And then perhaps waited at the disk decrypt key prompt.
>
> So I'd suspect hardware clock drift at those times.  (Using Dell R720xd machines)
>
>
> Logs show a time change a few seconds after boot.  After boot it's running NTP and within that 45 minute period the NTP state looks the same as the other nodes in the (small) cluster.
>
> How much drift is allowed between monitors?
>
>
> Logs say:
>
> Sep 20 09:45:21 store03 ntp[2329]: Starting NTP server: ntpd.
> Sep 20 09:45:21 store03 ntpd[2462]: proto: precision = 0.075 usec (-24)
> ...
> Sep 20 09:46:44 store03 systemd[1]: Time has been changed
> Sep 20 09:46:44 store03 ntpd[2462]: receive: Unexpected origin timestamp 0xdd6ca972.c694801d does not match aorg 0000000000.00000000 from server@172.16.0.16 xmt 0xdd6ca974.0c5c18f
>
> So system time was changed about 6 seconds after disks were unlocked/boot proceeded.  But there was still 45 minutes of monitor messages after that.  Surely the time should have converged sooner than 45 minutes?

Hmm, yeah. The TTL on cephx tickets is 1 hour, so unless the clock was
pretty far off and that somehow kept the tickets from being
regenerated, it shouldn't have taken 45 minutes.
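(For reference, the knob behind that one-hour figure should be the
cephx service-ticket TTL -- something like the snippet below in
ceph.conf. The exact value shown is just the documented default, not a
recommendation to change it:)

```ini
# Illustrative ceph.conf fragment -- shows where the 1-hour cephx
# window comes from; lowering it in production is not being suggested.
[global]
# default is 3600 seconds (1 hour); service tickets are reissued
# when this expires
auth service ticket ttl = 3600
```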

You might be able to get more info out of it by turning up the debug
level on the other monitors, and also raising the debug_auth value on
them. :/
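e.g. something along these lines on the surviving mons -- the levels
here are just a guess at what's useful (20 is very verbose), and you
can also inject them into running daemons with
`ceph tell mon.<id> injectargs` instead of editing the config:

```ini
# Illustrative debug settings for the monitors' ceph.conf;
# remember to turn them back down afterwards.
[mon]
debug mon  = 10
debug auth = 20
debug ms   = 1
```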
-Greg

>
>
>
> NTP from today, post-problem.  But ntpq at the time of the problem looked just as OK:
>
> store01:~$ ntpstat
> synchronised to NTP server (172.16.0.19) at stratum 3
>    time correct to within 47 ms
>
> store02$ ntpstat
> synchronised to NTP server (172.16.0.19) at stratum 3
>    time correct to within 63 ms
>
> store03:~$ sudo ntpstat
> synchronised to NTP server (172.16.0.19) at stratum 3
>    time correct to within 63 ms
>
> store03:~$ ntpq -p
>      remote           refid      st t when poll reach   delay   offset  jitter
> ==============================================================================
> +172.16.0.16     85.91.1.164      3 u  561 1024  377    0.287    0.554   0.914
> +172.16.0.18     94.125.129.7     3 u  411 1024  377    0.388   -0.331   0.139
> *172.16.0.19     158.43.128.33    2 u  289 1024  377    0.282   -0.005   0.103
>
>
> Sean
>
>
>> On Wed, Sep 20, 2017 at 2:50 AM Sean Purdy <s.purdy@xxxxxxxxxxxxxxxx> wrote:
>>
>> >
>> > Hi,
>> >
>> >
>> > Luminous 12.2.0
>> >
>> > Three node cluster, 18 OSD, debian stretch.
>> >
>> >
>> > One node is down for maintenance for several hours.  When bringing it back
>> > up, OSDs rejoin after 5 minutes, but health is still warning.  monitor has
>> > not joined quorum after 40 minutes and logs show BADAUTHORIZER message
>> > every time the monitor tries to connect to the leader.
>> >
>> > 2017-09-20 09:46:05.581590 7f49e2b29700  0 -- 172.16.0.45:0/2243 >>
>> > 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1
>> > s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0
>> > l=0).handle_connect_reply connect got BADAUTHORIZER
>> >
>> > Then after ~45 minutes monitor *does* join quorum.
>> >
>> > I'm presuming this isn't normal behaviour?  Or if it is, let me know and I
>> > won't worry.
>> >
>> > All three nodes are using ntp and look OK timewise.
>> >
>> >
>> > ceph-mon log:
>> >
>> > (.43 is leader, .45 is rebooted node, .44 is other live node in quorum)
>> >
>> > Boot:
>> >
>> > 2017-09-20 09:45:21.874152 7f49efeb8f80  0 ceph version 12.2.0
>> > (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc), process
>> > (unknown), pid 2243
>> >
>> > 2017-09-20 09:46:01.824708 7f49e1b27700  0 -- 172.16.0.45:6789/0 >>
>> > 172.16.0.44:6789/0 conn(0x56007244d000 :6789
>> > s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
>> > accept connect_seq 3 vs existing csq=0 existing_state=STATE_CONNECTING
>> > 2017-09-20 09:46:01.824723 7f49e1b27700  0 -- 172.16.0.45:6789/0 >>
>> > 172.16.0.44:6789/0 conn(0x56007244d000 :6789
>> > s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
>> > accept we reset (peer sent cseq 3, 0x5600722c0000.cseq = 0), sending
>> > RESETSESSION
>> > 2017-09-20 09:46:01.825247 7f49e1b27700  0 -- 172.16.0.45:6789/0 >>
>> > 172.16.0.44:6789/0 conn(0x56007244d000 :6789
>> > s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
>> > accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING
>> > 2017-09-20 09:46:01.828053 7f49e1b27700  0 -- 172.16.0.45:6789/0 >>
>> > 172.16.0.44:6789/0 conn(0x5600722c0000 :-1
>> > s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=21872 cs=1 l=0).process
>> > missed message?  skipped from seq 0 to 552717734
>> >
>> > 2017-09-20 09:46:05.580342 7f49e1b27700  0 -- 172.16.0.45:6789/0 >>
>> > 172.16.0.43:6789/0 conn(0x5600720fe800 :-1
>> > s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=49261 cs=1 l=0).process
>> > missed message?  skipped from seq 0 to 1151972199
>> > 2017-09-20 09:46:05.581097 7f49e2b29700  0 -- 172.16.0.45:0/2243 >>
>> > 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1
>> > s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0
>> > l=0).handle_connect_reply connect got BADAUTHORIZER
>> > 2017-09-20 09:46:05.581590 7f49e2b29700  0 -- 172.16.0.45:0/2243 >>
>> > 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1
>> > s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0
>> > l=0).handle_connect_reply connect got BADAUTHORIZER
>> > ...
>> > [message repeats for 45 minutes]
>> > ...
>> > 2017-09-20 10:23:38.818767 7f49e2b29700  0 -- 172.16.0.45:0/2243 >>
>> > 172.16.0.43:6812/2422 conn(0x5600720fb800 :-1
>> > s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0
>> > l=0).handle_connect_reply connect
>> >  got BADAUTHORIZER
>> >
>> >
>> > At this point, "ceph mon stat" says .45/store03 not in quorum:
>> >
>> > e5: 3 mons at {store01=
>> > 172.16.0.43:6789/0,store02=172.16.0.44:6789/0,store03=172.16.0.45:6789/0},
>> > election epoch 376, leader 0 store01, quorum 0,1 store01,store02
>> >
>> >
>> > Then suddenly a valid connection is made and sync happens:
>> >
>> > 2017-09-20 10:23:43.041009 7f49e5b2f700  1 mon.store03@2(synchronizing).mds
>> > e1 Unable to load 'last_metadata'
>> > 2017-09-20 10:23:43.041967 7f49e5b2f700  1 mon.store03@2(synchronizing).osd
>> > e2381 e2381: 18 total, 13 up, 14 in
>> > ...
>> > 2017-09-20 10:23:43.045961 7f49e5b2f700  1 mon.store03@2(synchronizing).osd
>> > e2393 e2393: 18 total, 15 up, 15 in
>> > ...
>> > 2017-09-20 10:23:43.049255 7f49e5b2f700  1 mon.store03@2(synchronizing).osd
>> > e2406 e2406: 18 total, 18 up, 18 in
>> > ...
>> > 2017-09-20 10:23:43.054828 7f49e5b2f700  0 log_channel(cluster) log [INF]
>> > : mon.store03 calling new monitor election
>> > 2017-09-20 10:23:43.054901 7f49e5b2f700  1 mon.store03@2(electing).elector(372)
>> > init, last seen epoch 372
>> >
>> >
>> > Now "ceph mon stat" says:
>> >
>> > e5: 3 mons at {store01=
>> > 172.16.0.43:6789/0,store02=172.16.0.44:6789/0,store03=172.16.0.45:6789/0},
>> > election epoch 378, leader 0 store01, quorum 0,1,2 store01,store02,store03
>> >
>> > and everything's happy.
>> >
>> >
>> > What should I look for/fix?  It's a fairly vanilla system.
>> >
>> >
>> > Thanks in advance,
>> >
>> > Sean Purdy
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@xxxxxxxxxxxxxx
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >


