Re: monitor ghosted

As oddly as it drifted away, it came back. Next time, should there be a next time, I will snag logs as suggested by Sascha.

The window for all this, in local time: disassociated at 9:02 am; associated again at 11:20 pm. No changes were made, though I did reboot the mon02 host at 1 pm. No other network or host issues were observed in the rest of the cluster or at the site.

Thank you for your replies, and I'll gather better logging next time.

peter


Peter Eisch
Senior Site Reliability Engineer

From: Brad Hubbard <bhubbard@xxxxxxxxxx>
Date: Wednesday, January 8, 2020 at 6:21 PM
To: Peter Eisch <peter.eisch@xxxxxxxxxxxxxxx>
Cc: "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
Subject: Re: monitor ghosted


On Thu, Jan 9, 2020 at 5:48 AM Peter Eisch <peter.eisch@xxxxxxxxxxxxxxx> wrote:
Hi,

This morning one of my three monitor hosts got booted from the Nautilus 14.2.4 cluster and it won't rejoin. There haven't been any changes or events at this site at all. The conf file is unchanged and the same as on the other two monitors. The host is also running the MDS and MGR daemons without any issue. The ceph-mon log shows this repeating:

2020-01-08 13:33:29.403 7fec1a736700 1 mon.cephmon02@1(probing) e7 handle_auth_request failed to assign global_id
2020-01-08 13:33:29.433 7fec1a736700 1 mon.cephmon02@1(probing) e7 handle_auth_request failed to assign global_id
2020-01-08 13:33:29.541 7fec1a736700 1 mon.cephmon02@1(probing) e7 handle_auth_request failed to assign global_id
...

Try gathering a log with debug_mon 20. That should provide more detail about why AuthMonitor::_assign_global_id() didn't return an ID.
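For reference, a couple of ways to raise that on the affected host (a sketch; the mon id cephmon02 is taken from the log lines above, and the admin socket path is assumed to be the default):

# via the mon's admin socket on cephmon02:
ceph daemon mon.cephmon02 config set debug_mon 20

# or persistently: set "debug mon = 20" under [mon] in ceph.conf on that
# host and restart the ceph-mon service, then revert once the log is captured.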


There is nothing in the logs of the two remaining/healthy monitors. What is the best practice to get this host back into the cluster?

peter
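For what it's worth, the probing mon's own view can also be checked directly, independent of the healthy mons (a sketch, again assuming the mon id cephmon02 and a working admin socket):

# on cephmon02:
ceph daemon mon.cephmon02 mon_status
# "state" should show "probing", and the output includes the monmap this
# mon holds, which is worth comparing against the monmap on the healthy mons.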



--
Cheers,
Brad

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
