Hi,

We had a monitor drop out of quorum a few weeks back and we have been unable to bring it back into sync. When starting, it synchronises the OSD maps and then restarts from scratch every time. With the logging turned up to level 20, we see this:

2017-11-07 09:31:57.230333 7f2da95ff700 10 mon.ceph-mon-02@1(synchronizing) e6 handle_sync mon_sync(chunk cookie 213137752259 lc 128962692 bl 951895 bytes last_key osdmap,1405040) v2
2017-11-07 09:31:57.230336 7f2da95ff700 10 mon.ceph-mon-02@1(synchronizing) e6 handle_sync_chunk mon_sync(chunk cookie 213137752259 lc 128962692 bl 951895 bytes last_key osdmap,1405040) v2
2017-11-07 09:31:57.296945 7f2da95ff700 10 mon.ceph-mon-02@1(synchronizing) e6 sync_reset_timeout
2017-11-07 09:31:57.296975 7f2da95ff700 20 mon.ceph-mon-02@1(synchronizing) e6 sync_get_next_chunk cookie 213137752259 provider mon.2 10.132.194.132:6789/0
2017-11-07 09:31:58.190967 7f2da95ff700 20 mon.ceph-mon-02@1(synchronizing) e6 _ms_dispatch existing session 0x561f3c7f2c40 for mon.2 10.132.194.132:6789/0
2017-11-07 09:31:58.190978 7f2da95ff700 20 mon.ceph-mon-02@1(synchronizing) e6 caps allow *
2017-11-07 09:31:58.190999 7f2da95ff700 20 is_capable service=mon command= read on cap allow *
2017-11-07 09:31:58.191003 7f2da95ff700 20 allow so far , doing grant allow *
2017-11-07 09:31:58.191005 7f2da95ff700 20 allow all
2017-11-07 09:31:58.191008 7f2da95ff700 10 mon.ceph-mon-02@1(synchronizing) e6 handle_sync mon_sync(chunk cookie 213137752259 lc 128962692 bl 1048394 bytes last_key osdmap,883766) v2
2017-11-07 09:31:58.191013 7f2da95ff700 10 mon.ceph-mon-02@1(synchronizing) e6 handle_sync_chunk mon_sync(chunk cookie 213137752259 lc 128962692 bl 1048394 bytes last_key osdmap,883766) v2
2017-11-07 09:31:58.315140 7f2da95ff700 10 mon.ceph-mon-02@1(synchronizing) e6 sync_reset_timeout
2017-11-07 09:31:58.315170 7f2da95ff700 20 mon.ceph-mon-02@1(synchronizing) e6 sync_get_next_chunk cookie 213137752259 provider mon.2 10.132.194.132:6789/0
2017-11-07 09:32:00.418905 7f2da95ff700 20 mon.ceph-mon-02@1(synchronizing) e6 _ms_dispatch existing session 0x561f3c7f2c40 for mon.2 10.132.194.132:6789/0
2017-11-07 09:32:00.418918 7f2da95ff700 20 mon.ceph-mon-02@1(synchronizing) e6 caps allow *
2017-11-07 09:32:00.418932 7f2da95ff700 20 is_capable service=mon command= read on cap allow *
2017-11-07 09:32:00.418936 7f2da95ff700 20 allow so far , doing grant allow *
2017-11-07 09:32:00.418943 7f2da95ff700 20 allow all

As you can see, it reaches the most recent osdmap and then rolls back to the first one. There is nothing in the log to indicate why it is restarting, although I would not be surprised if the monitor is simply too far out of sync by the time it reaches this point in the sync process.

Unfortunately our cluster has been unhealthy for a while because it effectively ran out of space, which is why there are so many osdmaps hanging around at the moment. It is generally trending back towards health and should get there fairly shortly, at which point it will conduct the mother of all cleanup operations.
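For completeness, this is roughly how we turned the logging up and have been watching the syncing mon; the exact commands here are illustrative (admin socket on ceph-mon-02), so adjust the mon ID for your own setup:

# Raise mon debug to 20 at runtime via the admin socket (no restart needed):
ceph daemon mon.ceph-mon-02 config set debug_mon 20/20

# Confirm the mon is stuck in the synchronizing state:
ceph daemon mon.ceph-mon-02 mon_status

The equivalent persistent setting would be "debug mon = 20/20" under [mon] in ceph.conf, followed by a restart of that mon.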
Any help/pointers would be greatly appreciated. Our ceph.conf is below:

[global]
fsid = caaaba57-1ec7-461a-9211-ea166d311820
mon initial members = ceph-mon-01, ceph-mon-03, ceph-mon-05
# We should REALLY use DNS here, eg mon.txc1.us.livelink.io
# librados understands RR DNS
mon host = 10.132.194.128,10.132.194.130,10.132.194.132
mon data avail crit = 1
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
rbd default features = 3
osd_map_message_max = 20
rados osd op timeout = 60

[mon]
# This is a big cluster, we should stay low on PGs
mon_osd_min_up_ratio = 0.7
mon_pg_warn_min_per_osd = 20
mon_pg_warn_max_per_osd = 200
mon_warn_on_legacy_crush_tunables = false
# Prevent flapping of OSDs marking each other down
mon_osd_min_down_reporters = 10
mon_osd_min_down_reports = 5
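In case it is relevant (mon data avail crit = 1 is only there because of the space crunch), this is roughly how we have been keeping an eye on how large the mon stores and the osdmap history have grown; the store path assumes the default /var/lib/ceph layout, so adjust for your deployment:

# Size of each monitor's backing store:
du -sh /var/lib/ceph/mon/*/store.db

# Current osdmap epoch as seen by the in-quorum mons:
ceph osd stat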