Hi,

We had a monitor drop out of quorum a few weeks back and we have been unable to bring it back into sync. When starting, it synchronises the OSD maps and then restarts from scratch every time. With the logging turned up to level 20, we see this:

2017-11-07 09:31:57.230333 7f2da95ff700 10 mon.ceph-mon-02@1(synchronizing) e6 handle_sync mon_sync(chunk cookie 213137752259 lc 128962692 bl 951895 bytes last_key osdmap,1405040) v2
2017-11-07 09:31:57.230336 7f2da95ff700 10 mon.ceph-mon-02@1(synchronizing) e6 handle_sync_chunk mon_sync(chunk cookie 213137752259 lc 128962692 bl 951895 bytes last_key osdmap,1405040) v2
2017-11-07 09:31:57.296945 7f2da95ff700 10 mon.ceph-mon-02@1(synchronizing) e6 sync_reset_timeout
2017-11-07 09:31:57.296975 7f2da95ff700 20 mon.ceph-mon-02@1(synchronizing) e6 sync_get_next_chunk cookie 213137752259 provider mon.2 10.132.194.132:6789/0
2017-11-07 09:31:58.190967 7f2da95ff700 20 mon.ceph-mon-02@1(synchronizing) e6 _ms_dispatch existing session 0x561f3c7f2c40 for mon.2 10.132.194.132:6789/0
2017-11-07 09:31:58.190978 7f2da95ff700 20 mon.ceph-mon-02@1(synchronizing) e6 caps allow *
2017-11-07 09:31:58.190999 7f2da95ff700 20 is_capable service=mon command= read on cap allow *
2017-11-07 09:31:58.191003 7f2da95ff700 20 allow so far , doing grant allow *
2017-11-07 09:31:58.191005 7f2da95ff700 20 allow all
2017-11-07 09:31:58.191008 7f2da95ff700 10 mon.ceph-mon-02@1(synchronizing) e6 handle_sync mon_sync(chunk cookie 213137752259 lc 128962692 bl 1048394 bytes last_key osdmap,883766) v2
2017-11-07 09:31:58.191013 7f2da95ff700 10 mon.ceph-mon-02@1(synchronizing) e6 handle_sync_chunk mon_sync(chunk cookie 213137752259 lc 128962692 bl 1048394 bytes last_key osdmap,883766) v2
2017-11-07 09:31:58.315140 7f2da95ff700 10 mon.ceph-mon-02@1(synchronizing) e6 sync_reset_timeout
2017-11-07 09:31:58.315170 7f2da95ff700 20 mon.ceph-mon-02@1(synchronizing) e6 sync_get_next_chunk cookie 213137752259 provider mon.2 10.132.194.132:6789/0
2017-11-07 09:32:00.418905 7f2da95ff700 20 mon.ceph-mon-02@1(synchronizing) e6 _ms_dispatch existing session 0x561f3c7f2c40 for mon.2 10.132.194.132:6789/0
2017-11-07 09:32:00.418918 7f2da95ff700 20 mon.ceph-mon-02@1(synchronizing) e6 caps allow *
2017-11-07 09:32:00.418932 7f2da95ff700 20 is_capable service=mon command= read on cap allow *
2017-11-07 09:32:00.418936 7f2da95ff700 20 allow so far , doing grant allow *
2017-11-07 09:32:00.418943 7f2da95ff700 20 allow all

As you can see, it reaches the most recent osdmap and then rolls back to the first one. There is nothing in the log to indicate why it is restarting, although I would not be surprised if the monitor is simply too far out of sync by the time it reaches this point in the sync process.

Unfortunately our cluster has been unhealthy for a while because it effectively ran out of space, which is why there are so many osdmaps hanging around at the moment. It is generally trending back towards health and should get there fairly shortly, at which point it will conduct the mother of all cleanup operations.
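For completeness, this is roughly how we turned the logging up and have been watching the syncing mon; the exact commands here are illustrative (admin socket on ceph-mon-02), so adjust the mon ID for your own setup:

# Raise mon debug to 20 at runtime via the admin socket (no restart needed):
ceph daemon mon.ceph-mon-02 config set debug_mon 20/20

# Confirm the mon is stuck in the synchronizing state:
ceph daemon mon.ceph-mon-02 mon_status

The equivalent persistent setting would be "debug mon = 20/20" under [mon] in ceph.conf, followed by a restart of that mon.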
Any help/pointers would be greatly appreciated. Our ceph.conf is below:

[global]
fsid = caaaba57-1ec7-461a-9211-ea166d311820
mon initial members = ceph-mon-01, ceph-mon-03, ceph-mon-05
# We should REALLY use DNS here, eg mon.txc1.us.livelink.io
# librados understands RR DNS
mon host = 10.132.194.128,10.132.194.130,10.132.194.132
mon data avail crit = 1
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
rbd default features = 3
osd_map_message_max = 20
rados osd op timeout = 60

[mon]
# This is a big cluster, we should stay low on PGs
mon_osd_min_up_ratio = 0.7
mon_pg_warn_min_per_osd = 20
mon_pg_warn_max_per_osd = 200
mon_warn_on_legacy_crush_tunables = false
# Prevent flapping of OSDs marking each other down
mon_osd_min_down_reporters = 10
mon_osd_min_down_reports = 5
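In case it is relevant (mon data avail crit = 1 is only there because of the space crunch), this is roughly how we have been keeping an eye on how large the mon stores and the osdmap history have grown; the store path assumes the default /var/lib/ceph layout, so adjust for your deployment:

# Size of each monitor's backing store:
du -sh /var/lib/ceph/mon/*/store.db

# Current osdmap epoch as seen by the in-quorum mons:
ceph osd stat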