Thanks Sage!

On Fri, Nov 13, 2015 at 4:15 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Fri, 13 Nov 2015, Guang Yang wrote:
>> I was wrong in my previous analysis; it was not the iterator getting
>> reset. The problem I can see now is that during the sync a new round
>> of election kicked off, so the quorum had to probe the newly added
>> monitor. Since it had not finished syncing yet, it restarted the sync
>> from there.
>
> What version is this? I think this is something we fixed a while back?

This is on Giant (c51c8f9d80fa4e0168aa52685b8de40e42758578). Is there a
commit I can take a look at?

>
>> Hi Sage and Joao,
>> Is there a way to freeze elections via some tunable to let the sync
>> finish?
>
> We can't not do elections when something is asking for one (e.g., mon
> is down).

I see. Is there an operational workaround we could try? From the log I
found the election was triggered by an accept timeout, so I increased the
timeout value, hoping to squeeze the sync in between elections. Does that
sound like a reasonable workaround?

>
> sage
>
>>
>> Thanks,
>> Guang
>>
>> On Fri, Nov 13, 2015 at 9:00 AM, Guang Yang <guangyy@xxxxxxxxx> wrote:
>> > Hi Joao,
>> > We have a problem when trying to add new monitors to an unhealthy
>> > cluster, and I would like to ask for your suggestions.
>> >
>> > After adding the new monitor, it started syncing the store and went
>> > into an infinite loop:
>> >
>> > 2015-11-12 21:02:23.499510 7f1e8030e700 10
>> > mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk
>> > cookie 4513071120 lc 14697737 bl 929616 bytes last_key
>> > osdmap,full_22530) v2
>> > 2015-11-12 21:02:23.712944 7f1e8030e700 10
>> > mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk
>> > cookie 4513071120 lc 14697737 bl 799897 bytes last_key
>> > osdmap,full_3259) v2
>> >
>> > We talked earlier in the morning on IRC, and at the time I thought it
>> > was because the osdmap epoch kept increasing, which led to the
>> > infinite loop.
>> >
>> > I then set the nobackfill/norecovery flags and the osdmap epoch
>> > froze; however, the problem is still there.
>> >
>> > While the osdmap epoch is 22531, the sync always wraps around at
>> > osdmap.full_22530 (as shown by the log above).
>> >
>> > Looking at the code on both sides, it looks like this check
>> > (https://github.com/ceph/ceph/blob/master/src/mon/Monitor.cc#L1389)
>> > is always true, and I can confirm from the log that (sp.last_committed <
>> > paxos->get_version()) was false, so presumably sp.synchronizer always
>> > has a next chunk?
>> >
>> > Does this look familiar to you? Or is there any other troubleshooting
>> > I can try? Thanks very much.
>> >
>> > Thanks,
>> > Guang
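
For reference, a rough sketch of the workaround discussed above, assuming a
Giant-era cluster where the "accept timeout" in the log maps to the paxos
mon_accept_timeout option (later releases replace it with
mon_accept_timeout_factor); option names and values should be double-checked
against the running release:

  # Freeze the osdmap epoch so the osdmap full_* keys stop advancing
  # while the new monitor syncs its store.
  ceph osd set nobackfill
  ceph osd set norecovery

  # Stretch the paxos accept timeout on each quorum monitor so an accept
  # timeout is less likely to trigger an election mid-sync.
  ceph tell mon.<id> injectargs '--mon-accept-timeout 60'

  # Revert once the new monitor has finished syncing and joined the quorum.
  ceph osd unset nobackfill
  ceph osd unset norecovery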