Hi Joao,

We have a problem when trying to add new monitors to an unhealthy cluster, and I would like to ask for your suggestion. After the new monitor was added, it started syncing the store and went into an infinite loop:

2015-11-12 21:02:23.499510 7f1e8030e700 10 mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk cookie 4513071120 lc 14697737 bl 929616 bytes last_key osdmap,full_22530) v2
2015-11-12 21:02:23.712944 7f1e8030e700 10 mon.mon04c011@2(synchronizing) e5 handle_sync_chunk mon_sync(chunk cookie 4513071120 lc 14697737 bl 799897 bytes last_key osdmap,full_3259) v2

We talked early this morning on IRC, and at the time I thought the loop was caused by the osdmap epoch increasing. I then set the nobackfill/norecovery flags and the osdmap epoch froze, but the problem is still there. While the osdmap epoch is 22531, the sync always loops back at osdmap.full_22530 (as shown in the log above).

Looking at the code on both sides, it seems this check (https://github.com/ceph/ceph/blob/master/src/mon/Monitor.cc#L1389) is always true. I can confirm from the log that (sp.last_commited < paxos->get_version()) was false, so the likely explanation is that sp.synchronizer always has a next chunk?

Does this look familiar to you? Or is there any other troubleshooting I can try? Thanks very much.

Thanks,
Guang
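
P.S. For reference, this is roughly how I read that branch on the sync provider side. It is only a paraphrase of the logic as I understand it, not the exact code; the OP_CHUNK/OP_LAST_CHUNK structure below is my own reconstruction of what the check decides:

    // Paraphrase (my reading, not the exact Monitor.cc code): if the provider
    // is still behind paxos OR the store synchronizer reports another chunk,
    // we keep answering with OP_CHUNK, so the joining monitor never receives
    // OP_LAST_CHUNK and never finishes the sync.
    if (sp.last_committed < paxos->get_version() ||   // false in our case
        sp.synchronizer->has_next_chunk()) {          // apparently always true
      reply->op = MMonSync::OP_CHUNK;       // send another chunk
    } else {
      reply->op = MMonSync::OP_LAST_CHUNK;  // sync would complete here
    }

Since the first condition is false for us, has_next_chunk() must keep returning true, which is why I suspect the synchronizer never runs out of keys to send.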