On Wed, 20 Jun 2018, cgxu519 wrote:
> Is there any specific log indicating what was happening?
>
> On 06/19/2018 09:56 PM, xiangyang yu wrote:
> > Hi cephers,
> >
> > Recently I met a problem in our production environment.
> > My ceph version is hammer 0.94.5 (it's too old, though).
> >
> > The osdmap (in the osd process) did not update its epoch until the
> > osd was restarted. The osd log displays "wrong node", because the
> > actual peer address is different from the peer address taken from
> > the old osdmap.
> >
> > Before parts of the network (both the public and cluster networks
> > for a range of osds) went down, everything was working well; say
> > the osdmap epoch was 100 at that time. Then parts of the network
> > (both public and cluster) went down for 3~5 minutes. The affected
> > osds (50 out of 156 osds were affected by the failed network) were
> > marked down by heartbeat check failures.
> >
> > After the network recovered, all affected osds except one (let's
> > say osd.8) came back online. osd.8 stayed down and would not come
> > online although the osd.8 process was running. When I checked the
> > osd.8 log, I found that its osdmap epoch was still 100 and did not
> > change after the network failure, but in the ceph cluster the epoch
> > had increased to something bigger, like 160.
> >
> > Does anyone know of bugfixes related to this problem, or have any
> > clues?
> >
> > Best wishes,
> > brandy

It sounds to me like it got into a (rare) state where it wasn't
chatting with the peer OSDs and didn't hear about the OSDMap change.
Perhaps we should add some sort of fail-safe where the OSDs ping the
mon periodically for a new map if everything seems (too) quiet...

sage
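
For illustration, here is a minimal standalone sketch (not actual Ceph
code) of the kind of fail-safe Sage describes: a watchdog that tracks
when the OSD last applied a map update and asks the monitor for the
latest map once things have been quiet for too long. All of the names
here (OsdMapWatchdog, note_map_update, the timeout values) are
hypothetical, and the "request" is just a callback standing in for
whatever the OSD would really do to renew its map subscription with
the mon.

    #include <chrono>
    #include <functional>
    #include <iostream>
    #include <thread>

    using Clock = std::chrono::steady_clock;

    // Hypothetical fail-safe: if no osdmap update has been seen for
    // quiet_timeout, proactively ask the monitor for the latest map
    // instead of waiting to hear about it from peer OSDs.
    class OsdMapWatchdog {
    public:
      OsdMapWatchdog(std::chrono::seconds quiet_timeout,
                     std::function<void()> request_map_from_mon)
          : quiet_timeout_(quiet_timeout),
            request_map_from_mon_(std::move(request_map_from_mon)),
            last_update_(Clock::now()) {}

      // Call whenever a new osdmap epoch is applied.
      void note_map_update() { last_update_ = Clock::now(); }

      // Call periodically, e.g. from the OSD's existing tick thread.
      void tick() {
        if (Clock::now() - last_update_ > quiet_timeout_) {
          request_map_from_mon_();      // fail-safe kick
          last_update_ = Clock::now();  // avoid hammering the mon
        }
      }

    private:
      std::chrono::seconds quiet_timeout_;
      std::function<void()> request_map_from_mon_;
      Clock::time_point last_update_;
    };

    int main() {
      // Simulated monitor request; a real OSD would renew its osdmap
      // subscription with the monitor here.
      OsdMapWatchdog wd(std::chrono::seconds(2), [] {
        std::cout << "quiet too long, asking mon for latest osdmap\n";
      });

      // Simulate a quiet period: no note_map_update() calls arrive,
      // so the watchdog fires on a later tick.
      for (int i = 0; i < 3; ++i) {
        std::this_thread::sleep_for(std::chrono::seconds(1));
        wd.tick();
      }
    }

A fail-safe like this would have bounded the outage described above:
even with peer connections wedged on the stale epoch-100 map, osd.8
would have fetched the newer map from the mon within one quiet-timeout
instead of staying stuck until a restart.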