On Wed, 20 Jun 2018, cgxu519 wrote:
> Is there any specific log indicating what was happening?
>
> On 06/19/2018 09:56 PM, xiangyang yu wrote:
> > Hi cephers,
> >
> > Recently I met a problem in our production environment.
> > My ceph version is hammer 0.94.5 (it's too old, though).
> >
> > The osdmap (in the osd process) did not update its epoch until the
> > osd was restarted. The osd log displays "wrong node", because the
> > actual peer address is different from the peer address taken from
> > the old osdmap.
> >
> > Before parts of the network (both the public and cluster networks
> > for a range of osds) went down, everything was working well; say
> > the osdmap epoch was 100 at that time. Then parts of the network
> > (both public and cluster) went down for 3~5 minutes. The affected
> > osds (50 out of 156 osds were affected by the failed network) were
> > marked down by heartbeat check failures.
> >
> > After the network recovered, all affected osds except one (let's
> > say osd.8) came back online. osd.8 stayed down and would not come
> > online although the osd.8 process was running. When I checked the
> > osd.8 log, I found that its osdmap epoch was still 100 and did not
> > change after the network failure, but in the ceph cluster the epoch
> > had increased to something bigger, like 160.
> >
> > Does anyone know of bugfixes related to this problem, or have any
> > clues?
> >
> > Best wishes,
> > brandy

It sounds to me like it got into a (rare) state where it wasn't
chatting with the peer OSDs and didn't hear about the OSDMap change.
Perhaps we should add some sort of fail-safe where the OSDs ping the
mon periodically for a new map if everything seems (too) quiet...

sage
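
For illustration, here is a minimal standalone sketch (not actual Ceph
code) of the kind of fail-safe Sage describes: a watchdog that tracks
when the OSD last applied a map update and asks the monitor for the
latest map once things have been quiet for too long. All of the names
here (OsdMapWatchdog, note_map_update, the timeout values) are
hypothetical, and the "request" is just a callback standing in for
whatever the OSD would really do to renew its map subscription with
the mon.

    #include <chrono>
    #include <functional>
    #include <iostream>
    #include <thread>

    using Clock = std::chrono::steady_clock;

    // Hypothetical fail-safe: if no osdmap update has been seen for
    // quiet_timeout, proactively ask the monitor for the latest map
    // instead of waiting to hear about it from peer OSDs.
    class OsdMapWatchdog {
    public:
      OsdMapWatchdog(std::chrono::seconds quiet_timeout,
                     std::function<void()> request_map_from_mon)
          : quiet_timeout_(quiet_timeout),
            request_map_from_mon_(std::move(request_map_from_mon)),
            last_update_(Clock::now()) {}

      // Call whenever a new osdmap epoch is applied.
      void note_map_update() { last_update_ = Clock::now(); }

      // Call periodically, e.g. from the OSD's existing tick thread.
      void tick() {
        if (Clock::now() - last_update_ > quiet_timeout_) {
          request_map_from_mon_();      // fail-safe kick
          last_update_ = Clock::now();  // avoid hammering the mon
        }
      }

    private:
      std::chrono::seconds quiet_timeout_;
      std::function<void()> request_map_from_mon_;
      Clock::time_point last_update_;
    };

    int main() {
      // Simulated monitor request; a real OSD would renew its osdmap
      // subscription with the monitor here.
      OsdMapWatchdog wd(std::chrono::seconds(2), [] {
        std::cout << "quiet too long, asking mon for latest osdmap\n";
      });

      // Simulate a quiet period: no note_map_update() calls arrive,
      // so the watchdog fires on a later tick.
      for (int i = 0; i < 3; ++i) {
        std::this_thread::sleep_for(std::chrono::seconds(1));
        wd.tick();
      }
    }

A fail-safe like this would have bounded the outage described above:
even with peer connections wedged on the stale epoch-100 map, osd.8
would have fetched the newer map from the mon within one quiet-timeout
instead of staying stuck until a restart.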