Is there any specific log that indicates what was happening?

On 06/19/2018 09:56 PM, xiangyang yu wrote:
Hi cephers,

Recently I hit a problem in our production environment. My Ceph version is hammer 0.94.5 (old, I know). The osdmap held inside one OSD process stopped updating its epoch until the OSD was restarted. That OSD's log shows "wrong node" errors, because the actual peer address differs from the peer address recorded in the stale osdmap.

Before part of the network (both the public and cluster networks for a range of OSDs) went down, everything was working well; say the osdmap epoch was 100 at that point. Then that part of the network went down for 3-5 minutes. The affected OSDs (50 out of 156 total) were marked down by heartbeat check failures.

After the network recovered, all affected OSDs except one (let's say osd.8) came back online. osd.8 stayed down and would not come back up, although its process was still running. When I checked the osd.8 log, I found that its osdmap epoch was still 100 and never changed after the network failure, while the cluster's epoch had already advanced to something like 160.

Does anyone know of bugfixes related to this problem, or have any other clues?

Best wishes,
brandy
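For anyone debugging something similar, a quick way to compare the cluster's current osdmap epoch with the epoch the stuck daemon actually holds is sketched below. It assumes hammer-era defaults: osd.8 and the "wrong node" message come from the report above, the log path is the stock one, and the admin socket "status" command is assumed to be available on the OSD host.

    # Cluster-wide view: the first line of the dump reads "epoch <N>".
    ceph osd dump | head -1

    # The daemon's own view via its admin socket: "oldest_map" and
    # "newest_map" give the epoch range it holds. A newest_map stuck at
    # the pre-outage epoch (100 here) confirms it stopped consuming maps.
    ceph daemon osd.8 status

    # The symptom from the log: the messenger rejecting peers whose
    # addresses no longer match the stale map.
    grep 'wrong node' /var/log/ceph/ceph-osd.8.log | tail -5

If newest_map on the daemon matches the cluster epoch while the log still shows "wrong node", the problem is elsewhere; a frozen newest_map points at the map-update path in the OSD itself.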