Re: [OSDMAP]osdmap did not update after network recovered from failure


On 06/21/2018 08:29 PM, Sage Weil wrote:
On Thu, 21 Jun 2018, cgxu519 wrote:
On 06/20/2018 10:45 PM, Sage Weil wrote:
On Wed, 20 Jun 2018, cgxu519 wrote:
Is there any specific log that indicates what was happening?


On 06/19/2018 09:56 PM, xiangyang yu wrote:
    Hi cephers,
      Recently I ran into a problem in our production environment.
       My Ceph version is Hammer 0.94.5 (admittedly very old).
       The osdmap in the OSD process did not advance its epoch until the OSD was
restarted.
       The OSD log displays "wrong node", because the actual peer address
is different from the peer address obtained from the old osdmap.
So first of all, I would like to know what is at fault here: the peer's address
or the troubled OSD itself?
More information?
The troubled osd has a very old osdmap, and when it connects to its peers
(using old addresses) it finds different, newer instances of osds there
instead.  It's a symptom of it having old address info.

I think the fix is to strategically place an osdmap_subscribe() call
somewhere where we find that our heartbeat checks are failing (either
immediately or after some time).  That will query the mon for the latest
osdmap and bring it back up to date.  It should do this periodically, but
not too frequently, to avoid loading the mon heavily in a large cluster.
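
The rate-limiting idea above might be modeled roughly as follows. This is
only an illustrative Python sketch, not Ceph code (Ceph itself is C++): the
class name OsdmapResubscriber, the min_interval parameter, and the
on_heartbeat_failure() hook are all hypothetical, and the subscribe callback
merely stands in for the osdmap_subscribe() call mentioned in the thread.

```python
import time
from typing import Callable, Optional


class OsdmapResubscriber:
    """Hypothetical model of rate-limited osdmap re-subscription (not Ceph code)."""

    def __init__(self, subscribe: Callable[[], None], min_interval: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._subscribe = subscribe        # stand-in for osdmap_subscribe()
        self._min_interval = min_interval  # assumed tunable, in seconds
        self._clock = clock                # injectable so the logic can be tested
        self._last_request: Optional[float] = None

    def on_heartbeat_failure(self) -> bool:
        """Ask the mon for a newer map unless we asked too recently.

        Returns True if a request was sent, False if it was suppressed.
        """
        now = self._clock()
        if (self._last_request is not None
                and now - self._last_request < self._min_interval):
            return False  # too soon: avoid loading the mon
        self._last_request = now
        self._subscribe()
        return True
```

With a 30-second interval, for example, a burst of failed heartbeat checks
would trigger at most one map request per 30 seconds, however many peers
report failures.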
Could we make this call only when "wrong node" is detected?
Would that make sense?

Thanks,
Chengguang.





