Re: Help with "27 osd(s) are not reachable" when also "27 osds: 27 up.. 27 in"

Hi Frédéric

All was normal on v18; since 19.2 the problem persists even though the two addresses are different:

cluster_network global: fc00:1000:0:b00::/64

public_network global: fc00:1002:c7::/64

Also, after rebooting everything in sequence, the only complaint is that the 27 OSDs, which are all up, in, and working normally, are still reported as "not reachable".

~# ceph -s
  cluster:
    id:     ...
    health: HEALTH_ERR
            27 osd(s) are not reachable

  services:
...

    osd: 27 osds: 27 up (since 6m), 27 in (since 12d)
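
For reference, the configured networks and the addresses each OSD has actually registered can be cross-checked from the CLI; a minimal sketch (the grep patterns are only illustrative):

~# ceph config dump | grep -E 'cluster_network|public_network'   # what the cluster has configured
~# ceph osd dump | grep '^osd\.'                                 # per-OSD public/cluster addresses as seen by the monitors

If the new OSD_UNREACHABLE check compares those registered addresses against the configured public_network, as the tracker issue below suggests, a mismatch there would explain a false alarm on an otherwise healthy cluster.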

On 10/16/24 03:44, Frédéric Nass wrote:
Hi Harry,

Do you have a 'cluster_network' set to the same subnet as the 'public_network', as in the issue [1]? It doesn't make much sense to set up a cluster_network when it isn't different from the public_network.
Maybe that's what triggers the OSD_UNREACHABLE check recently added here [2] (even though the code seems to only consider IPv4 addresses, which seems odd, btw).

I suggest removing the cluster_network and restarting a single OSD to see if the counter decreases.
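
A minimal sketch of that (assuming a cephadm/orchestrator-managed deployment; osd.0 is just an example id):

~# ceph config rm global cluster_network    # drop the cluster_network setting
~# ceph orch daemon restart osd.0           # restart one OSD, then watch whether the unreachable count drops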

Regards,
Frédéric.

[1] https://tracker.ceph.com/issues/67517
[2] https://github.com/ceph/ceph/commit/5b70a6b92079f9e9d5d899eceebc1a62dae72997

----- On 16 Oct 24, at 3:02, Harry G Coin <hgcoin@xxxxxxxxx> wrote:

Thanks for the notion!  I did that; the result was no change to the
problem, but with the added ceph -s complaint "Public/cluster network
defined, but can not be found on any host" -- with otherwise totally
normal cluster operations.  Go figure.  How can ceph -s be so totally
wrong, with the dashboard reporting critical problems, when there are
none?  Makes me really wonder whether any actual testing on IPv6 is
ever done before releases are marked 'stable'.

HC


On 10/14/24 21:04, Anthony D'Atri wrote:
Try failing over to a standby mgr
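
In practice that is a single command (assuming at least one standby mgr is running):

~# ceph mgr fail    # the active mgr steps down and a standby takes over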

On Oct 14, 2024, at 9:33 PM, Harry G Coin <hgcoin@xxxxxxxxx> wrote:

I need help removing a useless "HEALTH_ERR" in 19.2.0 on a fully dual-stack
docker setup with Ceph using IPv6, public and private nets separated, and a
few servers.   After upgrading from an error-free v18 release, I can't get rid of
the 'health err' owing to the report that all OSDs are unreachable.  Meanwhile
ceph -s reports all OSDs up and in, and the cluster otherwise operates normally.
I don't care whether it's 'a real fix'; I just need to remove the false error
report.   Any ideas?
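
Not discussed in the thread, but the generic way to hide a specific health alert while waiting for a real fix is the health-mute mechanism, e.g.:

~# ceph health mute OSD_UNREACHABLE 1w    # suppress this alert for a week; undo with 'ceph health unmute OSD_UNREACHABLE'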

Thanks

Harry Coin

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



