Dear Cephers,
After a succession of unfortunate events, we suffered a complete
datacenter blackout today.
Ceph came back up _nearly_ perfectly: health was OK and all services
were online, but we were seeing weird problems. Weird as in, sometimes
we could map RBDs and sometimes not, and sometimes we could use CephFS
and sometimes we could not...
It turns out that some OSDs (I'd say about 5%) came back advertising
their cluster network address as their public address and were
therefore not reachable by clients.
I do not see any pattern as to why some OSDs are affected and others
are not; the affected ones are spread across nearly all nodes. Here is
an example:
osd.45 up in weight 1 up_from 184143 up_thru 184164 down_at 184142
last_clean_interval [182655,184103)
[v2:192.168.222.20:6834/1536394698,v1:192.168.222.20:6842/1536394698]
[v2:192.168.222.20:6848/1536394698,v1:192.168.222.20:6853/1536394698]
exists,up 002326c9
The first pair of brackets should contain a public address. Our
cluster network is 192.168.222.0/24, which is of course only reachable
on the Ceph-internal switch.
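In case anyone wants to check their own cluster: something like the
following should list OSDs whose advertised public address falls inside
the cluster network. This is only a rough sketch; the field names
(osds[].public_addrs.addrvec[].addr) are from memory of the Quincy
"ceph osd dump --format json" output and may need adjusting.

import ipaddress
import json
import subprocess

# Our cluster network; only reachable on the internal switch.
CLUSTER_NET = ipaddress.ip_network("192.168.222.0/24")

dump = json.loads(
    subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
)

for osd in dump.get("osds", []):
    for entry in osd.get("public_addrs", {}).get("addrvec", []):
        ip = ipaddress.ip_address(entry["addr"].rsplit(":", 1)[0])
        if ip in CLUSTER_NET:
            # Restarting the daemon (e.g. "ceph orch daemon restart
            # osd.<id>") was enough to fix it for us.
            print(f"osd.{osd['osd']} advertises {entry['addr']} as public")
            break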
Simply restarting the affected OSDs solved the problem, so I am not
really asking for help troubleshooting this; I would just like to
understand whether there is a reasonable explanation.
My guess would be some kind of race condition when the network
interfaces came up, but then again, why on only ~5% of all OSDs?
Anyway, I'm tired; I hope this mail is somewhat understandable.
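As far as I understand, the public address an OSD advertises depends on
what public_network is set to at the time the daemon starts, so the
cluster config is probably relevant to that theory. A quick way to see
what is configured (a rough sketch, using only the standard
"ceph config get" command):

import subprocess

# Print the networks the cluster expects OSDs to bind to.
for option in ("public_network", "cluster_network"):
    value = subprocess.check_output(
        ["ceph", "config", "get", "osd", option], text=True
    ).strip()
    print(f"{option}: {value or '(not set)'}")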
We are running Ceph 17.2.7, deployed with cephadm on Docker.
If you have any ideas about the cause, please let me know. I have not
seen this issue when gracefully rebooting the nodes.
Best
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx