Hi,
this is only a theory, not a proven answer. But the orchestrator
does automatically reconfigure daemons depending on the
circumstances. So my theory is that some of the OSD nodes didn't
respond via the public network anymore, so Ceph tried to use the
cluster network as a fallback. The other way around is more common:
if you don't have a cluster network configured at all, you see logs
stating "falling back to public interface" (or similar). If the
orchestrator did reconfigure the daemons, it would have been logged
by the active mgr, and the result would be a different ceph.conf for
the daemons in /var/lib/ceph/{FSID}/osd.{OSD_ID}/config. If you
still have the mgr logs from after the outage, you might find some
clues there.
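For example, something like this could help (a rough sketch; the
exact log wording and which keys appear in the minimal config vary
between versions):

    # look at the config the OSD container actually received
    cat /var/lib/ceph/{FSID}/osd.{OSD_ID}/config

    # search the cephadm log channel for reconfigure events
    ceph log last 10000 debug cephadm | grep -i reconfig

If the in-memory cluster log has already rotated, grepping the
active mgr's container logs (e.g. via journalctl on the mgr host)
is an alternative.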
Regards,
Eugen
Quoting mailing-lists <mailing-lists@xxxxxxxxx>:
Dear Cephers,
after a succession of unfortunate events, we suffered a complete
datacenter blackout today.
Ceph came back up _nearly_ perfectly. Health was OK and all
services were online, but we were having weird problems. Weird as
in: we could sometimes map RBDs and sometimes not, and sometimes we
could use CephFS and sometimes we could not...
Turns out, some OSDs (I'd say 5%) came back with their cluster IP
as their public address and thus were not reachable.
I see no pattern in why some OSDs are faulty and others are not;
the fault is spread over nearly all nodes. This is an example:
    osd.45 up in weight 1 up_from 184143 up_thru 184164 down_at 184142
      last_clean_interval [182655,184103)
      [v2:192.168.222.20:6834/1536394698,v1:192.168.222.20:6842/1536394698]
      [v2:192.168.222.20:6848/1536394698,v1:192.168.222.20:6853/1536394698]
      exists,up 002326c9
The first pair of brackets should contain a public IP. Our cluster
network is 192.168.222.0/24, which is of course only reachable on
the Ceph-internal switch.
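(For reference, this is how the affected OSDs could be spotted;
just an idea, assuming jq is installed and that front_addr in
"ceph osd metadata" is the public-facing address:

    # list IDs of OSDs whose public address is in the cluster network
    ceph osd metadata | jq -r \
      '.[] | select(.front_addr | contains("192.168.222")) | .id'
)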
Simply restarting the affected OSDs solved this problem...
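E.g. a sketch of how that restart could be scripted via the
orchestrator (same hypothetical jq filter as above):

    for id in $(ceph osd metadata | jq -r \
        '.[] | select(.front_addr | contains("192.168.222")) | .id'); do
      ceph orch daemon restart osd.$id
    done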
So I am not really asking for help troubleshooting this; I would
just like to understand whether there is a reasonable explanation.
My guess would be some kind of race condition when the interfaces
came up, but then again, why on ~5% of all OSDs? ... Anyway, I'm
tired; I hope this mail is somewhat understandable.
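One thing I might try, if it really is a race at interface
bring-up, is pinning both networks explicitly in the config
database so an OSD cannot bind the wrong interface. Just a sketch,
with 10.0.0.0/24 as a placeholder for our real public subnet:

    ceph config set global public_network 10.0.0.0/24    # placeholder
    ceph config set global cluster_network 192.168.222.0/24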
We are running Ceph 17.2.7, deployed with cephadm on Docker.
If you have any ideas about the cause of this, please let me know.
I have not seen this issue when gracefully rebooting the nodes.
Best
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx