Re: wrong public_ip after blackout / power outage

Hi,

this is only a theory, not a proven answer. But the orchestrator does automatically reconfigure daemons depending on the circumstances. So my theory is that some of the OSD nodes didn't respond via the public network anymore, and ceph tried to use the cluster network as a fallback. The other way around is more common: if you don't have a cluster network configured at all, you see logs stating "falling back to public interface" (or similar).

If the orchestrator did reconfigure the daemons, it would have been logged by the active mgr, and the result would be a different ceph.conf for the daemons in /var/lib/ceph/{FSID}/osd.{OSD_ID}/config. If you still have the mgr logs from after the outage, you might find some clues.
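
A quick way to check would be something along these lines (the paths follow the standard cephadm layout; the mgr daemon name placeholder and the grep pattern are my assumptions, adjust FSID, OSD id and names to your cluster):

  # config that cephadm rendered for one affected OSD
  cat /var/lib/ceph/{FSID}/osd.{OSD_ID}/config

  # what the monitors currently have configured for that OSD
  ceph config get osd.{OSD_ID} public_network
  ceph config get osd.{OSD_ID} cluster_network

  # list the mgr daemons, then search the active mgr's log
  # (run on the host where that mgr runs) for reconfigure events
  ceph orch ps --daemon-type mgr
  cephadm logs --name mgr.{DAEMON_NAME} | grep -i reconfig

If cephadm did redeploy or reconfigure OSDs after the outage, it should show up there.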

Regards,
Eugen

Quoting mailing-lists <mailing-lists@xxxxxxxxx>:

Dear Cephers,

After a succession of unfortunate events, we suffered a complete datacenter blackout today.


Ceph came back up _nearly_ perfectly. Health was OK and all services were online, but we were seeing weird problems. Weird as in: sometimes we could map RBDs and sometimes not, and sometimes we could use CephFS and sometimes we could not...

It turns out that some OSDs (I'd say 5%) came back with their cluster_ip address as their public_ip and thus were not reachable.

I do not see any pattern as to why some OSDs are faulty and others are not; the fault is spread over nearly all nodes. This is an example:

osd.45 up   in  weight 1 up_from 184143 up_thru 184164 down_at 184142 last_clean_interval [182655,184103) [v2:192.168.222.20:6834/1536394698,v1:192.168.222.20:6842/1536394698] [v2:192.168.222.20:6848/1536394698,v1:192.168.222.20:6853/1536394698] exists,up 002326c9

This should have a public_ip in the first pair of brackets []. Our cluster network is 192.168.222.0/24, which is of course only available on the ceph-internal switch.
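
To find all affected OSDs, something like the following should work on the osd dump output (the awk field number is an assumption based on the line format shown above, where the public address is the first bracket pair; adjust the subnet to your own cluster network):

  # print OSD ids whose public (first) address bracket points into the cluster subnet
  ceph osd dump | awk '$1 ~ /^osd\./ && $14 ~ /192\.168\.222\./ {print $1}'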

Simply restarting the affected OSDs solved the problem... So I am not really asking for your help troubleshooting this; I would just like to understand if there is a reasonable explanation.
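
For reference, under cephadm the affected daemons can be restarted either through the orchestrator or via their systemd units, e.g. (osd.45 is just the example from above, FSID is a placeholder):

  # restart one affected OSD through the orchestrator
  ceph orch daemon restart osd.45

  # or directly via the systemd unit on its host
  systemctl restart ceph-{FSID}@osd.45.service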

My guess would be some kind of race condition when the interfaces came up, but then again, why on ~5% of all OSDs? ... Anyway, I'm tired; I hope this mail is somewhat understandable.


We are running Ceph 17.2.7, deployed with cephadm on Docker.


If you have any ideas about the cause of this, please let me know. I have not seen this issue when gracefully rebooting the nodes.


Best



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



