Hi,
this is only a theory, not a proven answer. But the orchestrator
does automatically reconfigure daemons depending on the
circumstances. So my theory is that some of the OSD nodes didn't
respond via the public network anymore, so Ceph tried to use the
cluster network as a fallback. The other way around is more common:
if you don't have a cluster network configured at all, you see logs
stating "falling back to public interface" (or similar). If the
orchestrator did reconfigure the daemons, it would have been logged
by the active mgr, and the result would be a different ceph.conf for
the daemons in /var/lib/ceph/{FSID}/osd.{OSD_ID}/config. If you
still have the mgr logs from after the outage, you might find some
clues there.
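For example, something like this could help (a rough sketch; the
exact log wording and which keys appear in the minimal config vary
between versions):

    # look at the config the OSD container actually received
    cat /var/lib/ceph/{FSID}/osd.{OSD_ID}/config

    # search the cephadm log channel for reconfigure events
    ceph log last 10000 debug cephadm | grep -i reconfig

If the in-memory cluster log has already rotated, grepping the
active mgr's container logs (e.g. via journalctl on the mgr host)
is an alternative.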
Regards,
Eugen
Quoting mailing-lists <mailing-lists@xxxxxxxxx>:
Dear Cephers,
after a succession of unfortunate events, we suffered a complete
datacenter blackout today.
Ceph came back up _nearly_ perfectly. Health was OK and all
services were online, but we were having weird problems. Weird as
in: we could sometimes map RBDs and sometimes not, and sometimes we
could use CephFS and sometimes we could not...
Turns out, some OSDs (I'd say 5%) came back with their cluster IP
as their public address and thus were not reachable.
I see no pattern in why some OSDs are faulty and others are not;
the fault is spread over nearly all nodes. This is an example:
    osd.45 up in weight 1 up_from 184143 up_thru 184164 down_at 184142
      last_clean_interval [182655,184103)
      [v2:192.168.222.20:6834/1536394698,v1:192.168.222.20:6842/1536394698]
      [v2:192.168.222.20:6848/1536394698,v1:192.168.222.20:6853/1536394698]
      exists,up 002326c9
The first pair of brackets should contain a public IP. Our cluster
network is 192.168.222.0/24, which is of course only reachable on
the Ceph-internal switch.
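(For reference, this is how the affected OSDs could be spotted;
just an idea, assuming jq is installed and that front_addr in
"ceph osd metadata" is the public-facing address:

    # list IDs of OSDs whose public address is in the cluster network
    ceph osd metadata | jq -r \
      '.[] | select(.front_addr | contains("192.168.222")) | .id'
)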
Simply restarting the affected OSDs solved this problem...
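E.g. a sketch of how that restart could be scripted via the
orchestrator (same hypothetical jq filter as above):

    for id in $(ceph osd metadata | jq -r \
        '.[] | select(.front_addr | contains("192.168.222")) | .id'); do
      ceph orch daemon restart osd.$id
    done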
So I am not really asking for help troubleshooting this; I would
just like to understand whether there is a reasonable explanation.
My guess would be some kind of race condition when the interfaces
came up, but then again, why on ~5% of all OSDs? ... Anyway, I'm
tired; I hope this mail is somewhat understandable.
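One thing I might try, if it really is a race at interface
bring-up, is pinning both networks explicitly in the config
database so an OSD cannot bind the wrong interface. Just a sketch,
with 10.0.0.0/24 as a placeholder for our real public subnet:

    ceph config set global public_network 10.0.0.0/24    # placeholder
    ceph config set global cluster_network 192.168.222.0/24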
We are running Ceph 17.2.7, deployed with cephadm on Docker.
If you have any ideas about the cause of this, please let me know.
I have not seen this issue when gracefully rebooting the nodes.
Best
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx