Dear Cephers,
After a succession of unfortunate events, we suffered a complete
datacenter blackout today.
Ceph came back up _nearly_ perfectly: health was OK and all services
were online, but we were seeing weird problems. Weird as in, sometimes
we could map RBDs and sometimes not, and sometimes we could use CephFS
and sometimes we could not...
It turns out that some OSDs (I'd say about 5%) came back advertising
their cluster network address as their public address and were
therefore not reachable by clients.
I do not see any pattern as to why some OSDs are affected and others
are not; the affected ones are spread across nearly all nodes. Here is
an example:
osd.45 up in weight 1 up_from 184143 up_thru 184164 down_at 184142
last_clean_interval [182655,184103)
[v2:192.168.222.20:6834/1536394698,v1:192.168.222.20:6842/1536394698]
[v2:192.168.222.20:6848/1536394698,v1:192.168.222.20:6853/1536394698]
exists,up 002326c9
The first pair of brackets should contain a public address. Our
cluster network is 192.168.222.0/24, which is of course only reachable
on the Ceph-internal switch.
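In case anyone wants to check their own cluster: something like the
following should list OSDs whose advertised public address falls inside
the cluster network. This is only a rough sketch; the field names
(osds[].public_addrs.addrvec[].addr) are from memory of the Quincy
"ceph osd dump --format json" output and may need adjusting.

import ipaddress
import json
import subprocess

# Our cluster network; only reachable on the internal switch.
CLUSTER_NET = ipaddress.ip_network("192.168.222.0/24")

dump = json.loads(
    subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
)

for osd in dump.get("osds", []):
    for entry in osd.get("public_addrs", {}).get("addrvec", []):
        ip = ipaddress.ip_address(entry["addr"].rsplit(":", 1)[0])
        if ip in CLUSTER_NET:
            # Restarting the daemon (e.g. "ceph orch daemon restart
            # osd.<id>") was enough to fix it for us.
            print(f"osd.{osd['osd']} advertises {entry['addr']} as public")
            break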
Simply restarting the affected OSDs solved the problem, so I am not
really asking for help troubleshooting this; I would just like to
understand whether there is a reasonable explanation.
My guess would be some kind of race condition when the network
interfaces came up, but then again, why on only ~5% of all OSDs?
Anyway, I'm tired; I hope this mail is somewhat understandable.
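As far as I understand, the public address an OSD advertises depends on
what public_network is set to at the time the daemon starts, so the
cluster config is probably relevant to that theory. A quick way to see
what is configured (a rough sketch, using only the standard
"ceph config get" command):

import subprocess

# Print the networks the cluster expects OSDs to bind to.
for option in ("public_network", "cluster_network"):
    value = subprocess.check_output(
        ["ceph", "config", "get", "osd", option], text=True
    ).strip()
    print(f"{option}: {value or '(not set)'}")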
We are running Ceph 17.2.7, deployed with cephadm on Docker.
If you have any ideas about the cause, please let me know. I have not
seen this issue when gracefully rebooting the nodes.
Best
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx