Hi,

This type of incident is often resolved by setting the public_network
option at the "global" scope in the configuration:

ceph config set global public_network a:b:c:d::/64

A quick way to verify what the daemons actually picked up is sketched
below, after the quoted thread.

On Fri, Jun 21, 2024 at 03:36, Eugen Block <eblock@xxxxxx> wrote:

> Hi,
>
> this is only a theory, not a proven answer or anything, but the
> orchestrator does automatically reconfigure daemons depending on the
> circumstances. So my theory is that some of the OSD nodes no longer
> responded via the public network, and Ceph tried to use the cluster
> network as a fallback. The other way around is more common: if you
> don't have a cluster network configured at all, you see logs stating
> "falling back to public interface" (or similar). If the orchestrator
> did reconfigure the daemons, it would have been logged in the active
> mgr, and the result would be a different ceph.conf for the daemons in
> /var/lib/ceph/{FSID}/osd.{OSD_ID}/config. If you still have the mgr
> logs from after the outage, you might find some clues.
>
> Regards,
> Eugen
>
> Quoting mailing-lists <mailing-lists@xxxxxxxxx>:
>
> > Dear Cephers,
> >
> > after a succession of unfortunate events, we suffered a complete
> > datacenter blackout today.
> >
> > Ceph came back up _nearly_ perfectly. The health was OK and all
> > services were online, but we were having weird problems. Weird as
> > in: we could sometimes map RBDs and sometimes not, and sometimes we
> > could use CephFS and sometimes we could not...
> >
> > Turns out, some OSDs (I'd say 5%) came back with the cluster IP
> > address as their public IP and thus were not reachable.
> >
> > I do not see any pattern in why some OSDs are faulty and others
> > are not; the fault is spread over nearly all nodes. This is an
> > example:
> >
> > osd.45 up in weight 1 up_from 184143 up_thru 184164 down_at
> > 184142 last_clean_interval [182655,184103)
> > [v2:192.168.222.20:6834/1536394698,v1:192.168.222.20:6842/1536394698]
> > [v2:192.168.222.20:6848/1536394698,v1:192.168.222.20:6853/1536394698]
> > exists,up 002326c9
> >
> > The first pair of brackets [] should contain a public IP. Our
> > cluster network is 192.168.222.0/24, which is of course only
> > available on the Ceph-internal switch.
> >
> > Simply restarting the affected OSDs solved the problem, so I am not
> > really asking for help troubleshooting this; I would just like to
> > understand whether there is a reasonable explanation.
> >
> > My guess would be some kind of race condition when the interfaces
> > came up, but then again, why on only ~5% of all OSDs? ... Anyway,
> > I'm tired; I hope this mail is somewhat understandable.
> >
> > We are running Ceph 17.2.7, deployed with cephadm on Docker.
> >
> > If you have any ideas about the cause of this, please let me know.
> > I have not seen this issue when gracefully rebooting the nodes.
> >
> > Best
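
For anyone checking this after a similar outage, a minimal sketch of
how one might compare the cluster-wide setting with what a given
containerized OSD was actually started with. This assumes a
cephadm-managed cluster like the one described above; the {FSID} and
{OSD_ID} placeholders are the same ones Eugen mentions and need to be
filled in for your deployment:

# Cluster-wide option as stored in the config database
# (the "global" scope referred to above also shows up here):
ceph config get osd public_network
ceph config dump | grep -E 'public_network|cluster_network'

# Per-daemon config the orchestrator rendered for one OSD,
# which is what the daemon actually read at startup:
cat /var/lib/ceph/{FSID}/osd.{OSD_ID}/config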
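
And a rough way to spot OSDs that registered a cluster-network address
as their public address, based on the osd dump line quoted above: the
public address is the first bracketed field on each "osd.N" line, so
an awk filter on that column works if your output has the same layout.
The 192.168.222 prefix and osd.45 are taken from the original post,
so adjust both for your own cluster:

# OSDs whose advertised public address falls inside the cluster network
# ($14 is the first [v2:...,v1:...] block in the example output above):
ceph osd dump | awk '$1 ~ /^osd\./ && $14 ~ /192\.168\.222\./ {print $1, $14}'

# Restart an affected OSD through the orchestrator, e.g. osd.45:
ceph orch daemon restart osd.45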