Thank you Eugen! After finding what the target name actually was, it all
worked like a charm.

Best regards, Mikael

On Wed, Jun 21, 2023 at 11:05 AM Eugen Block <eblock@xxxxxx> wrote:

> Hi,
>
> > Will that try to be smart and just restart a few at a time to keep
> > things up and available. Or will it just trigger a restart everywhere
> > simultaneously.
>
> Basically, that's what happens for example during an upgrade if
> services are restarted. It's designed to be a rolling upgrade
> procedure, since restarting all daemons of a specific service at the
> same time would cause an interruption. So the daemons are scheduled to
> restart and the mgr decides when it's safe to restart the next one
> (this is a test cluster started on Nautilus, but it's on Quincy now):
>
> nautilus:~ # ceph orch restart osd.osd-hdd-ssd
> Scheduled to restart osd.5 on host 'nautilus'
> Scheduled to restart osd.0 on host 'nautilus'
> Scheduled to restart osd.2 on host 'nautilus'
> Scheduled to restart osd.1 on host 'nautilus2'
> Scheduled to restart osd.4 on host 'nautilus2'
> Scheduled to restart osd.7 on host 'nautilus2'
> Scheduled to restart osd.3 on host 'nautilus3'
> Scheduled to restart osd.8 on host 'nautilus3'
> Scheduled to restart osd.6 on host 'nautilus3'
>
> When it comes to OSDs, it's possible (or even likely) that multiple
> OSDs are restarted at the same time, depending on the pools (and their
> replication size) they are part of. But Ceph tries to avoid "inactive
> PGs", which would be critical, of course. An edge case would be a pool
> with size 1, where restarting an OSD would cause an inactive PG until
> the OSD is up again. But since size 1 would be a bad idea anyway
> (except for testing purposes), you'd have to live with that.
> If you have the option, I'd recommend creating a test cluster and
> playing around with these things to get a better understanding,
> especially when it comes to upgrade tests etc.
>
> > I guess in my current scenario, restarting one host at a time makes
> > the most sense, with a
> > systemctl restart ceph-{fsid}.target
> > and then checking that "ceph -s" says OK before proceeding to the next
>
> Yes, if your crush-failure-domain is host, that should be safe, too.
>
> Regards,
> Eugen
>
> Zitat von Mikael Öhman <micketeer@xxxxxxxxx>:
>
> > The documentation very briefly explains a few core commands for
> > restarting things:
> > https://docs.ceph.com/en/quincy/cephadm/operations/#starting-and-stopping-daemons
> > but I feel I'm lacking quite some details of what is safe to do.
> >
> > I have a system in production, clusters connected via CephFS and some
> > shared block devices. We would like to restart some things due to some
> > new network configurations. Going daemon by daemon would take forever,
> > so I'm curious as to what happens if one tries the command:
> >
> > ceph orch restart osd
> >
> > Will that try to be smart and just restart a few at a time to keep
> > things up and available. Or will it just trigger a restart everywhere
> > simultaneously.
> >
> > I guess in my current scenario, restarting one host at a time makes
> > the most sense, with a
> > systemctl restart ceph-{fsid}.target
> > and then checking that "ceph -s" says OK before proceeding to the next
> > host, but I'm still curious as to what the "ceph orch restart xxx"
> > command would do (but not enough to try it out in production)
> >
> > Best regards, Mikael
> > Chalmers University of Technology
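
A minimal sketch of the orchestrator path discussed in the thread: the
name passed to "ceph orch restart" must be a service name as reported by
"ceph orch ls" (the "target name" Mikael mentions), not an individual
daemon. The service name osd.osd-hdd-ssd below is the one from Eugen's
example output; substitute whatever "ceph orch ls" reports on your own
cluster:

    # List the orchestrator-managed services; the service names shown
    # here are what "ceph orch restart <service>" expects.
    ceph orch ls

    # Schedule a rolling restart of all daemons belonging to that service
    # (service name taken from Eugen's example above; yours will differ).
    ceph orch restart osd.osd-hdd-ssd

    # Watch the individual daemons as the mgr restarts them one by one.
    ceph orch ps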
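
And a minimal sketch of the per-host approach (restart the cephadm
systemd target on one host, wait for the cluster to report healthy, then
move on), assuming a host-level crush failure domain and ssh access from
an admin node; the host names are placeholders, not taken from the
thread:

    #!/bin/bash
    # Rolling per-host restart of all cephadm-managed Ceph daemons.
    FSID=$(ceph fsid)

    for host in ceph-host1 ceph-host2 ceph-host3; do  # placeholder names
        # Restart every Ceph daemon on this host via its systemd target.
        ssh "$host" "systemctl restart ceph-${FSID}.target"

        # Give the daemons a moment to actually go down before polling.
        sleep 30

        # Proceed to the next host only once "ceph health" is HEALTH_OK
        # again (the "ceph -s says OK" check from the thread).
        until ceph health | grep -q HEALTH_OK; do
            sleep 10
        done
    done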