Hi,
Will that try to be smart and just restart a few at a time to keep things
up and available? Or will it just trigger a restart everywhere
simultaneously?
Basically, that's what happens during an upgrade, for example, when
services are restarted. It's designed as a rolling upgrade procedure,
so restarting all daemons of a specific service at the same time would
cause an interruption. Instead, the daemons are scheduled to restart
and the mgr decides when it's safe to restart the next one (this is a
test cluster that was started on Nautilus but is on Quincy now):
nautilus:~ # ceph orch restart osd.osd-hdd-ssd
Scheduled to restart osd.5 on host 'nautilus'
Scheduled to restart osd.0 on host 'nautilus'
Scheduled to restart osd.2 on host 'nautilus'
Scheduled to restart osd.1 on host 'nautilus2'
Scheduled to restart osd.4 on host 'nautilus2'
Scheduled to restart osd.7 on host 'nautilus2'
Scheduled to restart osd.3 on host 'nautilus3'
Scheduled to restart osd.8 on host 'nautilus3'
Scheduled to restart osd.6 on host 'nautilus3'
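If you want to follow along while the scheduled restarts happen, you
can poll the daemon list; a minimal sketch (the --daemon-type filter
is optional, a plain "ceph orch ps" works as well):
ceph orch ps --daemon-type osd   # the STATUS column shows which daemons restarted recently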
When it comes to OSDs, it's possible (or even likely) that multiple
OSDs are restarted at the same time, depending on the pools (and their
replication size) they are part of. But Ceph tries to avoid "inactive
PGs", which is critical, of course. An edge case would be a pool with
size 1, where restarting an OSD would cause an inactive PG until that
OSD is up again. But since size 1 would be a bad idea anyway (except
for testing purposes), you'd have to live with that.
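If you want to check beforehand which pools could be affected, the
plain status commands are enough; for example:
ceph osd pool ls detail   # shows "replicated size N min_size M" per pool, look for size 1
ceph pg stat              # summary of PG states, anything not active needs attention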
If you have the option, I'd recommend creating a test cluster and
playing around with these things to get a better understanding,
especially when it comes to upgrade tests etc.
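For reference, bootstrapping such a lab with cephadm only takes a few
commands; a minimal sketch (the IPs and hostnames are placeholders for
your own lab network):
cephadm bootstrap --mon-ip 192.168.122.10    # on the first test node
ceph orch host add nautilus2 192.168.122.11  # add further test nodes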
I guess in my current scenario, restarting one host at a time makes
the most sense, with a
systemctl restart ceph-{fsid}.target
and then checking that "ceph -s" says OK before proceeding to the next
host.
Yes, if your crush-failure-domain is host, that should be safe, too.
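A minimal sketch of that host-by-host procedure (the hostnames are
placeholders, it assumes passwordless SSH from the admin node, and you
may want a stricter health check than just HEALTH_OK):
FSID=$(ceph fsid)
for host in nautilus nautilus2 nautilus3; do
    ssh "$host" "systemctl restart ceph-${FSID}.target"
    # wait until the cluster reports HEALTH_OK before moving on
    while ! ceph health | grep -q HEALTH_OK; do
        sleep 10
    done
done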
Regards,
Eugen
Quoting Mikael Öhman <micketeer@xxxxxxxxx>:
The documentation very briefly explains a few core commands for restarting
things:
https://docs.ceph.com/en/quincy/cephadm/operations/#starting-and-stopping-daemons
but I feel I'm lacking quite a few details about what is safe to do.
I have a system in production: clusters connected via CephFS and some
shared block devices.
We would like to restart some things due to some new network
configurations. Going daemon by daemon would take forever, so I'm curious
as to what happens if one tries the command:
ceph orch restart osd
Will that try to be smart and just restart a few at a time to keep things
up and available? Or will it just trigger a restart everywhere
simultaneously?
I guess in my current scenario, restarting one host at a time makes the
most sense, with a
systemctl restart ceph-{fsid}.target
and then checking that "ceph -s" says OK before proceeding to the next
host, but I'm still curious as to what the "ceph orch restart xxx" command
would do (but not enough to try it out in production)
Best regards, Mikael
Chalmers University of Technology
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx