Hi,
Will that try to be smart and just restart a few at a time to keep things
up and available? Or will it just trigger a restart everywhere
simultaneously?
Basically, that's what happens during an upgrade, for example, when
services are restarted. It's designed as a rolling upgrade procedure,
so restarting all daemons of a specific service at the same time would
cause an interruption. Instead, the daemons are scheduled to restart
and the mgr decides when it's safe to restart the next one (this is a
test cluster that was started on Nautilus but is on Quincy now):
nautilus:~ # ceph orch restart osd.osd-hdd-ssd
Scheduled to restart osd.5 on host 'nautilus'
Scheduled to restart osd.0 on host 'nautilus'
Scheduled to restart osd.2 on host 'nautilus'
Scheduled to restart osd.1 on host 'nautilus2'
Scheduled to restart osd.4 on host 'nautilus2'
Scheduled to restart osd.7 on host 'nautilus2'
Scheduled to restart osd.3 on host 'nautilus3'
Scheduled to restart osd.8 on host 'nautilus3'
Scheduled to restart osd.6 on host 'nautilus3'
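If you want to follow along while the scheduled restarts happen, you
can poll the daemon list; a minimal sketch (the --daemon-type filter
is optional, a plain "ceph orch ps" works as well):
ceph orch ps --daemon-type osd   # the STATUS column shows which daemons restarted recently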
When it comes to OSDs, it's possible (or even likely) that multiple
OSDs are restarted at the same time, depending on the pools (and their
replication size) they are part of. But Ceph tries to avoid "inactive
PGs", which is critical, of course. An edge case would be a pool with
size 1, where restarting an OSD would cause an inactive PG until that
OSD is up again. But since size 1 would be a bad idea anyway (except
for testing purposes), you'd have to live with that.
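If you want to check beforehand which pools could be affected, the
plain status commands are enough; for example:
ceph osd pool ls detail   # shows "replicated size N min_size M" per pool, look for size 1
ceph pg stat              # summary of PG states, anything not active needs attention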
If you have the option, I'd recommend creating a test cluster and
playing around with these things to get a better understanding,
especially when it comes to upgrade tests etc.
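For reference, bootstrapping such a lab with cephadm only takes a few
commands; a minimal sketch (the IPs and hostnames are placeholders for
your own lab network):
cephadm bootstrap --mon-ip 192.168.122.10    # on the first test node
ceph orch host add nautilus2 192.168.122.11  # add further test nodes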
I guess in my current scenario, restarting one host at a time makes
the most sense, with a
systemctl restart ceph-{fsid}.target
and then checking that "ceph -s" says OK before proceeding to the next
host.
Yes, if your crush-failure-domain is host, that should be safe, too.
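A minimal sketch of that host-by-host procedure (the hostnames are
placeholders, it assumes passwordless SSH from the admin node, and you
may want a stricter health check than just HEALTH_OK):
FSID=$(ceph fsid)
for host in nautilus nautilus2 nautilus3; do
    ssh "$host" "systemctl restart ceph-${FSID}.target"
    # wait until the cluster reports HEALTH_OK before moving on
    while ! ceph health | grep -q HEALTH_OK; do
        sleep 10
    done
done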
Regards,
Eugen
Quoting Mikael Öhman <micketeer@xxxxxxxxx>:
The documentation very briefly explains a few core commands for restarting
things:
https://docs.ceph.com/en/quincy/cephadm/operations/#starting-and-stopping-daemons
but I feel I'm lacking quite a few details about what is safe to do.
I have a system in production: clusters connected via CephFS and some
shared block devices.
We would like to restart some things due to some new network
configurations. Going daemon by daemon would take forever, so I'm curious
as to what happens if one tries the command:
ceph orch restart osd
Will that try to be smart and just restart a few at a time to keep things
up and available? Or will it just trigger a restart everywhere
simultaneously?
I guess in my current scenario, restarting one host at a time makes the
most sense, with a
systemctl restart ceph-{fsid}.target
and then checking that "ceph -s" says OK before proceeding to the next
host, but I'm still curious as to what the "ceph orch restart xxx" command
would do (but not enough to try it out in production)
Best regards, Mikael
Chalmers University of Technology
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx