Re: How does a "ceph orch restart SERVICE" affect availability?

> Will that try to be smart and just restart a few at a time to keep things
> up and available? Or will it just trigger a restart everywhere?

Basically, yes: this is the same thing that happens when services are restarted during an upgrade. Upgrades are designed as a rolling procedure, because restarting all daemons of a service at the same time would cause an interruption. So the daemons are scheduled for restart, and the mgr decides when it's safe to restart the next one (this is a test cluster that started on Nautilus but is on Quincy now):

nautilus:~ # ceph orch restart osd.osd-hdd-ssd
Scheduled to restart osd.5 on host 'nautilus'
Scheduled to restart osd.0 on host 'nautilus'
Scheduled to restart osd.2 on host 'nautilus'
Scheduled to restart osd.1 on host 'nautilus2'
Scheduled to restart osd.4 on host 'nautilus2'
Scheduled to restart osd.7 on host 'nautilus2'
Scheduled to restart osd.3 on host 'nautilus3'
Scheduled to restart osd.8 on host 'nautilus3'
Scheduled to restart osd.6 on host 'nautilus3'

With OSDs it's possible (or even likely) that several are restarted at the same time, depending on the pools they belong to and the replication size of those pools. But Ceph tries to avoid inactive PGs, which are critical, of course. An edge case is a pool with size 1: restarting an OSD there causes inactive PGs until the OSD is up again. Since size 1 is a bad idea anyway (except for testing), you'd have to live with that.

If you have the option, I'd recommend creating a test cluster and playing around with these things to get a better understanding, especially for upgrade tests.
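To find out up front which pools would hit that size 1 edge case, something like the following could help. It's only a sketch: the function name single_replica_pools and the CEPH_CMD variable are made up for illustration, and it assumes that "ceph osd pool get POOL size" prints a line of the form "size: N".

```shell
#!/bin/sh
# Sketch: list pools whose replication size is 1, i.e. pools where
# restarting a single OSD leaves PGs inactive until it is back up.
# CEPH_CMD is a hypothetical override so the logic can be dry-run
# without a cluster; by default it is the real "ceph" CLI.
CEPH_CMD=${CEPH_CMD:-ceph}

# Print the name of every pool with size 1, one per line.
single_replica_pools() {
    local pool size
    for pool in $($CEPH_CMD osd pool ls); do
        # Assumed output format: "size: 3" (second field is the number)
        size=$($CEPH_CMD osd pool get "$pool" size | awk '{print $2}')
        if [ "$size" = "1" ]; then
            echo "$pool"
        fi
    done
    return 0
}
```

Calling single_replica_pools prints nothing if no pool has size 1. For a single OSD, "ceph osd ok-to-stop ID" asks the cluster directly whether stopping it would make PGs unavailable.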

> I guess in my current scenario, restarting one host at a time makes most
> sense, with a
> systemctl restart ceph-{fsid}.target
> and then checking that "ceph -s" says OK before proceeding to the next
> host
Yes, if your crush-failure-domain is host that should be safe, too.
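
The host-by-host approach could be scripted roughly like this. Treat it as a sketch under assumptions that are not from this thread: SSH access to each host with enough privileges to run systemctl, an admin keyring on the machine running the script, and the hypothetical names rolling_restart, wait_for_health_ok, CEPH_CMD, SSH_CMD, HEALTH_ATTEMPTS and HEALTH_DELAY. It polls "ceph health" instead of reading "ceph -s" by eye.

```shell
#!/bin/sh
# Sketch: restart all Ceph daemons one host at a time and wait for
# HEALTH_OK before moving on. The command variables are hypothetical
# overrides so the logic can be dry-run without a cluster.
CEPH_CMD=${CEPH_CMD:-ceph}
SSH_CMD=${SSH_CMD:-ssh}

# Poll "ceph health" until it starts with HEALTH_OK.
# $1 = number of attempts (default 60), $2 = seconds between attempts (default 10)
wait_for_health_ok() {
    local attempts="${1:-60}" delay="${2:-10}" i=0
    while [ "$i" -lt "$attempts" ]; do
        case "$($CEPH_CMD health 2>/dev/null)" in
            HEALTH_OK*) return 0 ;;
        esac
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Usage: rolling_restart FSID HOST [HOST...]
rolling_restart() {
    local fsid="$1" host
    shift
    for host in "$@"; do
        echo "Restarting ceph-${fsid}.target on ${host}"
        $SSH_CMD "$host" systemctl restart "ceph-${fsid}.target" || return 1
        # Stop at the first host after which the cluster does not recover
        if ! wait_for_health_ok "${HEALTH_ATTEMPTS:-60}" "${HEALTH_DELAY:-10}"; then
            echo "Cluster not HEALTH_OK after ${host}; stopping" >&2
            return 1
        fi
    done
}
```

It would be called as "rolling_restart FSID host1 host2 host3" with the cluster's real fsid. Health may still read HEALTH_OK for a moment right after a restart, so a short pause before the first check may be needed.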


Quoting Mikael Öhman <micketeer@xxxxxxxxx>:

The documentation explains a few core restart commands only very briefly,
and I feel I'm missing quite a few details about what is safe to do.

I have a system in production, clusters connected via CephFS and some
shared block devices.
We would like to restart some things due to some new network
configurations. Going daemon by daemon would take forever, so I'm curious
as to what happens if one tries the command:

ceph orch restart osd

Will that try to be smart and just restart a few at a time to keep things
up and available? Or will it just trigger a restart everywhere?

I guess in my current scenario, restarting one host at a time makes most
sense, with a
systemctl restart ceph-{fsid}.target
and then checking that "ceph -s" says OK before proceeding to the next
host, but I'm still curious as to what the "ceph orch restart xxx" command
would do (but not enough to try it out in production)

Best regards, Mikael
Chalmers University of Technology
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
