Just adding our feedback - this is affecting us as well. We reboot
periodically to test the durability of the clusters we run, and this is
fairly impactful. I could see power loss or other scenarios in which this
could end quite poorly for those with less-than-perfect redundancy in DCs
across multiple racks/PDUs/etc.

I see https://github.com/ceph/ceph/pull/42690 has been submitted, but I'd
definitely make an argument for it being a 'very high' priority so that it
hopefully gets a review in time for 16.2.6. :)

David

On Tue, Aug 10, 2021 at 4:36 AM Sebastian Wagner <sewagner@xxxxxxxxxx> wrote:
>
> Good morning Robert,
>
> On 10.08.21 at 09:53, Robert Sander wrote:
> > Hi,
> >
> > On 09.08.21 at 20:44, Adam King wrote:
> >
> >> This issue looks the same as https://tracker.ceph.com/issues/51027,
> >> which is being worked on. Essentially, it seems that hosts that were
> >> being rebooted were temporarily marked as offline, and cephadm had an
> >> issue where it would try to remove all daemons (outside of OSDs, I
> >> believe) from offline hosts.
> >
> > Sorry for maybe being rude, but how on earth does one come up with the
> > idea to automatically remove components from a cluster when just one
> > node is currently rebooting, without any operator interference?
>
> Obviously no one :-). We already have over 750 tests for the cephadm
> scheduler, and I can foresee that we'll get some additional ones for this
> case as well.
>
> Kind regards,
>
> Sebastian
>
> > Regards
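
For anyone trying to picture the failure mode Adam describes above, here is a
minimal, hypothetical Python sketch of how a placement scheduler can end up
reaping daemons from a host that is merely rebooting. The class and function
names are illustrative assumptions only; this is not cephadm's actual code.

    # Illustrative sketch only -- NOT cephadm's real scheduler.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Host:
        name: str
        online: bool

    @dataclass
    class Daemon:
        name: str
        hostname: str

    def daemons_to_remove(daemons: List[Daemon], hosts: List[Host]) -> List[Daemon]:
        # Buggy behaviour: only online hosts count as valid daemon locations,
        # so everything on a temporarily offline (rebooting) host looks orphaned.
        valid_hosts = {h.name for h in hosts if h.online}

        # A safer rule would treat offline hosts as still owning their daemons,
        # e.g. valid_hosts = {h.name for h in hosts}
        return [d for d in daemons if d.hostname not in valid_hosts]

    if __name__ == "__main__":
        hosts = [Host("node1", True), Host("node2", False)]  # node2 is rebooting
        daemons = [Daemon("mgr.node1", "node1"), Daemon("mgr.node2", "node2")]
        print([d.name for d in daemons_to_remove(daemons, hosts)])
        # prints ['mgr.node2'] -- a daemon on a host that is only rebooting
        # would be scheduled for removal

A regression test along the lines Sebastian mentions would, roughly speaking,
assert that this removal list stays empty when a host is flagged offline.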