Re: ceph noout vs ceph norebalance, which is better for minor maintenance

Konstantin Shalygin <k0ste@xxxxxxxx> · Sat, 18 Feb 2023 00:18:34 +0700

> On 17 Feb 2023, at 23:20, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
>
>> * if rebalance will starts due EDAC or SFP degradation, is faster to fix the issue via DC engineers and put node back to work
> 
> A judicious mon_osd_down_out_subtree_limit setting can also do this by not rebalancing when an entire node is detected down. 

Yes. But when a single disk goes down, it may not actually be dead. Examples:

* the disk is just stuck - a reboot and/or a physical reseat (eject/insert) brings it back to life
* disk read errors - these take the OSD down, but after an OSD restart it works normally again (Pending Sectors -> Reallocated)

Refilling a single 16TB OSD may take 7-10 days, while the duty engineer may be able to fix the underlying problem in 10-20 minutes.
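
As a rough sanity check on the 7-10 day figure, here is a minimal sketch of the arithmetic. It assumes decimal units, a completely full 16TB OSD, and a constant sustained backfill rate; the function names are illustrative and the resulting rates are derived from the figures above, not measured on any cluster.

```python
# Back-of-the-envelope backfill arithmetic for a single OSD.
# Assumptions (not from the thread): decimal TB/MB, the OSD is full,
# and the backfill rate is constant for the whole period.
TB = 10**12
MB = 10**6
SECONDS_PER_DAY = 86_400


def implied_rate_mb_s(osd_bytes: int, days: float) -> float:
    """Average rate in MB/s needed to write osd_bytes in the given days."""
    return osd_bytes / (days * SECONDS_PER_DAY) / MB


def backfill_days(osd_bytes: int, rate_mb_s: float) -> float:
    """Days needed to write osd_bytes at a sustained rate_mb_s."""
    return osd_bytes / (rate_mb_s * MB) / SECONDS_PER_DAY


if __name__ == "__main__":
    for days in (7, 10):
        rate = implied_rate_mb_s(16 * TB, days)
        print(f"{days} days -> {rate:.1f} MB/s sustained")
```

Under these assumptions the 7-10 day window corresponds to an average of roughly 18-27 MB/s into the OSD, which is the cost being compared against a 10-20 minute hands-on fix.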

> 
>> * noout prevents unwanted OSD's fills and the run out of space => outage of services
> 
> Do you run your clusters very full?

We provide public services. This means a client can rent 1000 disks x 1000GB with one terraform command at 02:00 on a Saturday night. It is physically impossible to add nodes in that situation. Any data movement without upmap is highly undesirable.
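
To put that burst in numbers, a minimal sketch follows. The replication factor of 3, the 0.85 fill ceiling, and the 16TB OSD size used here are illustrative assumptions, not our actual settings, and the result is a worst case in which the client fully writes every volume.

```python
import math

# Worst-case capacity arithmetic for a sudden burst of rented volumes.
# Assumptions (illustrative only): decimal units, every volume fully
# written, 3x replication, 16TB OSDs, and a 0.85 maximum fill ratio.
GB = 10**9
TB = 10**12
PB = 10**15


def raw_bytes_needed(volumes: int, volume_gb: int, replicas: int) -> int:
    """Raw cluster bytes consumed if every volume is completely filled."""
    return volumes * volume_gb * GB * replicas


def osds_needed(raw_bytes: int, osd_tb: int, max_fill: float) -> int:
    """Whole OSDs needed to hold raw_bytes without exceeding max_fill."""
    return math.ceil(raw_bytes / (osd_tb * TB * max_fill))


if __name__ == "__main__":
    logical = raw_bytes_needed(1000, 1000, replicas=1)
    raw = raw_bytes_needed(1000, 1000, replicas=3)
    print(f"logical: {logical / PB:.1f} PB, raw: {raw / PB:.1f} PB")
    print(f"16TB OSDs at 0.85 fill: {osds_needed(raw, 16, 0.85)}")
```

With these assumed parameters a single order is 1 PB logical and 3 PB raw, which is far more than can be absorbed by racking new nodes on short notice.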

k