Re: [ceph-users] Re: Why set osd flag to noout during upgrade ?

Sebastian Wagner <sewagner@xxxxxxxxxx> · Mon, 27 Sep 2021 11:53:21 +0200

Hi Dan and Etienne,

That's interesting. Moving this to dev@

On 22.09.21 at 12:26, Dan van der Ster wrote:
Yeah, you don't want to deal with backfilling while the cluster is
upgrading. At best it can delay the upgrade; at worst, mixed-version
backfilling has (rarely) caused issues in the past.

We additionally set `noin` (`ceph osd set noin`) and disable the balancer
(`ceph balancer off`). The former prevents broken OSDs from re-entering the
cluster, and both of these similarly prevent backfilling from starting
mid-upgrade.
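A minimal sketch of the flag handling Dan describes, written as POSIX shell functions. Only `ceph osd set/unset noout`, `ceph osd set/unset noin` and `ceph balancer off/on` come from Ceph itself; the function names `pre_upgrade`/`post_upgrade` and the `CEPH` override variable are hypothetical conveniences, not part of any Ceph tooling.

```sh
#!/bin/sh
# Sketch only: freeze data movement before a rolling upgrade, restore it after.
# CEPH is a hypothetical override; it defaults to the real CLI, and setting
# CEPH=echo turns every call into a dry run that just prints the command.
CEPH=${CEPH:-ceph}

pre_upgrade() {
    $CEPH osd set noout   # down OSDs are not marked out, so no recovery starts
    $CEPH osd set noin    # OSDs that come back up are not automatically marked in
    $CEPH balancer off    # the balancer stops scheduling new PG moves
}

post_upgrade() {
    # Undo the steps in reverse order once every daemon runs the new version.
    $CEPH balancer on
    $CEPH osd unset noin
    $CEPH osd unset noout
}
```

Source the file, call `pre_upgrade` before starting and `post_upgrade` once the upgrade is done; `CEPH=echo pre_upgrade` prints the commands without touching the cluster. Sebastian's point below would correspond to dropping the two balancer lines.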

From my point of view, I'd rather keep the safety net (in this case, the
balancer) enabled so that we don't end up with unbalanced OSDs if the
upgrade takes longer, even if that means some data movement during the
upgrade.

Especially for the vast majority of clusters out there, I'd prefer to
avoid ending up with full OSDs. Is that reasonable?

.. Dan

On Wed, 22 Sep 2021, 12:18 Etienne Menguy, <etienne.menguy@xxxxxxxx> wrote:

Hello,

From my experience, I see three reasons:
- You don’t want to recover data that is already on a down OSD; rebalancing
can have a big impact on performance.
- If the upgrade/maintenance goes wrong, you will want to focus on that issue
and not have to deal with whatever Ceph did in the meantime.
- During an upgrade, you have an ‘unusual’ cluster running different versions.
It’s supposed to work, but you probably want to keep it ‘boring’.

-
Etienne Menguy
etienne.menguy@xxxxxxxx

On 22 Sep 2021, at 11:51, Francois Legrand <fleg@xxxxxxxxxxxxxx> wrote:

Hello everybody,

I have a "stupid" question. Why do the docs recommend setting the noout
OSD flag during an upgrade/maintenance (and especially during an OSD
upgrade/maintenance)?
In my understanding, if an OSD goes down, after a while (600 s by
default) it's marked out and the cluster starts to rebuild its content
elsewhere in the cluster to maintain data redundancy. This generates
some transfer and load on other OSDs, but that's not a big deal!
As soon as the OSD is back, it's marked in again, and Ceph is able to
determine which data is back, stop the recovery, and reuse the unchanged
data on the returning OSD. Generally, the recovery is as fast as with the
noout flag (because with noout, the data modified during the down period
still has to be copied to the returning OSD).
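The down/out timing Francois describes can be captured in a deliberately simplified model. This is an illustrative assumption, not Ceph's monitor code: `DOWN_OUT_INTERVAL` and `would_mark_out` are hypothetical names, and only the `mon_osd_down_out_interval` setting (600 s by default) and the `noout` flag come from Ceph. The real monitors also check other conditions that this sketch ignores.

```sh
#!/bin/sh
# Simplified, hypothetical model of when a down OSD gets marked out.
# DOWN_OUT_INTERVAL stands in for Ceph's mon_osd_down_out_interval (600 s default).
DOWN_OUT_INTERVAL=${DOWN_OUT_INTERVAL:-600}

# would_mark_out SECONDS_DOWN NOOUT_SET
#   SECONDS_DOWN: how long the OSD has been down, in seconds
#   NOOUT_SET:    1 if the noout flag is set, 0 otherwise
# Returns 0 (true) if the OSD would be marked out, non-zero otherwise.
would_mark_out() {
    seconds_down=$1
    noout_set=$2
    if [ "$noout_set" -eq 1 ]; then
        return 1   # noout: the OSD stays "in" however long it is down
    fi
    # Without noout, the OSD is marked out once it has been down long enough,
    # which is what triggers recovery onto other OSDs.
    [ "$seconds_down" -ge "$DOWN_OUT_INTERVAL" ]
}
```

In this model, a 300 s reboot never triggers recovery, while a 900 s outage does unless `noout` is set; the recovery traffic Francois mentions only appears in that second case.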
So is there another reason, apart from limiting the transfer and the
load on other OSDs during the downtime?
F

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx