Re: OSDs down after reweight

I think you misinterpreted the reason I posted the output from our test cluster. Firstly, there is zero load on the test cluster, and that is the only reason the OSDs stayed up here. They did go down on the production cluster after performing a similar stunt, and I'm not going to repeat that just to show a ceph status with down OSDs.

What I'm after with the test cluster output is this: why are the mappings with crush-weight=0.5*0.27190 and reweight=1.0 not identical to the mappings with crush-weight=0.27190 and reweight=0.5? As I understand the documentation (https://docs.ceph.com/en/pacific/rados/operations/control/#osd-subsystem , search for "Set the override weight (reweight)"), the CRUSH algorithm looks at OSDs with an effective weight of crush-weight*reweight, which implies that the results should be identical. However, this is obviously not the case.
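
For reference, the two cases compared in the test output below come from the two different weight knobs. The exact invocations are not in the transcript; as an illustration for osd.0 they would look like this (the same was applied to every OSD):

# ceph osd reweight 0 0.5                  # override weight: crush weight stays 0.27190, reweight drops to 0.5
# ceph osd crush reweight osd.0 0.13595    # crush weight set to 0.5*0.27190 (shown as ~0.13594 in the tree), reweight stays 1.0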

When you look at the large number of broken mappings in the crush-weight=0.27190 reweight=0.5 case (the 2147483647 entries in the UP sets are CRUSH's placeholder for "no OSD mapped"), it is not very surprising to see slow ops and OSDs going down under load; it is a completely dysfunctional set of PG mappings.

What I'm trying to say here is that the OSD-down observation is almost certainly a result of the failed mappings produced when using reweight. This should not happen; it smells like a really bad bug.
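
By the way, the mapping changes can be captured and compared without putting any load on a cluster. A minimal sketch, with placeholder file names: dump the osdmap before and after the weight change and diff the computed PG mappings:

# ceph osd getmap -o /tmp/osdmap.before
# ... apply the weight change ...
# ceph osd getmap -o /tmp/osdmap.after
# osdmaptool /tmp/osdmap.before --test-map-pgs-dump > /tmp/pgs.before
# osdmaptool /tmp/osdmap.after --test-map-pgs-dump > /tmp/pgs.after
# diff /tmp/pgs.before /tmp/pgs.after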

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Etienne Menguy <etienne.menguy@xxxxxxxxxxx>
Sent: 15 November 2022 10:45:19
To: Frank Schilder; ceph-users@xxxxxxx
Subject: RE: OSDs down after reweight

Hi,
You probably caused a large rebalance and overloaded your slow HDDs. But all OSDs are up in what you are sharing.

Also, I see you changed both the weight and the reweight values; is that what you wanted?

Étienne

> -----Original Message-----
> From: Frank Schilder <frans@xxxxxx>
> Sent: mardi 15 novembre 2022 10:38
> To: ceph-users@xxxxxxx
> Subject:  Re: OSDs down after reweight
>
> Here is how this looks on a test cluster:
>
> # ceph osd tree
> ID  CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT  PRI-AFF
> -1         2.44707  root default
> -3         0.81569      host tceph-01
>  0    hdd  0.27190          osd.0          up   1.00000  1.00000
>  2    hdd  0.27190          osd.2          up   1.00000  1.00000
>  4    hdd  0.27190          osd.4          up   1.00000  1.00000
> -7         0.81569      host tceph-02
>  6    hdd  0.27190          osd.6          up   1.00000  1.00000
>  7    hdd  0.27190          osd.7          up   1.00000  1.00000
>  8    hdd  0.27190          osd.8          up   1.00000  1.00000
> -5         0.81569      host tceph-03
>  1    hdd  0.27190          osd.1          up   1.00000  1.00000
>  3    hdd  0.27190          osd.3          up   1.00000  1.00000
>  5    hdd  0.27190          osd.5          up   1.00000  1.00000
>
> # ceph pg dump pgs_brief | head -8
> PG_STAT  STATE         UP             UP_PRIMARY  ACTING         ACTING_PRIMARY
> 3.7e     active+clean  [6,0,2,5,3,7]           6  [6,0,2,5,3,7]               6
> 2.7f     active+clean        [7,5,2]           7        [7,5,2]               7
> 2.7e     active+clean        [0,1,8]           0        [0,1,8]               0
> 3.7c     active+clean  [6,5,0,7,2,8]           6  [6,5,0,7,2,8]               6
> 2.7d     active+clean        [0,8,3]           0        [0,8,3]               0
> 3.7d     active+clean  [7,0,3,8,1,2]           7  [7,0,3,8,1,2]               7
>
>
> After osd reweight to 0.5:
>
> # ceph osd tree
> ID  CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT  PRI-AFF
> -1         2.44707  root default
> -3         0.81569      host tceph-01
>  0    hdd  0.27190          osd.0          up   0.50000  1.00000
>  2    hdd  0.27190          osd.2          up   0.50000  1.00000
>  4    hdd  0.27190          osd.4          up   0.50000  1.00000
> -7         0.81569      host tceph-02
>  6    hdd  0.27190          osd.6          up   0.50000  1.00000
>  7    hdd  0.27190          osd.7          up   0.50000  1.00000
>  8    hdd  0.27190          osd.8          up   0.50000  1.00000
> -5         0.81569      host tceph-03
>  1    hdd  0.27190          osd.1          up   0.50000  1.00000
>  3    hdd  0.27190          osd.3          up   0.50000  1.00000
>  5    hdd  0.27190          osd.5          up   0.50000  1.00000
>
> # ceph pg dump pgs_brief | head -8
> PG_STAT  STATE                          UP                                        UP_PRIMARY  ACTING         ACTING_PRIMARY
> 3.7e     active+remapped+backfill_wait                    [6,0,4,5,1,2147483647]           6  [6,0,2,5,3,7]               6
> 2.7f     active+clean                                                     [7,5,2]           7        [7,5,2]               7
> 3.7f     active+remapped+backfill_wait           [1,2147483647,7,8,2147483647,2]           1  [0,5,4,8,6,2]               0
> 2.7e     active+remapped+backfill_wait                                     [5,4,8]           5        [1,8,0]               1
> 3.7c     active+remapped+backfill_wait  [2147483647,1,0,2147483647,2147483647,8]           1  [6,5,0,7,2,8]               6
> 2.7d     active+remapped+backfill_wait                                     [0,3,6]           0        [0,3,8]               0
> 3.7d     active+remapped+backfill_wait                    [2147483647,0,3,6,4,2]           0  [7,0,3,8,1,2]               7
>
>
> After osd crush reweight to 0.5*0.27190:
>
> # ceph osd tree
> ID  CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT  PRI-AFF
> -1         1.22346  root default
> -3         0.40782      host tceph-01
>  0    hdd  0.13594          osd.0          up   1.00000  1.00000
>  2    hdd  0.13594          osd.2          up   1.00000  1.00000
>  4    hdd  0.13594          osd.4          up   1.00000  1.00000
> -7         0.40782      host tceph-02
>  6    hdd  0.13594          osd.6          up   1.00000  1.00000
>  7    hdd  0.13594          osd.7          up   1.00000  1.00000
>  8    hdd  0.13594          osd.8          up   1.00000  1.00000
> -5         0.40782      host tceph-03
>  1    hdd  0.13594          osd.1          up   1.00000  1.00000
>  3    hdd  0.13594          osd.3          up   1.00000  1.00000
>  5    hdd  0.13594          osd.5          up   1.00000  1.00000
>
> # ceph pg dump pgs_brief | head -8
> PG_STAT  STATE         UP             UP_PRIMARY  ACTING         ACTING_PRIMARY
> 3.7e     active+clean  [6,0,2,5,3,7]           6  [6,0,2,5,3,7]               6
> 2.7f     active+clean        [7,5,2]           7        [7,5,2]               7
> 3.7f     active+clean  [0,5,4,8,6,2]           0  [0,5,4,8,6,2]               0
> 2.7e     active+clean        [0,1,8]           0        [0,1,8]               0
> 3.7c     active+clean  [6,5,0,7,2,8]           6  [6,5,0,7,2,8]               6
> 2.7d     active+clean        [0,8,3]           0        [0,8,3]               0
> 3.7d     active+clean  [7,0,3,8,1,2]           7  [7,0,3,8,1,2]               7
>
>
> According to the documentation, I would expect identical mappings in all 3
> cases. Can someone help me out here?
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: 15 November 2022 10:09:10
> To: ceph-users@xxxxxxx
> Subject:  OSDs down after reweight
>
> Hi all,
>
> I re-weighted all OSDs in a pool down from 1.0 to the same value 0.052 (see
> reason below). After this, all hell broke loose: OSDs were marked down, there
> were slow ops all over the place, and the MDSes started complaining about slow
> ops/requests. Basically all PGs were remapped. After setting all re-weights
> back to 1.0 the situation went back to normal.
>
> Expected behaviour: No (!!!) PGs are remapped and everything continues to
> work. Why did things go down?
>
> More details: We have 24 OSDs with weight=1.74699 in a pool. I wanted to
> add OSDs with weight=0.09099 in such a way that the small OSDs receive
> approximately the same number of PGs as the large ones. Setting a re-weight
> factor of 0.052 for the large ones should achieve just that:
> 1.74699*0.052=0.09084. So, the procedure was:
>
> - ceph osd reweight N 0.052 for all OSDs in that pool (see the sketch below)
> - add the small disks and re-balance
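> 
> For step 1, a sketch of the loop (OSD IDs 0-23 stand in for the actual IDs of
> the large OSDs in that pool):
> 
> # for N in $(seq 0 23); do ceph osd reweight $N 0.052; done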
>
> I would expect the CRUSH mapping to be invariant under a uniform change of
> weights. That is, if I apply the same relative weight change to all OSDs in a
> pool (new_weight=old_weight*common_factor), the mappings should be
> preserved. However, this is not what I observed. How is it possible that PG
> mappings change if the relative weight of the OSDs to each other stays the
> same, i.e. the probability of picking any given OSD is unchanged?
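> 
> As a concrete check with the numbers above: the relative weight of a large OSD
> is 1.74699 divided by the total weight, and multiplying every weight by the
> same factor leaves that ratio untouched, e.g.
> (0.052*1.74699)/(0.052*total) = 1.74699/total. So the selection probabilities
> CRUSH works with should be the same at every level of the hierarchy.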
>
> Thanks for any hints.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



