I think you misinterpreted the reason I posted the output from our test cluster. Firstly, there is zero load on it, and the OSDs stayed up here only because of that. They went down on the production cluster after performing a similar stunt, and I'm not going to do this again just to show a ceph status with down OSDs.

What I'm after with the test cluster output is: why are the mappings with crush-weight=0.5*0.27190 reweight=1.0 not identical to the mappings with crush-weight=0.27190 reweight=0.5? According to how I understand the documentation (https://docs.ceph.com/en/pacific/rados/operations/control/#osd-subsystem , search for "Set the override weight (reweight)"), the crush algorithm looks at OSDs with a weight computed as crush-weight*reweight, which implies that the results should be identical. However, this is obviously not the case.

When you look at the large number of broken mappings in the case crush-weight=0.27190 reweight=0.5, it is not very surprising to see slow ops and OSDs going down under load; it's a completely dysfunctional set of PG mappings. What I'm trying to say is that the "OSDs down" observation is almost certainly a result of the broken mappings produced when using reweight. This should not happen; it smells like a really bad bug.
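For completeness, the two variants in the test-cluster output quoted below were produced with commands roughly along these lines (a sketch only; osd IDs 0-8 as in the trees below, 0.13594 being half of 0.27190 up to display rounding):

    # variant 1: set the override reweight to 0.5, leave crush weights untouched
    for osd in 0 1 2 3 4 5 6 7 8; do ceph osd reweight $osd 0.5; done
    ceph pg dump pgs_brief | head -8

    # reset the override, then variant 2: halve the crush weights instead
    for osd in 0 1 2 3 4 5 6 7 8; do ceph osd reweight $osd 1.0; done
    for osd in 0 1 2 3 4 5 6 7 8; do ceph osd crush reweight osd.$osd 0.13594; done
    ceph pg dump pgs_brief | head -8

If reweight really entered the placement calculation only as crush-weight*reweight, the UP sets printed after the two variants would have to be identical; the quoted output shows that they are not.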
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Etienne Menguy <etienne.menguy@xxxxxxxxxxx>
Sent: 15 November 2022 10:45:19
To: Frank Schilder; ceph-users@xxxxxxx
Subject: RE: OSDs down after reweight

Hi,

You probably caused a large rebalance and overloaded your slow HDDs. But all OSDs are up in what you are sharing. Also, I see you changed both the weight and the reweight values; is that what you wanted?

Étienne

> -----Original Message-----
> From: Frank Schilder <frans@xxxxxx>
> Sent: Tuesday, 15 November 2022 10:38
> To: ceph-users@xxxxxxx
> Subject: Re: OSDs down after reweight
>
> Here is how this looks on a test cluster:
>
> # ceph osd tree
> ID  CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT  PRI-AFF
> -1         2.44707  root default
> -3         0.81569      host tceph-01
>  0    hdd  0.27190          osd.0          up   1.00000  1.00000
>  2    hdd  0.27190          osd.2          up   1.00000  1.00000
>  4    hdd  0.27190          osd.4          up   1.00000  1.00000
> -7         0.81569      host tceph-02
>  6    hdd  0.27190          osd.6          up   1.00000  1.00000
>  7    hdd  0.27190          osd.7          up   1.00000  1.00000
>  8    hdd  0.27190          osd.8          up   1.00000  1.00000
> -5         0.81569      host tceph-03
>  1    hdd  0.27190          osd.1          up   1.00000  1.00000
>  3    hdd  0.27190          osd.3          up   1.00000  1.00000
>  5    hdd  0.27190          osd.5          up   1.00000  1.00000
>
> # ceph pg dump pgs_brief | head -8
> PG_STAT  STATE         UP             UP_PRIMARY  ACTING         ACTING_PRIMARY
> 3.7e     active+clean  [6,0,2,5,3,7]  6           [6,0,2,5,3,7]  6
> 2.7f     active+clean  [7,5,2]        7           [7,5,2]        7
> 2.7e     active+clean  [0,1,8]        0           [0,1,8]        0
> 3.7c     active+clean  [6,5,0,7,2,8]  6           [6,5,0,7,2,8]  6
> 2.7d     active+clean  [0,8,3]        0           [0,8,3]        0
> 3.7d     active+clean  [7,0,3,8,1,2]  7           [7,0,3,8,1,2]  7
>
> After osd reweight to 0.5:
>
> # ceph osd tree
> ID  CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT  PRI-AFF
> -1         2.44707  root default
> -3         0.81569      host tceph-01
>  0    hdd  0.27190          osd.0          up   0.50000  1.00000
>  2    hdd  0.27190          osd.2          up   0.50000  1.00000
>  4    hdd  0.27190          osd.4          up   0.50000  1.00000
> -7         0.81569      host tceph-02
>  6    hdd  0.27190          osd.6          up   0.50000  1.00000
>  7    hdd  0.27190          osd.7          up   0.50000  1.00000
>  8    hdd  0.27190          osd.8          up   0.50000  1.00000
> -5         0.81569      host tceph-03
>  1    hdd  0.27190          osd.1          up   0.50000  1.00000
>  3    hdd  0.27190          osd.3          up   0.50000  1.00000
>  5    hdd  0.27190          osd.5          up   0.50000  1.00000
>
> # ceph pg dump pgs_brief | head -8
> PG_STAT  STATE                          UP                                        UP_PRIMARY  ACTING         ACTING_PRIMARY
> 3.7e     active+remapped+backfill_wait  [6,0,4,5,1,2147483647]                    6           [6,0,2,5,3,7]  6
> 2.7f     active+clean                   [7,5,2]                                   7           [7,5,2]        7
> 3.7f     active+remapped+backfill_wait  [1,2147483647,7,8,2147483647,2]           1           [0,5,4,8,6,2]  0
> 2.7e     active+remapped+backfill_wait  [5,4,8]                                   5           [1,8,0]        1
> 3.7c     active+remapped+backfill_wait  [2147483647,1,0,2147483647,2147483647,8]  1           [6,5,0,7,2,8]  6
> 2.7d     active+remapped+backfill_wait  [0,3,6]                                   0           [0,3,8]        0
> 3.7d     active+remapped+backfill_wait  [2147483647,0,3,6,4,2]                    0           [7,0,3,8,1,2]  7
>
> After osd crush reweight to 0.5*0.27190:
>
> # ceph osd tree
> ID  CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT  PRI-AFF
> -1         1.22346  root default
> -3         0.40782      host tceph-01
>  0    hdd  0.13594          osd.0          up   1.00000  1.00000
>  2    hdd  0.13594          osd.2          up   1.00000  1.00000
>  4    hdd  0.13594          osd.4          up   1.00000  1.00000
> -7         0.40782      host tceph-02
>  6    hdd  0.13594          osd.6          up   1.00000  1.00000
>  7    hdd  0.13594          osd.7          up   1.00000  1.00000
>  8    hdd  0.13594          osd.8          up   1.00000  1.00000
> -5         0.40782      host tceph-03
>  1    hdd  0.13594          osd.1          up   1.00000  1.00000
>  3    hdd  0.13594          osd.3          up   1.00000  1.00000
>  5    hdd  0.13594          osd.5          up   1.00000  1.00000
>
> # ceph pg dump pgs_brief | head -8
> PG_STAT  STATE         UP             UP_PRIMARY  ACTING         ACTING_PRIMARY
> 3.7e     active+clean  [6,0,2,5,3,7]  6           [6,0,2,5,3,7]  6
> 2.7f     active+clean  [7,5,2]        7           [7,5,2]        7
> 3.7f     active+clean  [0,5,4,8,6,2]  0           [0,5,4,8,6,2]  0
> 2.7e     active+clean  [0,1,8]        0           [0,1,8]        0
> 3.7c     active+clean  [6,5,0,7,2,8]  6           [6,5,0,7,2,8]  6
> 2.7d     active+clean  [0,8,3]        0           [0,8,3]        0
> 3.7d     active+clean  [7,0,3,8,1,2]  7           [7,0,3,8,1,2]  7
>
> According to the documentation, I would expect identical mappings in all
> three cases. Can someone help me out here?
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: 15 November 2022 10:09:10
> To: ceph-users@xxxxxxx
> Subject: OSDs down after reweight
>
> Hi all,
>
> I re-weighted all OSDs in a pool down from 1.0 to the same value 0.052 (see
> reason below). After this, all hell broke loose: OSDs were marked down, there
> were slow ops all over the place, and the MDSes started complaining about slow
> ops/requests. Basically all PGs were remapped. After setting all re-weights
> back to 1.0, the situation went back to normal.
>
> Expected behaviour: no (!!!) PGs are remapped and everything continues to
> work. Why did things go down?
>
> More details: we have 24 OSDs with weight=1.74699 in a pool. I wanted to add
> OSDs with weight=0.09099 in such a way that the small OSDs receive
> approximately the same number of PGs as the large ones. Setting a re-weight
> factor of 0.052 for the large ones should achieve just that:
> 1.74699*0.052=0.09084. So, the procedure was:
>
> - ceph osd reweight N 0.052 for all OSDs in that pool
> - add the small disks and re-balance
>
> I would expect the crush mapping to be invariant under a uniform change of
> weight. That is, if I apply the same relative weight change to all OSDs
> (new_weight=old_weight*common_factor) in a pool, the mappings should be
> preserved. However, this is not what I observed. How is it possible that PG
> mappings change if the relative weight of all OSDs to each other stays the
> same (the probabilities of picking any given OSD are unchanged)?
>
> Thanks for any hints.
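To make the intended change concrete, this is roughly what that procedure amounts to in command form (a sketch only; the loop assumes the 24 large OSDs are osd.0 through osd.23, so substitute the real IDs):

    # Goal: each large OSD (crush weight 1.74699) should end up with about as
    # many PGs as a new small OSD (crush weight 0.09099).
    # Override reweight factor: 1.74699 * 0.052 = 0.09084, close enough to 0.09099.
    for osd in $(seq 0 23); do
        ceph osd reweight $osd 0.052
    done
    # then add the 0.09099 OSDs and let the cluster rebalance;
    # the change can be rolled back with "ceph osd reweight <N> 1.0"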
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx