Here is how this looks on a test cluster:

# ceph osd tree
ID CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF
-1       2.44707 root default
-3       0.81569     host tceph-01
 0   hdd 0.27190         osd.0          up  1.00000 1.00000
 2   hdd 0.27190         osd.2          up  1.00000 1.00000
 4   hdd 0.27190         osd.4          up  1.00000 1.00000
-7       0.81569     host tceph-02
 6   hdd 0.27190         osd.6          up  1.00000 1.00000
 7   hdd 0.27190         osd.7          up  1.00000 1.00000
 8   hdd 0.27190         osd.8          up  1.00000 1.00000
-5       0.81569     host tceph-03
 1   hdd 0.27190         osd.1          up  1.00000 1.00000
 3   hdd 0.27190         osd.3          up  1.00000 1.00000
 5   hdd 0.27190         osd.5          up  1.00000 1.00000

# ceph pg dump pgs_brief | head -8
PG_STAT STATE        UP            UP_PRIMARY ACTING        ACTING_PRIMARY
3.7e    active+clean [6,0,2,5,3,7]          6 [6,0,2,5,3,7]              6
2.7f    active+clean [7,5,2]                7 [7,5,2]                    7
2.7e    active+clean [0,1,8]                0 [0,1,8]                    0
3.7c    active+clean [6,5,0,7,2,8]          6 [6,5,0,7,2,8]              6
2.7d    active+clean [0,8,3]                0 [0,8,3]                    0
3.7d    active+clean [7,0,3,8,1,2]          7 [7,0,3,8,1,2]              7

After osd reweight to 0.5:

# ceph osd tree
ID CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF
-1       2.44707 root default
-3       0.81569     host tceph-01
 0   hdd 0.27190         osd.0          up  0.50000 1.00000
 2   hdd 0.27190         osd.2          up  0.50000 1.00000
 4   hdd 0.27190         osd.4          up  0.50000 1.00000
-7       0.81569     host tceph-02
 6   hdd 0.27190         osd.6          up  0.50000 1.00000
 7   hdd 0.27190         osd.7          up  0.50000 1.00000
 8   hdd 0.27190         osd.8          up  0.50000 1.00000
-5       0.81569     host tceph-03
 1   hdd 0.27190         osd.1          up  0.50000 1.00000
 3   hdd 0.27190         osd.3          up  0.50000 1.00000
 5   hdd 0.27190         osd.5          up  0.50000 1.00000

# ceph pg dump pgs_brief | head -8
PG_STAT STATE                         UP                                       UP_PRIMARY ACTING        ACTING_PRIMARY
3.7e    active+remapped+backfill_wait [6,0,4,5,1,2147483647]                            6 [6,0,2,5,3,7]              6
2.7f    active+clean                  [7,5,2]                                           7 [7,5,2]                    7
3.7f    active+remapped+backfill_wait [1,2147483647,7,8,2147483647,2]                   1 [0,5,4,8,6,2]              0
2.7e    active+remapped+backfill_wait [5,4,8]                                           5 [1,8,0]                    1
3.7c    active+remapped+backfill_wait [2147483647,1,0,2147483647,2147483647,8]          1 [6,5,0,7,2,8]              6
2.7d    active+remapped+backfill_wait [0,3,6]                                           0 [0,3,8]                    0
3.7d    active+remapped+backfill_wait [2147483647,0,3,6,4,2]                            0 [7,0,3,8,1,2]              7

After osd crush reweight to 0.5*0.27190:

# ceph osd tree
ID CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF
-1       1.22346 root default
-3       0.40782     host tceph-01
 0   hdd 0.13594         osd.0          up  1.00000 1.00000
 2   hdd 0.13594         osd.2          up  1.00000 1.00000
 4   hdd 0.13594         osd.4          up  1.00000 1.00000
-7       0.40782     host tceph-02
 6   hdd 0.13594         osd.6          up  1.00000 1.00000
 7   hdd 0.13594         osd.7          up  1.00000 1.00000
 8   hdd 0.13594         osd.8          up  1.00000 1.00000
-5       0.40782     host tceph-03
 1   hdd 0.13594         osd.1          up  1.00000 1.00000
 3   hdd 0.13594         osd.3          up  1.00000 1.00000
 5   hdd 0.13594         osd.5          up  1.00000 1.00000

# ceph pg dump pgs_brief | head -8
PG_STAT STATE        UP            UP_PRIMARY ACTING        ACTING_PRIMARY
3.7e    active+clean [6,0,2,5,3,7]          6 [6,0,2,5,3,7]              6
2.7f    active+clean [7,5,2]                7 [7,5,2]                    7
3.7f    active+clean [0,5,4,8,6,2]          0 [0,5,4,8,6,2]              0
2.7e    active+clean [0,1,8]                0 [0,1,8]                    0
3.7c    active+clean [6,5,0,7,2,8]          6 [6,5,0,7,2,8]              6
2.7d    active+clean [0,8,3]                0 [0,8,3]                    0
3.7d    active+clean [7,0,3,8,1,2]          7 [7,0,3,8,1,2]              7

According to the documentation, I would expect identical mappings in all 3 cases. Can someone help me out here?
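To spell out the reasoning behind my expectation, here is a minimal Python toy of a straw2-like draw (my own simplified sketch with an arbitrary hash, not Ceph's actual CRUSH code): each candidate OSD gets a value ln(u)/weight from a deterministic pseudo-random u in (0,1), and the largest value wins. Multiplying every weight by the same positive factor divides every draw by that factor, so the winner, and hence the mapping, should not change:

import hashlib
import math

def draw(pg, osd, weight):
    # Deterministic pseudo-random u in (0,1) per (pg, osd) pair; this
    # stands in for CRUSH's hash and is for illustration only.
    h = hashlib.sha256(f"{pg}:{osd}".encode()).digest()
    u = (int.from_bytes(h[:6], "big") + 0.5) / 2.0**48
    # straw2-style draw: ln(u) < 0, so a larger weight pulls the value
    # closer to 0, i.e. gives the OSD a "longer straw".
    return math.log(u) / weight

def pick(pg, weights):
    # The OSD with the largest draw wins.
    return max(weights, key=lambda osd: draw(pg, osd, weights[osd]))

weights = {0: 1.74699, 1: 1.74699, 2: 0.27190, 3: 0.09099}
scaled = {osd: 0.052 * w for osd, w in weights.items()}  # same factor for all

# Scaling all weights by one factor divides all draws by that factor,
# so the argmax (the chosen OSD) stays the same for every PG.
for pg in range(128):
    assert pick(pg, weights) == pick(pg, scaled)
print("identical winners under a uniform weight change")

Of course this only models a single straw2 choice; the real placement repeats such choices per replica and per bucket level, but each individual comparison should be scale-invariant in the same way.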
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 15 November 2022 10:09:10
To: ceph-users@xxxxxxx
Subject: OSDs down after reweight

Hi all,

I re-weighted all OSDs in a pool down from 1.0 to the same value 0.052 (see reason below). OSDs were marked down, there were slow ops all over the place, and the MDSes started complaining about slow requests. Basically all PGs were remapped. After setting all re-weights back to 1.0, the situation went back to normal.

Expected behaviour: no (!!!) PGs are remapped and everything continues to work. Why did things go down?

More details: we have 24 OSDs with weight=1.74699 in a pool. I wanted to add OSDs with weight=0.09099 in such a way that the small OSDs receive approximately the same number of PGs as the large ones. Setting a re-weight factor of 0.052 for the large ones should achieve just that: 1.74699*0.052=0.09084. So, the procedure was:

- ceph osd reweight osd.N 0.052 for all OSDs in that pool
- add the small disks and re-balance

I would expect the CRUSH mapping to be invariant under a uniform change of weight. That is, if I apply the same relative weight change to all OSDs in a pool (new_weight = old_weight * common_factor), the mappings should be preserved. However, this is not what I observed. How is it possible that PG mappings change when the relative weights of the OSDs to each other stay the same (the probability of picking any given OSD is unchanged)?

Thanks for any hints.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx