Hi Andras,

I made exactly the same observation in another scenario: I added some OSDs while other OSDs were down.

This is expected. The crush map is an a priori algorithm: it computes the location of objects without contacting a central server. Hence, *any* change of the crush map while an OSD is down will change the locations of the objects/PGs of the down OSD. Consequently, these objects/PGs become degraded, because no up OSD reports them; once peering is over after setting the weight to 0, the cluster must assume they are lost. Changing a weight is a change of the crush map.

The way to get the cluster to re-scan after starting the down OSD is to restore the crush map to exactly the state it was in before the OSD went down. In your case,

- starting the down OSD,
- setting the weight back to its original value

will find all missing objects. After the cluster is clean, set the weight back to 0 and now the OSD will be vacated as expected; a command-level sketch of this sequence follows.
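Assuming the affected OSD is osd.0 and its original crush weight was 8.0 (the values from the report quoted below; substitute the actual OSD id and weight), the sequence might look roughly like this:

# 1. Start the down OSD again
systemctl start ceph-osd@0

# 2. Restore the crush weight the OSD had before it went down
ceph osd crush reweight osd.0 8.0

# 3. Check repeatedly until all PGs are active+clean again
ceph -s

# 4. Now vacate the OSD; with the OSD up this only remaps/backfills data,
#    nothing becomes degraded
ceph osd crush reweight osd.0 0.0

If more than a single weight was changed while the OSD was down, the same idea can be applied to the whole crush map: save it before making changes with 'ceph osd getcrushmap -o <file>' and restore it with 'ceph osd setcrushmap -i <file>'.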
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx>
Sent: 18 May 2020 22:25:37
To: ceph-users
Subject: Reweighting OSD while down results in undersized+degraded PGs

In a recent cluster reorganization, we ended up with a lot of undersized/degraded PGs and a day of recovery from them, when all we expected was moving some data around. After retracing my steps, I found something odd. If I crush reweight an OSD to 0 while it is down - it results in the PGs of that OSD ending up degraded even after the OSD is restarted. If I do the same reweighting while the OSD is up - data gets moved without any degraded/undersized states. I would not expect this - so I wonder if this is a bug or is somehow intended. This is on ceph Nautilus 14.2.8. Below are the details.

Andras

First the case that works as I would expect:

# Healthy cluster
[root@xorphosd00 ~]# ceph -s
  cluster:
    id:     86d8a1b9-761b-4099-a960-6a303b951236
    health: HEALTH_WARN
            noout,nobackfill,noscrub,nodeep-scrub flag(s) set

  services:
    mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
    mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
    mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
    osd: 270 osds: 270 up (since 2m), 270 in (since 4h)
         flags noout,nobackfill,noscrub,nodeep-scrub

  data:
    pools:   4 pools, 5312 pgs
    objects: 75.87M objects, 287 TiB
    usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     5312 active+clean

# Reweight an OSD to 0
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
reweighted item id 0 name 'osd.0' to 0 in crush map

# Crush map changes - data movement is set up, no degraded PGs:
[root@xorphosd00 ~]# ceph -s
  cluster:
    id:     86d8a1b9-761b-4099-a960-6a303b951236
    health: HEALTH_WARN
            noout,nobackfill,noscrub,nodeep-scrub flag(s) set

  services:
    mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
    mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
    mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
    osd: 270 osds: 270 up (since 10m), 270 in (since 5h); 175 remapped pgs
         flags noout,nobackfill,noscrub,nodeep-scrub

  data:
    pools:   4 pools, 5312 pgs
    objects: 75.87M objects, 287 TiB
    usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     2562045/232996662 objects misplaced (1.100%)
             5137 active+clean
             172  active+remapped+backfilling
             3    active+remapped+backfill_wait

# Reweight it back to the original weight
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
reweighted item id 0 name 'osd.0' to 8 in crush map

# Cluster goes back to clean
[root@xorphosd00 ~]# ceph -s
  cluster:
    id:     86d8a1b9-761b-4099-a960-6a303b951236
    health: HEALTH_WARN
            noout,nobackfill,noscrub,nodeep-scrub flag(s) set

  services:
    mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
    mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
    mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
    osd: 270 osds: 270 up (since 11m), 270 in (since 5h)
         flags noout,nobackfill,noscrub,nodeep-scrub

  data:
    pools:   4 pools, 5312 pgs
    objects: 75.87M objects, 287 TiB
    usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     5312 active+clean

#
# Now the problematic case
#

# Stop an OSD
[root@xorphosd00 ~]# systemctl stop ceph-osd@0

# We get degraded PGs - as expected
[root@xorphosd00 ~]# ceph -s
  cluster:
    id:     86d8a1b9-761b-4099-a960-6a303b951236
    health: HEALTH_WARN
            noout,nobackfill,noscrub,nodeep-scrub flag(s) set
            1 osds down
            Degraded data redundancy: 873964/232996662 objects degraded (0.375%), 82 pgs degraded

  services:
    mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
    mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
    mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
    osd: 270 osds: 269 up (since 16s), 270 in (since 5h)
         flags noout,nobackfill,noscrub,nodeep-scrub

  data:
    pools:   4 pools, 5312 pgs
    objects: 75.87M objects, 287 TiB
    usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     873964/232996662 objects degraded (0.375%)
             5230 active+clean
             82   active+undersized+degraded
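At this point the degraded PGs and the OSD sets they map to could be inspected before touching any weights; a minimal sketch using standard status commands (<pgid> stands for one of the PG ids reported as degraded):

# Summary of what is degraded/undersized and why
ceph health detail

# List stuck degraded PGs together with their up/acting OSD sets
ceph pg dump_stuck degraded

# For one PG: which OSDs CRUSH maps it to and which currently serve it
ceph pg map <pgid>
ceph pg <pgid> query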
# Reweight the OSD to 0:
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
reweighted item id 0 name 'osd.0' to 0 in crush map

# Still degraded - as expected
[root@xorphosd00 ~]# ceph -s
  cluster:
    id:     86d8a1b9-761b-4099-a960-6a303b951236
    health: HEALTH_WARN
            noout,nobackfill,noscrub,nodeep-scrub flag(s) set
            1 osds down
            Degraded data redundancy: 873964/232996662 objects degraded (0.375%), 82 pgs degraded

  services:
    mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
    mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
    mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
    osd: 270 osds: 269 up (since 59s), 270 in (since 5h); 175 remapped pgs
         flags noout,nobackfill,noscrub,nodeep-scrub

  data:
    pools:   4 pools, 5312 pgs
    objects: 75.87M objects, 287 TiB
    usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     873964/232996662 objects degraded (0.375%)
             1688081/232996662 objects misplaced (0.725%)
             5137 active+clean
             93   active+remapped+backfilling
             82   active+undersized+degraded+remapped+backfilling

# Restarting the OSD
[root@xorphosd00 ~]# systemctl start ceph-osd@0

# And the PGs still stay degraded - THIS IS UNEXPECTED!!!
[root@xorphosd00 ~]# ceph -s
  cluster:
    id:     86d8a1b9-761b-4099-a960-6a303b951236
    health: HEALTH_WARN
            noout,nobackfill,noscrub,nodeep-scrub flag(s) set
            Degraded data redundancy: 873964/232996662 objects degraded (0.375%), 82 pgs degraded

  services:
    mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
    mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
    mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
    osd: 270 osds: 270 up (since 14s), 270 in (since 5h); 175 remapped pgs
         flags noout,nobackfill,noscrub,nodeep-scrub

  data:
    pools:   4 pools, 5312 pgs
    objects: 75.87M objects, 287 TiB
    usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     873964/232996662 objects degraded (0.375%)
             1688081/232996662 objects misplaced (0.725%)
             5137 active+clean
             93   active+remapped+backfilling
             82   active+undersized+degraded+remapped+backfilling

# Now for something even more odd - reweight the OSD back to its original weight
# and all the data gets magically FOUND again on that OSD!!!
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
reweighted item id 0 name 'osd.0' to 8 in crush map
[root@xorphosd00 ~]# ceph -s
  cluster:
    id:     86d8a1b9-761b-4099-a960-6a303b951236
    health: HEALTH_WARN
            noout,nobackfill,noscrub,nodeep-scrub flag(s) set

  services:
    mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
    mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
    mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
    osd: 270 osds: 270 up (since 51s), 270 in (since 5h)
         flags noout,nobackfill,noscrub,nodeep-scrub

  data:
    pools:   4 pools, 5312 pgs
    objects: 75.87M objects, 287 TiB
    usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     5312 active+clean

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx