Hi Andras,

To me it looks like osd.0 is not peering when it starts with crush weight 0. I would try forcing the re-peering with `ceph osd down osd.0` when the PGs are unexpectedly degraded. (E.g. start the OSD while its crush weight is 0, then observe that the PGs are still degraded, then force the re-peering -- does it help?)
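To spell that out, here is a rough sketch of the sequence I have in mind, reusing osd.0, its original crush weight of 8.0, and the commands from your example below (adjust the OSD id and weight to your setup):

    # Reproduce the problem: reweight to 0 while the OSD is down, then start it
    systemctl stop ceph-osd@0
    ceph osd crush reweight osd.0 0.0
    systemctl start ceph-osd@0
    ceph -s        # PGs unexpectedly remain undersized+degraded

    # Force re-peering: mark the OSD down in the osdmap; the running daemon
    # should notice, mark itself back up, and re-peer its PGs
    ceph osd down osd.0
    ceph -s        # do the degraded PGs recover now?

    # Afterwards, restore the original weight
    ceph osd crush reweight osd.0 8.0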
Otherwise I agree, to me this is an unexpected behaviour -- maybe open a ticket?

Cheers, Dan

P.S. For some reason all of your mails are repeatedly landing in my spam folder. I think this is the reason:

ARC-Authentication-Results: i=1; mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@flatironinstitute.org header.s=google header.b=NvX+wag9;
       spf=fail (google.com: domain of ceph-users-bounces@xxxxxxx does not designate 217.70.178.232 as permitted sender) smtp.mailfrom=ceph-users-bounces@xxxxxxx;
       dmarc=fail (p=REJECT sp=REJECT dis=QUARANTINE) header.from=flatironinstitute.org

On Mon, May 18, 2020 at 10:26 PM Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> In a recent cluster reorganization, we ended up with a lot of
> undersized/degraded PGs and a day of recovery from them, when all we
> expected was moving some data around. After retracing my steps, I found
> something odd. If I crush reweight an OSD to 0 while it is down - it
> results in the PGs of that OSD ending up degraded even after the OSD is
> restarted. If I do the same reweighting while the OSD is up - data gets
> moved without any degraded/undersized states. I would not expect this -
> so I wonder if this is a bug or is somehow intended. This is on ceph
> Nautilus 14.2.8. Below are the details.
>
> Andras
>
>
> First the case that works as I would expect:
>
> # Healthy cluster ...
> [root@xorphosd00 ~]# ceph -s
>   cluster:
>     id:     86d8a1b9-761b-4099-a960-6a303b951236
>     health: HEALTH_WARN
>             noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>
>   services:
>     mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>     mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>     mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>     osd: 270 osds: 270 up (since 2m), 270 in (since 4h)
>          flags noout,nobackfill,noscrub,nodeep-scrub
>
>   data:
>     pools:   4 pools, 5312 pgs
>     objects: 75.87M objects, 287 TiB
>     usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>     pgs:     5312 active+clean
>
> # Reweight an OSD to 0
> [root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
> reweighted item id 0 name 'osd.0' to 0 in crush map
>
> # Crush map changes - data movement is set up, no degraded PGs:
> [root@xorphosd00 ~]# ceph -s
>   cluster:
>     id:     86d8a1b9-761b-4099-a960-6a303b951236
>     health: HEALTH_WARN
>             noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>
>   services:
>     mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>     mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>     mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>     osd: 270 osds: 270 up (since 10m), 270 in (since 5h); 175 remapped pgs
>          flags noout,nobackfill,noscrub,nodeep-scrub
>
>   data:
>     pools:   4 pools, 5312 pgs
>     objects: 75.87M objects, 287 TiB
>     usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>     pgs:     2562045/232996662 objects misplaced (1.100%)
>              5137 active+clean
>              172  active+remapped+backfilling
>              3    active+remapped+backfill_wait
>
> # Reweight it back to the original weight
> [root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
>
> # Cluster goes back to clean
> reweighted item id 0 name 'osd.0' to 8 in crush map
> [root@xorphosd00 ~]# ceph -s
>   cluster:
>     id:     86d8a1b9-761b-4099-a960-6a303b951236
>     health: HEALTH_WARN
>             noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>
>   services:
>     mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>     mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>     mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>     osd: 270 osds: 270 up (since 11m), 270 in (since 5h)
>          flags noout,nobackfill,noscrub,nodeep-scrub
>
>   data:
>     pools:   4 pools, 5312 pgs
>     objects: 75.87M objects, 287 TiB
>     usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>     pgs:     5312 active+clean
>
>
> #
> # Now the problematic case
> #
>
> # Stop an OSD
> [root@xorphosd00 ~]# systemctl stop ceph-osd@0
>
> # We get degraded PGs - as expected
> [root@xorphosd00 ~]# ceph -s
>   cluster:
>     id:     86d8a1b9-761b-4099-a960-6a303b951236
>     health: HEALTH_WARN
>             noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>             1 osds down
>             Degraded data redundancy: 873964/232996662 objects degraded
>             (0.375%), 82 pgs degraded
>
>   services:
>     mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>     mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>     mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>     osd: 270 osds: 269 up (since 16s), 270 in (since 5h)
>          flags noout,nobackfill,noscrub,nodeep-scrub
>
>   data:
>     pools:   4 pools, 5312 pgs
>     objects: 75.87M objects, 287 TiB
>     usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>     pgs:     873964/232996662 objects degraded (0.375%)
>              5230 active+clean
>              82   active+undersized+degraded
>
> # Reweight the OSD to 0:
> [root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
>
> # Still degraded - as expected
> reweighted item id 0 name 'osd.0' to 0 in crush map
> [root@xorphosd00 ~]# ceph -s
>   cluster:
>     id:     86d8a1b9-761b-4099-a960-6a303b951236
>     health: HEALTH_WARN
>             noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>             1 osds down
>             Degraded data redundancy: 873964/232996662 objects degraded
>             (0.375%), 82 pgs degraded
>
>   services:
>     mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>     mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>     mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>     osd: 270 osds: 269 up (since 59s), 270 in (since 5h); 175 remapped pgs
>          flags noout,nobackfill,noscrub,nodeep-scrub
>
>   data:
>     pools:   4 pools, 5312 pgs
>     objects: 75.87M objects, 287 TiB
>     usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>     pgs:     873964/232996662 objects degraded (0.375%)
>              1688081/232996662 objects misplaced (0.725%)
>              5137 active+clean
>              93   active+remapped+backfilling
>              82   active+undersized+degraded+remapped+backfilling
>
> # Restarting the OSD
> [root@xorphosd00 ~]# systemctl start ceph-osd@0
>
> # And the PGs still stay degraded - THIS IS UNEXPECTED!!!
> [root@xorphosd00 ~]# ceph -s
>   cluster:
>     id:     86d8a1b9-761b-4099-a960-6a303b951236
>     health: HEALTH_WARN
>             noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>             Degraded data redundancy: 873964/232996662 objects degraded
>             (0.375%), 82 pgs degraded
>
>   services:
>     mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>     mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>     mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>     osd: 270 osds: 270 up (since 14s), 270 in (since 5h); 175 remapped pgs
>          flags noout,nobackfill,noscrub,nodeep-scrub
>
>   data:
>     pools:   4 pools, 5312 pgs
>     objects: 75.87M objects, 287 TiB
>     usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>     pgs:     873964/232996662 objects degraded (0.375%)
>              1688081/232996662 objects misplaced (0.725%)
>              5137 active+clean
>              93   active+remapped+backfilling
>              82   active+undersized+degraded+remapped+backfilling
>
> # Now for something even more odd - reweight the OSD back to its original weight
> # and all the data gets magically FOUND again on that OSD!!!
> [root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
> reweighted item id 0 name 'osd.0' to 8 in crush map
> [root@xorphosd00 ~]# ceph -s
>   cluster:
>     id:     86d8a1b9-761b-4099-a960-6a303b951236
>     health: HEALTH_WARN
>             noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>
>   services:
>     mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>     mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>     mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>     osd: 270 osds: 270 up (since 51s), 270 in (since 5h)
>          flags noout,nobackfill,noscrub,nodeep-scrub
>
>   data:
>     pools:   4 pools, 5312 pgs
>     objects: 75.87M objects, 287 TiB
>     usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>     pgs:     5312 active+clean
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx