Hi Andras,

To me it looks like osd.0 is not peering when it starts with crush weight 0. I would try forcing the re-peering with `ceph osd down osd.0` when the PGs are unexpectedly degraded. (E.g. start the OSD while its crush weight is 0, then observe that the PGs are still degraded, then force the re-peering -- does it help?)
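To spell that out, here is a rough sketch of the sequence I have in mind, reusing osd.0, its original crush weight of 8.0, and the commands from your example below (adjust the OSD id and weight to your setup):

    # Reproduce the problem: reweight to 0 while the OSD is down, then start it
    systemctl stop ceph-osd@0
    ceph osd crush reweight osd.0 0.0
    systemctl start ceph-osd@0
    ceph -s        # PGs unexpectedly remain undersized+degraded

    # Force re-peering: mark the OSD down in the osdmap; the running daemon
    # should notice, mark itself back up, and re-peer its PGs
    ceph osd down osd.0
    ceph -s        # do the degraded PGs recover now?

    # Afterwards, restore the original weight
    ceph osd crush reweight osd.0 8.0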
Otherwise I agree, to me this is an unexpected behaviour -- maybe open a ticket?

Cheers, Dan

P.S. For some reason all of your mails are repeatedly landing in my spam folder. I think this is the reason:

ARC-Authentication-Results: i=1; mx.google.com;
       dkim=neutral (body hash did not verify) header.i=@flatironinstitute.org header.s=google header.b=NvX+wag9;
       spf=fail (google.com: domain of ceph-users-bounces@xxxxxxx does not designate 217.70.178.232 as permitted sender) smtp.mailfrom=ceph-users-bounces@xxxxxxx;
       dmarc=fail (p=REJECT sp=REJECT dis=QUARANTINE) header.from=flatironinstitute.org

On Mon, May 18, 2020 at 10:26 PM Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> In a recent cluster reorganization, we ended up with a lot of
> undersized/degraded PGs and a day of recovery from them, when all we
> expected was moving some data around. After retracing my steps, I found
> something odd. If I crush reweight an OSD to 0 while it is down - it
> results in the PGs of that OSD ending up degraded even after the OSD is
> restarted. If I do the same reweighting while the OSD is up - data gets
> moved without any degraded/undersized states. I would not expect this -
> so I wonder if this is a bug or is somehow intended. This is on ceph
> Nautilus 14.2.8. Below are the details.
>
> Andras
>
>
> First the case that works as I would expect:
>
> # Healthy cluster ...
> [root@xorphosd00 ~]# ceph -s
>   cluster:
>     id:     86d8a1b9-761b-4099-a960-6a303b951236
>     health: HEALTH_WARN
>             noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>
>   services:
>     mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>     mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>     mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>     osd: 270 osds: 270 up (since 2m), 270 in (since 4h)
>          flags noout,nobackfill,noscrub,nodeep-scrub
>
>   data:
>     pools:   4 pools, 5312 pgs
>     objects: 75.87M objects, 287 TiB
>     usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>     pgs:     5312 active+clean
>
> # Reweight an OSD to 0
> [root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
> reweighted item id 0 name 'osd.0' to 0 in crush map
>
> # Crush map changes - data movement is set up, no degraded PGs:
> [root@xorphosd00 ~]# ceph -s
>   cluster:
>     id:     86d8a1b9-761b-4099-a960-6a303b951236
>     health: HEALTH_WARN
>             noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>
>   services:
>     mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>     mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>     mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>     osd: 270 osds: 270 up (since 10m), 270 in (since 5h); 175 remapped pgs
>          flags noout,nobackfill,noscrub,nodeep-scrub
>
>   data:
>     pools:   4 pools, 5312 pgs
>     objects: 75.87M objects, 287 TiB
>     usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>     pgs:     2562045/232996662 objects misplaced (1.100%)
>              5137 active+clean
>              172  active+remapped+backfilling
>              3    active+remapped+backfill_wait
>
> # Reweight it back to the original weight
> [root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
>
> # Cluster goes back to clean
> reweighted item id 0 name 'osd.0' to 8 in crush map
> [root@xorphosd00 ~]# ceph -s
>   cluster:
>     id:     86d8a1b9-761b-4099-a960-6a303b951236
>     health: HEALTH_WARN
>             noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>
>   services:
>     mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>     mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>     mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>     osd: 270 osds: 270 up (since 11m), 270 in (since 5h)
>          flags noout,nobackfill,noscrub,nodeep-scrub
>
>   data:
>     pools:   4 pools, 5312 pgs
>     objects: 75.87M objects, 287 TiB
>     usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>     pgs:     5312 active+clean
>
>
> #
> # Now the problematic case
> #
>
> # Stop an OSD
> [root@xorphosd00 ~]# systemctl stop ceph-osd@0
>
> # We get degraded PGs - as expected
> [root@xorphosd00 ~]# ceph -s
>   cluster:
>     id:     86d8a1b9-761b-4099-a960-6a303b951236
>     health: HEALTH_WARN
>             noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>             1 osds down
>             Degraded data redundancy: 873964/232996662 objects degraded
>             (0.375%), 82 pgs degraded
>
>   services:
>     mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>     mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>     mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>     osd: 270 osds: 269 up (since 16s), 270 in (since 5h)
>          flags noout,nobackfill,noscrub,nodeep-scrub
>
>   data:
>     pools:   4 pools, 5312 pgs
>     objects: 75.87M objects, 287 TiB
>     usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>     pgs:     873964/232996662 objects degraded (0.375%)
>              5230 active+clean
>              82   active+undersized+degraded
>
> # Reweight the OSD to 0:
> [root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
>
> # Still degraded - as expected
> reweighted item id 0 name 'osd.0' to 0 in crush map
> [root@xorphosd00 ~]# ceph -s
>   cluster:
>     id:     86d8a1b9-761b-4099-a960-6a303b951236
>     health: HEALTH_WARN
>             noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>             1 osds down
>             Degraded data redundancy: 873964/232996662 objects degraded
>             (0.375%), 82 pgs degraded
>
>   services:
>     mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>     mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>     mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>     osd: 270 osds: 269 up (since 59s), 270 in (since 5h); 175 remapped pgs
>          flags noout,nobackfill,noscrub,nodeep-scrub
>
>   data:
>     pools:   4 pools, 5312 pgs
>     objects: 75.87M objects, 287 TiB
>     usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>     pgs:     873964/232996662 objects degraded (0.375%)
>              1688081/232996662 objects misplaced (0.725%)
>              5137 active+clean
>              93   active+remapped+backfilling
>              82   active+undersized+degraded+remapped+backfilling
>
> # Restarting the OSD
> [root@xorphosd00 ~]# systemctl start ceph-osd@0
>
> # And the PGs still stay degraded - THIS IS UNEXPECTED!!!
> [root@xorphosd00 ~]# ceph -s
>   cluster:
>     id:     86d8a1b9-761b-4099-a960-6a303b951236
>     health: HEALTH_WARN
>             noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>             Degraded data redundancy: 873964/232996662 objects degraded
>             (0.375%), 82 pgs degraded
>
>   services:
>     mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>     mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>     mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>     osd: 270 osds: 270 up (since 14s), 270 in (since 5h); 175 remapped pgs
>          flags noout,nobackfill,noscrub,nodeep-scrub
>
>   data:
>     pools:   4 pools, 5312 pgs
>     objects: 75.87M objects, 287 TiB
>     usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>     pgs:     873964/232996662 objects degraded (0.375%)
>              1688081/232996662 objects misplaced (0.725%)
>              5137 active+clean
>              93   active+remapped+backfilling
>              82   active+undersized+degraded+remapped+backfilling
>
> # Now for something even more odd - reweight the OSD back to its original weight
> # and all the data gets magically FOUND again on that OSD!!!
> [root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
> reweighted item id 0 name 'osd.0' to 8 in crush map
> [root@xorphosd00 ~]# ceph -s
>   cluster:
>     id:     86d8a1b9-761b-4099-a960-6a303b951236
>     health: HEALTH_WARN
>             noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>
>   services:
>     mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>     mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>     mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>     osd: 270 osds: 270 up (since 51s), 270 in (since 5h)
>          flags noout,nobackfill,noscrub,nodeep-scrub
>
>   data:
>     pools:   4 pools, 5312 pgs
>     objects: 75.87M objects, 287 TiB
>     usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>     pgs:     5312 active+clean
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx