In a recent cluster reorganization we ended up with a lot of
undersized/degraded PGs and a day of recovery from them, when all we
expected was to move some data around. After retracing my steps, I found
something odd: if I crush reweight an OSD to 0 while it is down, the PGs
on that OSD end up degraded even after the OSD is restarted. If I do the
same reweighting while the OSD is up, the data gets moved without any
degraded/undersized states. I would not expect this, so I wonder whether
it is a bug or somehow intended. This is on Ceph Nautilus 14.2.8. Below
are the details.
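In short, using osd.0 and its original crush weight of 8.0 (all taken from
the full output below), the problematic sequence boils down to:

# condensed reproduction of the steps shown in detail below
systemctl stop ceph-osd@0            # PGs on osd.0 go degraded, as expected
ceph osd crush reweight osd.0 0.0    # PGs get remapped but stay degraded
systemctl start ceph-osd@0           # PGs remain degraded - unexpected
ceph osd crush reweight osd.0 8.0    # everything goes back to active+clean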
Andras
First the case that works as I would expect:
# Healthy cluster ...
[root@xorphosd00 ~]# ceph -s
  cluster:
    id:     86d8a1b9-761b-4099-a960-6a303b951236
    health: HEALTH_WARN
            noout,nobackfill,noscrub,nodeep-scrub flag(s) set

  services:
    mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
    mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
    mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
    osd: 270 osds: 270 up (since 2m), 270 in (since 4h)
         flags noout,nobackfill,noscrub,nodeep-scrub

  data:
    pools:   4 pools, 5312 pgs
    objects: 75.87M objects, 287 TiB
    usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     5312 active+clean
# Reweight an OSD to 0
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
reweighted item id 0 name 'osd.0' to 0 in crush map
# Crush map changes - data movement is set up, no degraded PGs:
[root@xorphosd00 ~]# ceph -s
  cluster:
    id:     86d8a1b9-761b-4099-a960-6a303b951236
    health: HEALTH_WARN
            noout,nobackfill,noscrub,nodeep-scrub flag(s) set

  services:
    mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
    mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
    mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
    osd: 270 osds: 270 up (since 10m), 270 in (since 5h); 175 remapped pgs
         flags noout,nobackfill,noscrub,nodeep-scrub

  data:
    pools:   4 pools, 5312 pgs
    objects: 75.87M objects, 287 TiB
    usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     2562045/232996662 objects misplaced (1.100%)
             5137 active+clean
             172  active+remapped+backfilling
             3    active+remapped+backfill_wait
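(As a cross-check at this point, listing PGs by state should show the 175
remapped PGs and nothing degraded; something along these lines:)

# sketch: confirm the data is only misplaced, not degraded
ceph pg ls remapped | head
ceph pg ls degraded        # should list no PGs at this stage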
# Reweight it back to the original weight
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
reweighted item id 0 name 'osd.0' to 8 in crush map
# Cluster goes back to clean
[root@xorphosd00 ~]# ceph -s
  cluster:
    id:     86d8a1b9-761b-4099-a960-6a303b951236
    health: HEALTH_WARN
            noout,nobackfill,noscrub,nodeep-scrub flag(s) set

  services:
    mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
    mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
    mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
    osd: 270 osds: 270 up (since 11m), 270 in (since 5h)
         flags noout,nobackfill,noscrub,nodeep-scrub

  data:
    pools:   4 pools, 5312 pgs
    objects: 75.87M objects, 287 TiB
    usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     5312 active+clean
#
# Now the problematic case
#
# Stop an OSD
[root@xorphosd00 ~]# systemctl stop ceph-osd@0
# We get degraded PGs - as expected
[root@xorphosd00 ~]# ceph -s
  cluster:
    id:     86d8a1b9-761b-4099-a960-6a303b951236
    health: HEALTH_WARN
            noout,nobackfill,noscrub,nodeep-scrub flag(s) set
            1 osds down
            Degraded data redundancy: 873964/232996662 objects degraded (0.375%), 82 pgs degraded

  services:
    mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
    mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
    mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
    osd: 270 osds: 269 up (since 16s), 270 in (since 5h)
         flags noout,nobackfill,noscrub,nodeep-scrub

  data:
    pools:   4 pools, 5312 pgs
    objects: 75.87M objects, 287 TiB
    usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     873964/232996662 objects degraded (0.375%)
             5230 active+clean
             82   active+undersized+degraded
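(With noout set the OSD stays in even though it is down, and its crush
weight is still 8 at this point; e.g.:)

# sketch: osd.0 should show as down, crush weight 8, reweight 1.00000
ceph osd tree | grep -w 'osd\.0'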
# Reweight the OSD to 0:
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
reweighted item id 0 name 'osd.0' to 0 in crush map
# Still degraded - as expected
[root@xorphosd00 ~]# ceph -s
  cluster:
    id:     86d8a1b9-761b-4099-a960-6a303b951236
    health: HEALTH_WARN
            noout,nobackfill,noscrub,nodeep-scrub flag(s) set
            1 osds down
            Degraded data redundancy: 873964/232996662 objects degraded (0.375%), 82 pgs degraded

  services:
    mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
    mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
    mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
    osd: 270 osds: 269 up (since 59s), 270 in (since 5h); 175 remapped pgs
         flags noout,nobackfill,noscrub,nodeep-scrub

  data:
    pools:   4 pools, 5312 pgs
    objects: 75.87M objects, 287 TiB
    usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     873964/232996662 objects degraded (0.375%)
             1688081/232996662 objects misplaced (0.725%)
             5137 active+clean
             93   active+remapped+backfilling
             82   active+undersized+degraded+remapped+backfilling
# Restarting the OSD
[root@xorphosd00 ~]# systemctl start ceph-osd@0
# And the PGs still stay degraded - THIS IS UNEXPECTED!!!
[root@xorphosd00 ~]# ceph -s
  cluster:
    id:     86d8a1b9-761b-4099-a960-6a303b951236
    health: HEALTH_WARN
            noout,nobackfill,noscrub,nodeep-scrub flag(s) set
            Degraded data redundancy: 873964/232996662 objects degraded (0.375%), 82 pgs degraded

  services:
    mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
    mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
    mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
    osd: 270 osds: 270 up (since 14s), 270 in (since 5h); 175 remapped pgs
         flags noout,nobackfill,noscrub,nodeep-scrub

  data:
    pools:   4 pools, 5312 pgs
    objects: 75.87M objects, 287 TiB
    usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     873964/232996662 objects degraded (0.375%)
             1688081/232996662 objects misplaced (0.725%)
             5137 active+clean
             93   active+remapped+backfilling
             82   active+undersized+degraded+remapped+backfilling
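(To dig into why these 82 PGs stay degraded even with osd.0 back up, one
could compare the up and acting sets of one of them; the PG id below is
just a placeholder, pick one from the listing:)

# sketch: inspect a degraded PG and the state of osd.0
ceph pg ls degraded
ceph pg 2.1a query | less           # placeholder pg id - compare the "up" and "acting" sets
ceph osd tree | grep -w 'osd\.0'    # osd.0 is up again, but its crush weight is now 0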
# Now for something even more odd - reweight the OSD back to its original weight
# and all the data gets magically FOUND again on that OSD!!!
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
reweighted item id 0 name 'osd.0' to 8 in crush map
[root@xorphosd00 ~]# ceph -s
  cluster:
    id:     86d8a1b9-761b-4099-a960-6a303b951236
    health: HEALTH_WARN
            noout,nobackfill,noscrub,nodeep-scrub flag(s) set

  services:
    mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
    mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
    mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
    osd: 270 osds: 270 up (since 51s), 270 in (since 5h)
         flags noout,nobackfill,noscrub,nodeep-scrub

  data:
    pools:   4 pools, 5312 pgs
    objects: 75.87M objects, 287 TiB
    usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
    pgs:     5312 active+clean
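For now, the only ordering that avoids the degraded window for me is the
one from the first case, i.e. draining the OSD while it is still up:

# sketch: drain first, stop the daemon only after backfill has finished
ceph osd crush reweight osd.0 0.0
# ... wait for the remapped/backfilling PGs to go active+clean ...
systemctl stop ceph-osd@0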