Re: Reweighting OSD while down results in undersized+degraded PGs

Hi Dan,

Unfortunately 'ceph osd down osd.0' doesn't help - the OSD is marked down and comes back up soon after, but it still doesn't peer.  I also tried reweighting the OSD to half its original weight (4.0) instead of 0.0, and about half of its PGs stay degraded, so this isn't specific to zero weight.  It looks like when the OSD starts, it gets a crush map and peers only the PGs it is responsible for in that map; it doesn't appear to try to peer PGs it still holds data for but is no longer mapped to in the current crush map.  I guess I don't fully understand what the intended behavior of the design is here.
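
For reference, the half-weight test was essentially this sequence (a sketch - osd.0's original crush weight is 8.0, as in the transcripts below; output omitted):

# Reweight to half the original weight while the OSD is down
[root@xorphosd00 ~]# systemctl stop ceph-osd@0
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 4.0
# Restart the OSD - roughly half of its PGs stay undersized+degraded
[root@xorphosd00 ~]# systemctl start ceph-osd@0
[root@xorphosd00 ~]# ceph pg ls degraded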

I also get a lot of this mailing list going to spam.  The reason has to do with how the list is set up: it resends all messages keeping the original sender (From:) intact, using servers that are unrelated to the sender.  So the messages often look like spam - mail sent in my name from a server that isn't authorized to send for my domain, which fails the SPF and DMARC checks you quoted.  I'm not sure what can be done about it.
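
For anyone who wants to verify this, the relevant DNS policies can be queried directly (plain TXT lookups; nothing here is specific to the list):

# SPF record for the sender domain
dig +short TXT flatironinstitute.org
# DMARC policy (the p=REJECT in the quoted header is what makes receivers junk the relayed mail)
dig +short TXT _dmarc.flatironinstitute.org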

Andras


On 5/20/20 3:41 AM, Dan van der Ster wrote:
Hi Andras,

To me it looks like osd.0 is not peering when it starts with crush weight 0.

I would try forcing the re-peering with `ceph osd down osd.0` when the
PGs are unexpectedly degraded (e.g. start the osd while the crush weight
is 0, then observe that the PGs are still degraded, then force the
re-peering -- does it help?)
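
Concretely, something like this (just a sketch -- `ceph pg ls degraded` is a quick way to see whether the degraded count drops once the OSD re-peers):

# With the osd up but PGs still degraded, force a re-peer
ceph osd down osd.0
# The osd should be marked up again within seconds; then check
ceph pg ls degraded | head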

Otherwise I agree, to me this is an unexpected behaviour -- maybe open a ticket?

Cheers, Dan

P.S. For some reason all of your mails are repeatedly landing in my
spam folder. I think this is the reason:

ARC-Authentication-Results: i=1; mx.google.com;
        dkim=neutral (body hash did not verify)
        header.i=@flatironinstitute.org header.s=google header.b=NvX+wag9;
        spf=fail (google.com: domain of ceph-users-bounces@xxxxxxx does
        not designate 217.70.178.232 as permitted sender)
        smtp.mailfrom=ceph-users-bounces@xxxxxxx;
        dmarc=fail (p=REJECT sp=REJECT dis=QUARANTINE)
        header.from=flatironinstitute.org


On Mon, May 18, 2020 at 10:26 PM Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
In a recent cluster reorganization, we ended up with a lot of
undersized/degraded PGs and a day of recovery, when all we expected was
to move some data around.  After retracing my steps, I found something
odd: if I crush reweight an OSD to 0 while it is down, the PGs of that
OSD end up degraded even after the OSD is restarted.  If I do the same
reweighting while the OSD is up, the data gets moved without any
degraded/undersized states.  I would not expect this, so I wonder
whether it is a bug or somehow intended.  This is on Ceph Nautilus
14.2.8.  Details below.

Andras


First the case that works as I would expect:

# Healthy cluster ...
[root@xorphosd00 ~]# ceph -s
    cluster:
      id:     86d8a1b9-761b-4099-a960-6a303b951236
      health: HEALTH_WARN
              noout,nobackfill,noscrub,nodeep-scrub flag(s) set

    services:
      mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
      mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
      mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
      osd: 270 osds: 270 up (since 2m), 270 in (since 4h)
           flags noout,nobackfill,noscrub,nodeep-scrub

    data:
      pools:   4 pools, 5312 pgs
      objects: 75.87M objects, 287 TiB
      usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
      pgs:     5312 active+clean

# Reweight an OSD to 0
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
reweighted item id 0 name 'osd.0' to 0 in crush map

# Crush map changes - data movement is set up, no degraded PGs:
[root@xorphosd00 ~]# ceph -s
    cluster:
      id:     86d8a1b9-761b-4099-a960-6a303b951236
      health: HEALTH_WARN
              noout,nobackfill,noscrub,nodeep-scrub flag(s) set

    services:
      mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
      mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
      mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
      osd: 270 osds: 270 up (since 10m), 270 in (since 5h); 175 remapped pgs
           flags noout,nobackfill,noscrub,nodeep-scrub

    data:
      pools:   4 pools, 5312 pgs
      objects: 75.87M objects, 287 TiB
      usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
      pgs:     2562045/232996662 objects misplaced (1.100%)
               5137 active+clean
               172  active+remapped+backfilling
               3    active+remapped+backfill_wait

# Reweight it back to the original weight
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
reweighted item id 0 name 'osd.0' to 8 in crush map

# Cluster goes back to clean
[root@xorphosd00 ~]# ceph -s
    cluster:
      id:     86d8a1b9-761b-4099-a960-6a303b951236
      health: HEALTH_WARN
              noout,nobackfill,noscrub,nodeep-scrub flag(s) set

    services:
      mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
      mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
      mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
      osd: 270 osds: 270 up (since 11m), 270 in (since 5h)
           flags noout,nobackfill,noscrub,nodeep-scrub

    data:
      pools:   4 pools, 5312 pgs
      objects: 75.87M objects, 287 TiB
      usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
      pgs:     5312 active+clean




#
# Now the problematic case
#

# Stop an OSD
[root@xorphosd00 ~]# systemctl stop ceph-osd@0

# We get degraded PGs - as expected
[root@xorphosd00 ~]# ceph -s
    cluster:
      id:     86d8a1b9-761b-4099-a960-6a303b951236
      health: HEALTH_WARN
              noout,nobackfill,noscrub,nodeep-scrub flag(s) set
              1 osds down
              Degraded data redundancy: 873964/232996662 objects degraded
(0.375%), 82 pgs degraded

    services:
      mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
      mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
      mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
      osd: 270 osds: 269 up (since 16s), 270 in (since 5h)
           flags noout,nobackfill,noscrub,nodeep-scrub

    data:
      pools:   4 pools, 5312 pgs
      objects: 75.87M objects, 287 TiB
      usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
      pgs:     873964/232996662 objects degraded (0.375%)
               5230 active+clean
               82   active+undersized+degraded

# Reweight the OSD to 0:
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
reweighted item id 0 name 'osd.0' to 0 in crush map

# Still degraded - as expected
[root@xorphosd00 ~]# ceph -s
    cluster:
      id:     86d8a1b9-761b-4099-a960-6a303b951236
      health: HEALTH_WARN
              noout,nobackfill,noscrub,nodeep-scrub flag(s) set
              1 osds down
              Degraded data redundancy: 873964/232996662 objects degraded
(0.375%), 82 pgs degraded

    services:
      mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
      mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
      mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
      osd: 270 osds: 269 up (since 59s), 270 in (since 5h); 175 remapped pgs
           flags noout,nobackfill,noscrub,nodeep-scrub

    data:
      pools:   4 pools, 5312 pgs
      objects: 75.87M objects, 287 TiB
      usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
      pgs:     873964/232996662 objects degraded (0.375%)
               1688081/232996662 objects misplaced (0.725%)
               5137 active+clean
               93   active+remapped+backfilling
               82   active+undersized+degraded+remapped+backfilling

# Restarting the OSD
[root@xorphosd00 ~]# systemctl start ceph-osd@0

# And the PGs still stay degraded - THIS IS UNEXPECTED!!!
[root@xorphosd00 ~]# ceph -s
    cluster:
      id:     86d8a1b9-761b-4099-a960-6a303b951236
      health: HEALTH_WARN
              noout,nobackfill,noscrub,nodeep-scrub flag(s) set
              Degraded data redundancy: 873964/232996662 objects degraded
(0.375%), 82 pgs degraded

    services:
      mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
      mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
      mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
      osd: 270 osds: 270 up (since 14s), 270 in (since 5h); 175 remapped pgs
           flags noout,nobackfill,noscrub,nodeep-scrub

    data:
      pools:   4 pools, 5312 pgs
      objects: 75.87M objects, 287 TiB
      usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
      pgs:     873964/232996662 objects degraded (0.375%)
               1688081/232996662 objects misplaced (0.725%)
               5137 active+clean
               93   active+remapped+backfilling
               82   active+undersized+degraded+remapped+backfilling

# Now for something even more odd - reweight the OSD back to its
# original weight, and all the data gets magically FOUND again on that OSD!!!
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
reweighted item id 0 name 'osd.0' to 8 in crush map
[root@xorphosd00 ~]# ceph -s
    cluster:
      id:     86d8a1b9-761b-4099-a960-6a303b951236
      health: HEALTH_WARN
              noout,nobackfill,noscrub,nodeep-scrub flag(s) set

    services:
      mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
      mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
      mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
      osd: 270 osds: 270 up (since 51s), 270 in (since 5h)
           flags noout,nobackfill,noscrub,nodeep-scrub

    data:
      pools:   4 pools, 5312 pgs
      objects: 75.87M objects, 287 TiB
      usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
      pgs:     5312 active+clean

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



