Re: Reweighting OSD while down results in undersized+degraded PGs

Hi Frank,

Thanks for the explanation - I wasn't aware of this subtle point. So when some OSDs are down, one has to be very careful about changing the cluster.  I guess one could even end up with incomplete PGs this way, which ceph can't recover from in an automated fashion?
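
One defensive measure (the file path is just an example): snapshot the crush map before any maintenance that touches weights or topology, so the exact pre-change state can be restored even while OSDs are down:

ceph osd getcrushmap -o /root/crushmap.before-maintenance   # take a backup first
# ... if PGs end up degraded after changes made while OSDs were down ...
ceph osd setcrushmap -i /root/crushmap.before-maintenance   # restore the exact previous map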

Andras


On 5/19/20 11:23 AM, Frank Schilder wrote:
Hi Andras,

the cluster map and the crush map are not the same thing. If you change the crush map while the cluster is in a degraded state, you effectively modify that history of cluster maps and have to live with the consequences (keeping history across crush map changes is limited to up+in OSDs). Only OSDs that can peer are able to respond to changes of the crush map.
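
For illustration (file paths are just examples), the two can be inspected separately; the osdmap is what gets versioned per epoch, and the crush map is only one piece of it:

# current osdmap epoch (the cluster map Ceph versions over time)
ceph osd dump | head -n 5
# extract and decompile the crush map embedded in that osdmap
ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt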

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx>
Sent: 19 May 2020 15:57:49
To: Frank Schilder; ceph-users
Subject: Re:  Reweighting OSD while down results in undersized+degraded PGs

Hi Frank,

My understanding was that once a cluster is in a degraded state (an OSD
is down), ceph stores all changed cluster maps until the cluster is
healthy again, precisely so that missing objects can be found. If
there is a real disaster of some kind, and many OSDs go up and down at
various times, there has to be a way of retracing where parts of a PG
were in the past.  And ... most of the time this does work - I just
don't understand why this particular scenario is different.
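
For example (the pg id below is a placeholder), the retained history can be seen in a PG query; while a PG is degraded or peering, the recovery_state section lists the past intervals and the acting sets they had:

ceph pg 1.2f query     # look at the "recovery_state" / past intervals section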

Andras


On 5/19/20 3:49 AM, Frank Schilder wrote:
Hi Andras,

I made exactly the same observation in another scenario. I added some OSDs while other OSDs were down.

This is expected.

The crush map is an a-priori algorithm for computing the location of objects without contacting a central server. Hence, *any* change of the crush map while an OSD is down will lead to a change of the locations of the objects/PGs of the down OSD. Consequently, these objects/PGs become degraded, because no up OSD reports them. Once peering is over after setting the weight to 0, the cluster must assume they are lost.

Changing a weight is a change of the crush map.
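
This is easy to verify on a test cluster (osd id and paths are just examples): decompile the crush map before and after a reweight and diff the two; the item's weight in its host bucket changes:

ceph osd getcrushmap -o /tmp/before.bin; crushtool -d /tmp/before.bin -o /tmp/before.txt
ceph osd crush reweight osd.0 0.0
ceph osd getcrushmap -o /tmp/after.bin;  crushtool -d /tmp/after.bin -o /tmp/after.txt
diff /tmp/before.txt /tmp/after.txt      # shows the changed "item osd.0 weight ..." line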

The way to get the cluster to re-scan after starting the down OSDs is to restore the crush map to exactly the state as it was before the OSD went down. In your case,

- starting the down OSD,
- setting weight back to original value

will find all missing objects. After the cluster is clean, set the weight back to 0 and now the OSD will be vacated as expected.
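
As a command sketch of that sequence (using the osd id and weight from the example below; adjust to your cluster):

systemctl start ceph-osd@0               # bring the down OSD back up
ceph osd crush reweight osd.0 8.0        # restore the original crush weight
# wait until ceph -s reports all PGs active+clean, then drain:
ceph osd crush reweight osd.0 0.0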

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx>
Sent: 18 May 2020 22:25:37
To: ceph-users
Subject:  Reweighting OSD while down results in undersized+degraded PGs

In a recent cluster reorganization, we ended up with a lot of
undersized/degraded PGs and a day of recovery from them, when all we
expected was moving some data around.  After retracing my steps, I found
something odd.  If I crush reweight an OSD to 0 while it is down - it
results in the PGs of that OSD ending up degraded even after the OSD is
restarted.  If I do the same reweighting while the OSD is up - data gets
moved without any degraded/undersized states. I would not expect this -
so I wonder if this is a bug or is somehow intended.  This is on ceph
Nautilus 14.2.8.  Below are the details.

Andras


First the case that works as I would expect:

# Healthy cluster ...
[root@xorphosd00 ~]# ceph -s
     cluster:
       id:     86d8a1b9-761b-4099-a960-6a303b951236
       health: HEALTH_WARN
               noout,nobackfill,noscrub,nodeep-scrub flag(s) set

     services:
       mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
       mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
       mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
       osd: 270 osds: 270 up (since 2m), 270 in (since 4h)
            flags noout,nobackfill,noscrub,nodeep-scrub

     data:
       pools:   4 pools, 5312 pgs
       objects: 75.87M objects, 287 TiB
       usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
       pgs:     5312 active+clean

# Reweight an OSD to 0
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
reweighted item id 0 name 'osd.0' to 0 in crush map

# Crush map changes - data movement is set up, no degraded PGs:
[root@xorphosd00 ~]# ceph -s
     cluster:
       id:     86d8a1b9-761b-4099-a960-6a303b951236
       health: HEALTH_WARN
               noout,nobackfill,noscrub,nodeep-scrub flag(s) set

     services:
       mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
       mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
       mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
       osd: 270 osds: 270 up (since 10m), 270 in (since 5h); 175 remapped pgs
            flags noout,nobackfill,noscrub,nodeep-scrub

     data:
       pools:   4 pools, 5312 pgs
       objects: 75.87M objects, 287 TiB
       usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
       pgs:     2562045/232996662 objects misplaced (1.100%)
                5137 active+clean
                172  active+remapped+backfilling
                3    active+remapped+backfill_wait

# Reweight it back to the original weight
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
reweighted item id 0 name 'osd.0' to 8 in crush map

# Cluster goes back to clean
[root@xorphosd00 ~]# ceph -s
     cluster:
       id:     86d8a1b9-761b-4099-a960-6a303b951236
       health: HEALTH_WARN
               noout,nobackfill,noscrub,nodeep-scrub flag(s) set

     services:
       mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
       mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
       mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
       osd: 270 osds: 270 up (since 11m), 270 in (since 5h)
            flags noout,nobackfill,noscrub,nodeep-scrub

     data:
       pools:   4 pools, 5312 pgs
       objects: 75.87M objects, 287 TiB
       usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
       pgs:     5312 active+clean




#
# Now the problematic case
#

# Stop an OSD
[root@xorphosd00 ~]# systemctl stop ceph-osd@0

# We get degraded PGs - as expected
[root@xorphosd00 ~]# ceph -s
     cluster:
       id:     86d8a1b9-761b-4099-a960-6a303b951236
       health: HEALTH_WARN
               noout,nobackfill,noscrub,nodeep-scrub flag(s) set
               1 osds down
               Degraded data redundancy: 873964/232996662 objects degraded
(0.375%), 82 pgs degraded

     services:
       mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
       mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
       mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
       osd: 270 osds: 269 up (since 16s), 270 in (since 5h)
            flags noout,nobackfill,noscrub,nodeep-scrub

     data:
       pools:   4 pools, 5312 pgs
       objects: 75.87M objects, 287 TiB
       usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
       pgs:     873964/232996662 objects degraded (0.375%)
                5230 active+clean
                82   active+undersized+degraded

# Reweight the OSD to 0:
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
reweighted item id 0 name 'osd.0' to 0 in crush map

# Still degraded - as expected
[root@xorphosd00 ~]# ceph -s
     cluster:
       id:     86d8a1b9-761b-4099-a960-6a303b951236
       health: HEALTH_WARN
               noout,nobackfill,noscrub,nodeep-scrub flag(s) set
               1 osds down
               Degraded data redundancy: 873964/232996662 objects degraded
(0.375%), 82 pgs degraded

     services:
       mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
       mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
       mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
       osd: 270 osds: 269 up (since 59s), 270 in (since 5h); 175 remapped pgs
            flags noout,nobackfill,noscrub,nodeep-scrub

     data:
       pools:   4 pools, 5312 pgs
       objects: 75.87M objects, 287 TiB
       usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
       pgs:     873964/232996662 objects degraded (0.375%)
                1688081/232996662 objects misplaced (0.725%)
                5137 active+clean
                93   active+remapped+backfilling
                82   active+undersized+degraded+remapped+backfilling

# Restarting the OSD
[root@xorphosd00 ~]# systemctl start ceph-osd@0

# And the PGs still stay degraded - THIS IS UNEXPECTED!!!
[root@xorphosd00 ~]# ceph -s
     cluster:
       id:     86d8a1b9-761b-4099-a960-6a303b951236
       health: HEALTH_WARN
               noout,nobackfill,noscrub,nodeep-scrub flag(s) set
               Degraded data redundancy: 873964/232996662 objects degraded
(0.375%), 82 pgs degraded

     services:
       mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
       mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
       mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
       osd: 270 osds: 270 up (since 14s), 270 in (since 5h); 175 remapped pgs
            flags noout,nobackfill,noscrub,nodeep-scrub

     data:
       pools:   4 pools, 5312 pgs
       objects: 75.87M objects, 287 TiB
       usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
       pgs:     873964/232996662 objects degraded (0.375%)
                1688081/232996662 objects misplaced (0.725%)
                5137 active+clean
                93   active+remapped+backfilling
                82   active+undersized+degraded+remapped+backfilling

# Now for something even more odd - reweight the OSD back to its
# original weight
# and all the data gets magically FOUND again on that OSD!!!
[root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
reweighted item id 0 name 'osd.0' to 8 in crush map
[root@xorphosd00 ~]# ceph -s
     cluster:
       id:     86d8a1b9-761b-4099-a960-6a303b951236
       health: HEALTH_WARN
               noout,nobackfill,noscrub,nodeep-scrub flag(s) set

     services:
       mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
       mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
       mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
       osd: 270 osds: 270 up (since 51s), 270 in (since 5h)
            flags noout,nobackfill,noscrub,nodeep-scrub

     data:
       pools:   4 pools, 5312 pgs
       objects: 75.87M objects, 287 TiB
       usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
       pgs:     5312 active+clean

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



