Re: Reweighting OSD while down results in undersized+degraded PGs

Hi Andras,

The cluster map and the crush map are not the same thing. If you change the crush map while the cluster is in a degraded state, you explicitly modify the history of cluster maps and have to live with the consequences (keeping history across crush map changes only works for up+in OSDs). Only OSDs that can peer are able to respond to changes of the crush map.
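
For what it is worth, the two maps can also be inspected and saved separately; a minimal sketch (the file paths are just examples):

# OSD map: epochs, pool and OSD state (the part that carries history while degraded)
ceph osd dump | head -n 5

# CRUSH map: placement rules and weights; saving a copy before maintenance makes it
# easy to restore exactly the previous state later
ceph osd getcrushmap -o /tmp/crushmap.bin
crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt   # human-readable decompile

# restore the saved map (only if you really want all the old weights back at once)
ceph osd setcrushmap -i /tmp/crushmap.bin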

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx>
Sent: 19 May 2020 15:57:49
To: Frank Schilder; ceph-users
Subject: Re:  Reweighting OSD while down results in undersized+degraded PGs

Hi Frank,

My understanding was that once a cluster is in a degraded state (an OSD
is down), Ceph stores all changed cluster maps until the cluster is
healthy again, exactly so that missing objects can be found. If there
is a real disaster of some kind, and many OSDs go up and down at
various times, there has to be a way of retracing where parts of a PG
were in the past. And most of the time this does work - I just don't
understand why this particular scenario is different.
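
As a side note, the history kept for a PG can be inspected directly; a rough sketch (the PG id 2.1a is made up, and the exact fields in the output vary by release):

# list the degraded/undersized PGs
ceph health detail

# query one of them; the recovery_state section shows how peering went and
# where the cluster expects to find the data from earlier map epochs
ceph pg 2.1a query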

Andras


On 5/19/20 3:49 AM, Frank Schilder wrote:
> Hi Andras,
>
> I made exactly the same observation in another scenario. I added some OSDs while other OSDs were down.
>
> This is expected.
>
> The crush map is an a priori algorithm for computing the location of objects without contacting a central server. Hence, *any* change to the crush map while an OSD is down will change the computed locations of the objects/PGs of that OSD. Consequently, these objects/PGs become degraded, because no up OSD reports them. Once peering finishes after setting the weight to 0, the cluster must assume they are lost.
>
> Changing a weight is a change of the crush map.
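>
> As a small illustration (the pool and object names below are made up): placement is a pure computation from the current maps, and any crush reweight shows up as a new map epoch:
>
> # where CRUSH currently maps a given object: ceph osd map <pool> <object>
> ceph osd map cephfs_data some_object_name
>
> # a crush reweight produces a new map; compare the epoch before and after
> ceph osd dump | grep ^epoch
> ceph osd crush reweight osd.0 0.0
> ceph osd dump | grep ^epoch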
>
> The way to get the cluster to re-scan after starting the down OSDs is to restore the crush map to exactly the state it was in before the OSD went down. In your case,
>
> - starting the down OSD,
> - setting weight back to original value
>
> will find all missing objects. After the cluster is clean, set the weight back to 0 and now the OSD will be vacated as expected.
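>
> As a rough command sketch of that sequence (OSD id and weight taken from the example further below; use the real original weight):
>
> systemctl start ceph-osd@0            # bring the down OSD back up
> ceph osd crush reweight osd.0 8.0     # restore the original crush weight
> ceph -s                               # wait until all PGs are active+clean again
> ceph osd crush reweight osd.0 0.0     # now vacate the OSD; PGs become misplaced, not degraded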
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx>
> Sent: 18 May 2020 22:25:37
> To: ceph-users
> Subject:  Reweighting OSD while down results in undersized+degraded PGs
>
> In a recent cluster reorganization, we ended up with a lot of
> undersized/degraded PGs and a day of recovery from them, when all we
> expected was moving some data around. After retracing my steps, I found
> something odd. If I crush-reweight an OSD to 0 while it is down, the
> PGs of that OSD end up degraded even after the OSD is restarted. If I
> do the same reweighting while the OSD is up, data gets moved without
> any degraded/undersized states. I would not expect this, so I wonder
> whether it is a bug or somehow intended. This is on Ceph Nautilus
> 14.2.8. Below are the details.
>
> Andras
>
>
> First the case that works as I would expect:
>
> # Healthy cluster ...
> [root@xorphosd00 ~]# ceph -s
>     cluster:
>       id:     86d8a1b9-761b-4099-a960-6a303b951236
>       health: HEALTH_WARN
>               noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>
>     services:
>       mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>       mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>       mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>       osd: 270 osds: 270 up (since 2m), 270 in (since 4h)
>            flags noout,nobackfill,noscrub,nodeep-scrub
>
>     data:
>       pools:   4 pools, 5312 pgs
>       objects: 75.87M objects, 287 TiB
>       usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>       pgs:     5312 active+clean
>
> # Reweight an OSD to 0
> [root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
> reweighted item id 0 name 'osd.0' to 0 in crush map
>
> # Crush map changes - data movement is set up, no degraded PGs:
> [root@xorphosd00 ~]# ceph -s
>     cluster:
>       id:     86d8a1b9-761b-4099-a960-6a303b951236
>       health: HEALTH_WARN
>               noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>
>     services:
>       mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>       mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>       mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>       osd: 270 osds: 270 up (since 10m), 270 in (since 5h); 175 remapped pgs
>            flags noout,nobackfill,noscrub,nodeep-scrub
>
>     data:
>       pools:   4 pools, 5312 pgs
>       objects: 75.87M objects, 287 TiB
>       usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>       pgs:     2562045/232996662 objects misplaced (1.100%)
>                5137 active+clean
>                172  active+remapped+backfilling
>                3    active+remapped+backfill_wait
>
> # Reweight it back to the original weight
> [root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
> reweighted item id 0 name 'osd.0' to 8 in crush map
>
> # Cluster goes back to clean
> [root@xorphosd00 ~]# ceph -s
>     cluster:
>       id:     86d8a1b9-761b-4099-a960-6a303b951236
>       health: HEALTH_WARN
>               noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>
>     services:
>       mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>       mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>       mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>       osd: 270 osds: 270 up (since 11m), 270 in (since 5h)
>            flags noout,nobackfill,noscrub,nodeep-scrub
>
>     data:
>       pools:   4 pools, 5312 pgs
>       objects: 75.87M objects, 287 TiB
>       usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>       pgs:     5312 active+clean
>
>
>
>
> #
> # Now the problematic case
> #
>
> # Stop an OSD
> [root@xorphosd00 ~]# systemctl stop ceph-osd@0
>
> # We get degraded PGs - as expected
> [root@xorphosd00 ~]# ceph -s
>     cluster:
>       id:     86d8a1b9-761b-4099-a960-6a303b951236
>       health: HEALTH_WARN
>               noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>               1 osds down
>               Degraded data redundancy: 873964/232996662 objects degraded
> (0.375%), 82 pgs degraded
>
>     services:
>       mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>       mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>       mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>       osd: 270 osds: 269 up (since 16s), 270 in (since 5h)
>            flags noout,nobackfill,noscrub,nodeep-scrub
>
>     data:
>       pools:   4 pools, 5312 pgs
>       objects: 75.87M objects, 287 TiB
>       usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>       pgs:     873964/232996662 objects degraded (0.375%)
>                5230 active+clean
>                82   active+undersized+degraded
>
> # Reweight the OSD to 0:
> [root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
> reweighted item id 0 name 'osd.0' to 0 in crush map
>
> # Still degraded - as expected
> [root@xorphosd00 ~]# ceph -s
>     cluster:
>       id:     86d8a1b9-761b-4099-a960-6a303b951236
>       health: HEALTH_WARN
>               noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>               1 osds down
>               Degraded data redundancy: 873964/232996662 objects degraded
> (0.375%), 82 pgs degraded
>
>     services:
>       mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>       mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>       mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>       osd: 270 osds: 269 up (since 59s), 270 in (since 5h); 175 remapped pgs
>            flags noout,nobackfill,noscrub,nodeep-scrub
>
>     data:
>       pools:   4 pools, 5312 pgs
>       objects: 75.87M objects, 287 TiB
>       usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>       pgs:     873964/232996662 objects degraded (0.375%)
>                1688081/232996662 objects misplaced (0.725%)
>                5137 active+clean
>                93   active+remapped+backfilling
>                82   active+undersized+degraded+remapped+backfilling
>
> # Restarting the OSD
> [root@xorphosd00 ~]# systemctl start ceph-osd@0
>
> # And the PGs still stay degraded - THIS IS UNEXPECTED!!!
> [root@xorphosd00 ~]# ceph -s
>     cluster:
>       id:     86d8a1b9-761b-4099-a960-6a303b951236
>       health: HEALTH_WARN
>               noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>               Degraded data redundancy: 873964/232996662 objects degraded
> (0.375%), 82 pgs degraded
>
>     services:
>       mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>       mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>       mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>       osd: 270 osds: 270 up (since 14s), 270 in (since 5h); 175 remapped pgs
>            flags noout,nobackfill,noscrub,nodeep-scrub
>
>     data:
>       pools:   4 pools, 5312 pgs
>       objects: 75.87M objects, 287 TiB
>       usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>       pgs:     873964/232996662 objects degraded (0.375%)
>                1688081/232996662 objects misplaced (0.725%)
>                5137 active+clean
>                93   active+remapped+backfilling
>                82   active+undersized+degraded+remapped+backfilling
>
> # Now for something even more odd - reweight the OSD back to its original weight
> # and all the data gets magically FOUND again on that OSD!!!
> [root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
> reweighted item id 0 name 'osd.0' to 8 in crush map
> [root@xorphosd00 ~]# ceph -s
>     cluster:
>       id:     86d8a1b9-761b-4099-a960-6a303b951236
>       health: HEALTH_WARN
>               noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>
>     services:
>       mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
>       mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
>       mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
>       osd: 270 osds: 270 up (since 51s), 270 in (since 5h)
>            flags noout,nobackfill,noscrub,nodeep-scrub
>
>     data:
>       pools:   4 pools, 5312 pgs
>       objects: 75.87M objects, 287 TiB
>       usage:   864 TiB used, 1.1 PiB / 1.9 PiB avail
>       pgs:     5312 active+clean
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
