Re: 1 PG remains remapped after recovery

Tyler Stachecki <stachecki.tyler@xxxxxxxxx> · Sat, 27 Aug 2022 13:27:36 -0400

You seem to have an OSD that's down and out (status says 9 OSDs, 8 up and
in). One possibility is that the PG cannot fully recover because of the
existing CRUSH rules: the only OSD that could store the last replica is
down and out.
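
One quick way to confirm which OSD is down and out is to check the per-OSD up/in flags. The minimal sketch below assumes the JSON printed by `ceph osd dump --format json`, which lists each OSD under an `osds` array with integer `osd`, `up` and `in` fields. The sample data (nine OSDs, with osd.8 down and out) is hypothetical, because the quoted status output does not say which OSD failed.

```python
from __future__ import annotations

import json

# Hypothetical, trimmed sample shaped like `ceph osd dump --format json`.
# Only the fields used below are shown; osd.8 being down/out is invented.
SAMPLE_OSD_DUMP = """
{"osds": [
  {"osd": 0, "up": 1, "in": 1}, {"osd": 1, "up": 1, "in": 1},
  {"osd": 2, "up": 1, "in": 1}, {"osd": 3, "up": 1, "in": 1},
  {"osd": 4, "up": 1, "in": 1}, {"osd": 5, "up": 1, "in": 1},
  {"osd": 6, "up": 1, "in": 1}, {"osd": 7, "up": 1, "in": 1},
  {"osd": 8, "up": 0, "in": 0}
]}
"""


def down_or_out(osd_dump: dict) -> list[tuple[int, bool, bool]]:
    """Return (osd id, is_up, is_in) for every OSD that is not both up and in."""
    result = []
    for entry in osd_dump.get("osds", []):
        is_up = bool(entry["up"])
        is_in = bool(entry["in"])
        if not (is_up and is_in):
            result.append((entry["osd"], is_up, is_in))
    return result


if __name__ == "__main__":
    for osd_id, is_up, is_in in down_or_out(json.loads(SAMPLE_OSD_DUMP)):
        state = f"{'up' if is_up else 'down'}, {'in' if is_in else 'out'}"
        print(f"osd.{osd_id}: {state}")  # prints "osd.8: down, out"
```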

So, what do your CRUSH rules and replication look like?
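
The quoted PG line already hints at the answer. A PG is reported as remapped when its ACTING set differs from the UP set that CRUSH computes. The sketch below compares the two sets from the quoted `pg dump` output position by position. It assumes that NONE stands for CRUSH's "no OSD found" placeholder (CRUSH_ITEM_NONE, 0x7fffffff), meaning CRUSH produced no OSD for that shard position. The six positions and a NONE at a fixed slot suggest, but do not prove, that pool 4 is erasure coded. The helper names `parse_osd_set` and `explain_remap` are illustrative only.

```python
from __future__ import annotations

# CRUSH's "no item" placeholder (CRUSH_ITEM_NONE); `ceph pg dump` prints it as NONE.
CRUSH_ITEM_NONE = 2147483647


def parse_osd_set(text: str) -> list[int]:
    """Parse a set such as "[6,1,4,5,3,NONE]" into a list of OSD ids."""
    items = text.strip().strip("[]").split(",")
    return [CRUSH_ITEM_NONE if item.strip() == "NONE" else int(item) for item in items]


def explain_remap(up: list[int], acting: list[int]) -> dict:
    """Compare the CRUSH-computed UP set with the ACTING set, position by position."""
    # Positions for which CRUSH could not find any OSD.
    unmapped = [pos for pos, osd in enumerate(up) if osd == CRUSH_ITEM_NONE]
    # Positions where the acting OSD is not the one CRUSH chose.
    differing = [pos for pos, (u, a) in enumerate(zip(up, acting)) if u != a]
    # OSDs that hold more than one position of the same PG.
    seen: set[int] = set()
    duplicates: set[int] = set()
    for osd in acting:
        if osd == CRUSH_ITEM_NONE:
            continue
        if osd in seen:
            duplicates.add(osd)
        seen.add(osd)
    return {
        "remapped": up != acting,
        "unmapped_positions": unmapped,
        "differing_positions": differing,
        "acting_duplicates": sorted(duplicates),
    }


if __name__ == "__main__":
    up = parse_osd_set("[6,1,4,5,3,NONE]")
    acting = parse_osd_set("[6,1,4,5,3,1]")
    print(explain_remap(up, acting))
    # {'remapped': True, 'unmapped_positions': [5], 'differing_positions': [5],
    #  'acting_duplicates': [1]}
```

For PG 4.1c, this shows that CRUSH finds no OSD for position 5. The acting set covers that position with OSD 1, which already holds position 1. That pattern fits the idea that the current rule cannot place the last shard on the 8 OSDs that are up and in.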

Tyler

On Sat, Aug 27, 2022, 1:20 PM Frank Schilder <frans@xxxxxx> wrote:

> Hi all,
>
> our test cluster (octopus 15.2.16) ended up in a weird state:
>
>   cluster:
>     id:     bf1f51f5-b381-4cf7-b3db-88d044c1960c
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum tceph-01,tceph-03,tceph-02 (age 4w)
>     mgr: tceph-01(active, since 4w), standbys: tceph-02, tceph-03
>     mds: fs:1 {0=tceph-02=up:active} 2 up:standby
>     osd: 9 osds: 8 up (since 29h), 8 in (since 28h); 1 remapped pgs
>
>   data:
>     pools:   4 pools, 321 pgs
>     objects: 10.40M objects, 348 GiB
>     usage:   1.7 TiB used, 442 GiB / 2.2 TiB avail
>     pgs:     39434/46694661 objects misplaced (0.084%)
>              205 active+clean+snaptrim_wait
>              99  active+clean
>              16  active+clean+snaptrim
>              1   active+clean+remapped+snaptrim_wait
>
>   io:
>     client:   19 KiB/s rd, 22 MiB/s wr, 2 op/s rd, 174 op/s wr
>
> As part of the testing we failed an OSD to benchmark client IO under
> recovery. Strangely enough, after the cluster recovered, 1 PG remains in
> state remapped. Despite that, health is OK. This seems problematic, because
> the PG will probably accumulate PG_LOG entries until the remapped state is
> cleared. The history versions already look wildly different. Here is the
> full PG state:
>
> PG_STAT             4.1c
> OBJECTS             39438
> MISSING_ON_PRIMARY  0
> DEGRADED            0
> MISPLACED           39438
> UNFOUND             0
> BYTES               2825704691
> OMAP_BYTES*         0
> OMAP_KEYS*          0
> LOG                 1933
> DISK_LOG            1933
> STATE               active+clean+remapped+snaptrim_wait
> STATE_STAMP         2022-08-27T19:05:15.144083+0200
> VERSION             4170'3108053
> REPORTED            4170:3022415
> UP                  [6,1,4,5,3,NONE]
> UP_PRIMARY          6
> ACTING              [6,1,4,5,3,1]
> ACTING_PRIMARY      6
> LAST_SCRUB          3312'2843531
> SCRUB_STAMP         2022-08-24T22:40:42.482024+0200
> LAST_DEEP_SCRUB     2832'2067159
> DEEP_SCRUB_STAMP    2022-08-21T02:13:17.023702+0200
> SNAPTRIMQ_LEN       49
>
> Any ideas why this PG is stuck in remapped and does not rebalance objects?
> Is there a way to convince it to start rebalancing?
>
> Thanks and Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>