Troubleshooting stuck unclean PGs?

Hi,

 Our Ceph cluster is reporting several PGs that have not been scrubbed
or deep-scrubbed in time; it has been over a week since these PGs were
last scrubbed. `ceph health detail` reports 29 pgs not deep-scrubbed
in time and 22 pgs not scrubbed in time. I tried to manually start a
scrub on the PGs, but it appears that they are actually in an unclean
state that needs to be resolved first.
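
For reference, I was triggering the manual scrubs with commands along
these lines (using one of the affected PG IDs):

```shell
# Request a regular and a deep scrub on a specific PG
ceph pg scrub 8.3c
ceph pg deep-scrub 8.3c
```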

This is a cluster running:
 ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)

 Following the information at [Troubleshooting
PGs](https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/),
I checked for PGs that are stuck stale, inactive, or unclean. There
were no stale or inactive PGs, but several are stuck unclean:

 ```
PG_STAT  STATE                          UP                                UP_PRIMARY  ACTING                            ACTING_PRIMARY
8.3c     active+remapped+backfill_wait  [124,41,108,8,87,16,79,157,49]    124         [139,57,16,125,154,65,109,86,45]  139
8.3e     active+remapped+backfill_wait  [108,2,58,146,130,29,37,66,118]   108         [127,92,24,50,33,6,130,66,149]    127
8.3f     active+remapped+backfill_wait  [19,34,86,132,59,78,153,99,6]     19          [90,45,147,4,105,61,30,66,125]    90
8.40     active+remapped+backfill_wait  [19,131,80,76,42,101,61,3,144]    19          [28,106,132,3,151,36,65,60,83]    28
8.3a     active+remapped+backfilling    [32,72,151,30,103,131,62,84,120]  32          [91,60,7,133,101,117,78,20,158]   91
8.7e     active+remapped+backfill_wait  [108,2,58,146,130,29,37,66,118]   108         [127,92,24,50,33,6,130,66,149]    127
8.3b     active+remapped+backfill_wait  [34,113,148,63,18,95,70,129,13]   34          [66,17,132,90,14,52,101,47,115]   66
8.7f     active+remapped+backfill_wait  [19,34,86,132,59,78,153,99,6]     19          [90,45,147,4,105,61,30,66,125]    90
8.78     active+remapped+backfill_wait  [96,113,159,63,29,133,73,8,89]    96          [138,121,15,103,55,41,146,69,18]  138
8.7d     active+remapped+backfilling    [0,90,60,124,159,19,71,101,135]   0           [150,72,124,129,63,10,94,29,41]   150
8.7c     active+remapped+backfill_wait  [124,41,108,8,87,16,79,157,49]    124         [139,57,16,125,154,65,109,86,45]  139
8.79     active+remapped+backfill_wait  [59,15,41,82,131,20,73,156,113]   59          [13,51,120,102,29,149,42,79,132]  13
```
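
The listing above came from the stuck-PG checks described in that doc,
roughly:

```shell
# List PGs stuck in each problem state
# (the first two returned nothing on our cluster)
ceph pg dump_stuck stale
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean
```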

If I query one of the PGs that is backfilling, 8.3a, it shows its state as:

```
    "recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2020-09-19T20:45:44.027759+0000",
            "might_have_unfound": [],
            "recovery_progress": {
                "backfill_targets": [
                    "30(3)",
                    "32(0)",
                    "62(6)",
                    "72(1)",
                    "84(7)",
                    "103(4)",
                    "120(8)",
                    "131(5)",
                    "151(2)"
                ],
```

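That output is from a per-PG query:

```shell
# Full JSON query of a single PG, including recovery/backfill progress
ceph pg 8.3a query
```
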
Q1: Is there anything I should check or fix to help these PGs resolve
out of the `unclean` state?

Q2: I have also noticed that the podman containers on one of our OSD
servers are consuming a large amount of disk space. When administering
a Ceph cluster with the `cephadm` tools, is there a way to limit how
much disk space the podman containers grow to? At last check, a server
running 16 OSDs and 1 MON is using 39G of disk space for its running
containers. Can restarting the containers help start with a fresh
slate or reduce disk usage?
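
In case it helps frame the question, this is roughly how I have been
looking at the usage with podman's own accounting tools (I have not
run any of the prune commands yet, since I'm unsure what is safe on a
cephadm-managed host):

```shell
# Show where podman's disk space is going (images, containers, volumes)
podman system df

# Candidate cleanups I am hesitant to run without advice:
podman image prune -a   # remove unused images (e.g. old Ceph images after upgrades)
podman system prune     # remove stopped containers, unused networks, dangling images
```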

Thanks,
  Matt

------------------------

Matt Larson
Associate Scientist
Computer Scientist/System Administrator
UW-Madison Cryo-EM Research Center
433 Babcock Drive, Madison, WI 53706
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
