Re: Troubleshooting stuck unclean PGs?

Hi Matt,

The mon data can grow when PGs are stuck unclean. Don't restart the mons.

You need to find out why your placement groups are in "backfill_wait". Most likely some of your OSDs are (near)full.
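
For example, something like this (run from a node with the admin keyring) should show per-OSD utilization and any nearfull/backfillfull OSDs:

```
# Per-OSD utilization; look for OSDs well above the average %USE
ceph osd df tree

# Nearfull/backfillfull OSDs are also called out here
ceph health detail
```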

If you have space elsewhere, you can use the ceph balancer module or OSD reweighting to rebalance data.
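
A rough sketch (check the balancer documentation for your release before enabling it):

```
# Enable the balancer in upmap mode (requires all clients >= luminous)
ceph balancer mode upmap
ceph balancer on
ceph balancer status

# Or, alternatively, gently reweight the fullest OSDs
ceph osd reweight-by-utilization 120
```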

Scrubbing will resume once the PGs are "active+clean".
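
In the meantime you can keep an eye on the backfill progress, e.g.:

```
# Cluster-wide recovery/backfill progress
ceph -s

# Only the PGs that are still backfilling or waiting to backfill
ceph pg ls backfilling backfill_wait
```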

Kind regards,

Wout
42on

________________________________________
From: Matt Larson <larsonmattr@xxxxxxxxx>
Sent: Monday, September 21, 2020 6:22 PM
To: ceph-users@xxxxxxx
Subject:  Troubleshooting stuck unclean PGs?

Hi,

 Our Ceph cluster is reporting several PGs that have not been scrubbed
or deep scrubbed in time; it has been over a week since these PGs were
last scrubbed. `ceph health detail` shows 29 pgs not deep-scrubbed in
time and 22 pgs not scrubbed in time. I tried to manually start a scrub
on the PGs, but it appears they are actually in an unclean state that
needs to be resolved first.
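
(For reference, the manual scrub attempts were along the lines of:)

```
# using one of the affected PG ids as an example
ceph pg scrub 8.3c
ceph pg deep-scrub 8.3c
```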

This is a cluster running:
 ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)

 Following the information at [Troubleshooting
PGs](https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/),
I checked for PGs that are stuck stale | inactive | unclean. There were
no stale or inactive PGs, but several are stuck unclean:

 ```
PG_STAT  STATE                          UP                                UP_PRIMARY  ACTING                            ACTING_PRIMARY
8.3c     active+remapped+backfill_wait  [124,41,108,8,87,16,79,157,49]    124         [139,57,16,125,154,65,109,86,45]  139
8.3e     active+remapped+backfill_wait  [108,2,58,146,130,29,37,66,118]   108         [127,92,24,50,33,6,130,66,149]    127
8.3f     active+remapped+backfill_wait  [19,34,86,132,59,78,153,99,6]     19          [90,45,147,4,105,61,30,66,125]    90
8.40     active+remapped+backfill_wait  [19,131,80,76,42,101,61,3,144]    19          [28,106,132,3,151,36,65,60,83]    28
8.3a     active+remapped+backfilling    [32,72,151,30,103,131,62,84,120]  32          [91,60,7,133,101,117,78,20,158]   91
8.7e     active+remapped+backfill_wait  [108,2,58,146,130,29,37,66,118]   108         [127,92,24,50,33,6,130,66,149]    127
8.3b     active+remapped+backfill_wait  [34,113,148,63,18,95,70,129,13]   34          [66,17,132,90,14,52,101,47,115]   66
8.7f     active+remapped+backfill_wait  [19,34,86,132,59,78,153,99,6]     19          [90,45,147,4,105,61,30,66,125]    90
8.78     active+remapped+backfill_wait  [96,113,159,63,29,133,73,8,89]    96          [138,121,15,103,55,41,146,69,18]  138
8.7d     active+remapped+backfilling    [0,90,60,124,159,19,71,101,135]   0           [150,72,124,129,63,10,94,29,41]   150
8.7c     active+remapped+backfill_wait  [124,41,108,8,87,16,79,157,49]    124         [139,57,16,125,154,65,109,86,45]  139
8.79     active+remapped+backfill_wait  [59,15,41,82,131,20,73,156,113]   59          [13,51,120,102,29,149,42,79,132]  13
```
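
(The stuck-PG checks from that page are along the lines of:)

```
# List PGs stuck in each problem state
ceph pg dump_stuck stale
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean
```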

If I query one of the PGs that is backfilling, 8.3a, it shows its state as:
    "recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2020-09-19T20:45:44.027759+0000",
            "might_have_unfound": [],
            "recovery_progress": {
                "backfill_targets": [
                    "30(3)",
                    "32(0)",
                    "62(6)",
                    "72(1)",
                    "84(7)",
                    "103(4)",
                    "120(8)",
                    "131(5)",
                    "151(2)"
                ],

Q1: Is there anything that I should check/fix to enable the PGs to
resolve from the `unclean` state?
Q2: I have also seen that the podman containers on one of our OSD
servers are taking large amounts of disk space. Is there a way to
limit the growth of disk space for podman containers, when
administering a Ceph cluster using `cephadm` tools? At last check, a
server running 16 OSDs and 1 MON is using 39G of disk space for its
running containers. Can restarting containers help to start with a
fresh slate or reduce the disk use?
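
(For context, a rough breakdown of where that space goes can be seen with
something like:)

```
# Summary of image/container/volume disk usage
podman system df

# Verbose per-image and per-container breakdown
podman system df -v
```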

Thanks,
  Matt

------------------------

Matt Larson
Associate Scientist
Computer Scientist/System Administrator
UW-Madison Cryo-EM Research Center
433 Babcock Drive, Madison, WI 53706
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx