Hi Wout,

None of the OSDs are greater than 20% full. However, only 1 PG is
backfilling at a time, while the others sit in backfill_wait. I
recently added a large amount of data to the Ceph cluster, and this
may have caused the number of PGs to increase, triggering the need to
rebalance and move objects.

It appears that I could increase the number of backfill operations
that run simultaneously by raising `osd_max_backfills` and/or
`osd_recovery_max_active`. Since the overall I/O during the backfill
is pretty small, increasing the number of concurrent backfills seems
worth considering. Does this seem reasonable? If so, how can I adjust
these parameters on Ceph Octopus with cephadm?
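For concreteness, here is roughly what I had in mind. This is just a
sketch based on my reading of the `ceph config` documentation, so
please correct me if it isn't the right approach on a cephadm-managed
cluster; the values 3 and 5 are placeholders I picked, not
recommendations, and osd.0 simply stands in for any running OSD:

```
# Run from a node with the admin keyring, e.g. inside "cephadm shell".

# Check the current values.
ceph config get osd osd_max_backfills
ceph config get osd osd_recovery_max_active

# Allow more concurrent backfill/recovery operations per OSD
# (3 and 5 are placeholder values, not recommendations).
ceph config set osd osd_max_backfills 3
ceph config set osd osd_recovery_max_active 5

# Verify that a running OSD has picked up the new settings.
ceph config show osd.0 | grep -E 'osd_max_backfills|osd_recovery_max_active'

# If client I/O starts to suffer, remove the overrides so the OSDs
# fall back to the defaults.
ceph config rm osd osd_max_backfills
ceph config rm osd osd_recovery_max_active
```

My understanding is that `ceph config set` stores the value in the
monitors' configuration database, so it applies to all OSDs and
persists across restarts, whereas `ceph tell osd.* injectargs
'--osd_max_backfills 3'` would only change the currently running
daemons. Is that right?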
Thanks,
Matt

On Mon, Sep 21, 2020 at 2:21 PM Wout van Heeswijk <wout@xxxxxxxx> wrote:
>
> Hi Matt,
>
> The mon data can grow while PGs are stuck unclean. Don't restart the mons.
>
> You need to find out why your placement groups are in "backfill_wait". Likely some of your OSDs are (near)full.
>
> If you have space elsewhere, you can use the ceph balancer module or reweight OSDs to rebalance data.
>
> Scrubbing will continue once the PGs are "active+clean".
>
> Kind regards,
>
> Wout
> 42on
>
> ________________________________________
> From: Matt Larson <larsonmattr@xxxxxxxxx>
> Sent: Monday, September 21, 2020 6:22 PM
> To: ceph-users@xxxxxxx
> Subject: Troubleshooting stuck unclean PGs?
>
> Hi,
>
> Our Ceph cluster is reporting several PGs that have not been scrubbed
> or deep scrubbed in time. It has been over a week since these PGs were
> last scrubbed. When I checked `ceph health detail`, it reported 29 pgs
> not deep-scrubbed in time and 22 pgs not scrubbed in time. I tried to
> manually start a scrub on the PGs, but it appears that they are
> actually in an unclean state that needs to be resolved first.
>
> This is a cluster running:
> ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)
>
> Following the information at [Troubleshooting
> PGs](https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/),
> I checked for PGs that are stuck stale | inactive | unclean. There
> were no PGs that are stale or inactive, but there are several that are
> stuck unclean:
>
> ```
> PG_STAT  STATE                          UP                                UP_PRIMARY  ACTING                            ACTING_PRIMARY
> 8.3c     active+remapped+backfill_wait  [124,41,108,8,87,16,79,157,49]    124         [139,57,16,125,154,65,109,86,45]  139
> 8.3e     active+remapped+backfill_wait  [108,2,58,146,130,29,37,66,118]   108         [127,92,24,50,33,6,130,66,149]    127
> 8.3f     active+remapped+backfill_wait  [19,34,86,132,59,78,153,99,6]     19          [90,45,147,4,105,61,30,66,125]    90
> 8.40     active+remapped+backfill_wait  [19,131,80,76,42,101,61,3,144]    19          [28,106,132,3,151,36,65,60,83]    28
> 8.3a     active+remapped+backfilling    [32,72,151,30,103,131,62,84,120]  32          [91,60,7,133,101,117,78,20,158]   91
> 8.7e     active+remapped+backfill_wait  [108,2,58,146,130,29,37,66,118]   108         [127,92,24,50,33,6,130,66,149]    127
> 8.3b     active+remapped+backfill_wait  [34,113,148,63,18,95,70,129,13]   34          [66,17,132,90,14,52,101,47,115]   66
> 8.7f     active+remapped+backfill_wait  [19,34,86,132,59,78,153,99,6]     19          [90,45,147,4,105,61,30,66,125]    90
> 8.78     active+remapped+backfill_wait  [96,113,159,63,29,133,73,8,89]    96          [138,121,15,103,55,41,146,69,18]  138
> 8.7d     active+remapped+backfilling    [0,90,60,124,159,19,71,101,135]   0           [150,72,124,129,63,10,94,29,41]   150
> 8.7c     active+remapped+backfill_wait  [124,41,108,8,87,16,79,157,49]    124         [139,57,16,125,154,65,109,86,45]  139
> 8.79     active+remapped+backfill_wait  [59,15,41,82,131,20,73,156,113]   59          [13,51,120,102,29,149,42,79,132]  13
> ```
>
> If I query one of the PGs that is backfilling, 8.3a, it shows its state as:
>
>     "recovery_state": [
>         {
>             "name": "Started/Primary/Active",
>             "enter_time": "2020-09-19T20:45:44.027759+0000",
>             "might_have_unfound": [],
>             "recovery_progress": {
>                 "backfill_targets": [
>                     "30(3)",
>                     "32(0)",
>                     "62(6)",
>                     "72(1)",
>                     "84(7)",
>                     "103(4)",
>                     "120(8)",
>                     "131(5)",
>                     "151(2)"
>                 ],
>
> Q1: Is there anything that I should check or fix to help the PGs
> resolve out of the `unclean` state?
>
> Q2: I have also seen that the podman containers on one of our OSD
> servers are taking large amounts of disk space. Is there a way to
> limit the growth of disk space for podman containers when
> administering a Ceph cluster using `cephadm` tools? At last check, a
> server running 16 OSDs and 1 MON is using 39G of disk space for its
> running containers. Can restarting containers help to start with a
> fresh slate or reduce the disk use?
>
> Thanks,
> Matt
>
> ------------------------
>
> Matt Larson
> Associate Scientist
> Computer Scientist/System Administrator
> UW-Madison Cryo-EM Research Center
> 433 Babcock Drive, Madison, WI 53706
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Matt Larson, PhD
Madison, WI 53705 U.S.A.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx