I tried this:

  sudo ceph tell 'osd.*' injectargs '--osd-max-backfills 4'

which increased things to 10 simultaneous backfills and a roughly 10x
higher rate of data movement. It looks like I could increase this
further by raising the number of simultaneous recovery operations, but
changing that parameter to 20 didn't cause any change. The command
warned that the OSDs may need to be restarted before it takes effect:

  sudo ceph tell 'osd.*' injectargs '--osd-recovery-max-active 20'
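For reference, a minimal sketch of how the running values can be
double-checked, assuming `osd.0` as an example daemon (the same checks
should work for any OSD id):

```
# Ask the daemon directly which values it is currently running with
sudo ceph tell osd.0 config get osd_max_backfills
sudo ceph tell osd.0 config get osd_recovery_max_active

# Or look at the mgr's view of the daemon's running configuration
sudo ceph config show osd.0 osd_max_backfills
```

With Octopus the values could presumably also be set persistently in
the central config database (e.g. `ceph config set osd
osd_max_backfills 4`) rather than via injectargs, so they would
survive OSD restarts.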
I'll let it run overnight with the higher backfill rate and see if
that is sufficient to let the cluster catch up.

The commands are from
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023844.html

-Matt

On Mon, Sep 21, 2020 at 7:20 PM Matt Larson <larsonmattr@xxxxxxxxx> wrote:
>
> Hi Wout,
>
> None of the OSDs are more than 20% full. However, only 1 PG is
> backfilling at a time, while the others are in backfill_wait. I had
> recently added a large amount of data to the Ceph cluster, and this
> may have caused the number of PGs to increase, triggering the need to
> rebalance or move objects.
>
> It appears that I could increase the number of backfill operations
> that happen simultaneously by raising `osd_max_backfills` and/or
> `osd_recovery_max_active`. Since the overall I/O during the backfill
> is pretty small, increasing the number of backfills happening at a
> time seems worth considering.
>
> Does this seem reasonable? If so, with Ceph Octopus/cephadm, how can I
> adjust these parameters?
>
> Thanks,
> Matt
>
> On Mon, Sep 21, 2020 at 2:21 PM Wout van Heeswijk <wout@xxxxxxxx> wrote:
> >
> > Hi Matt,
> >
> > The mon data can grow when PGs are stuck unclean. Don't restart the mons.
> >
> > You need to find out why your placement groups are in "backfill_wait".
> > Likely some of your OSDs are (near)full.
> >
> > If you have space elsewhere, you can use the ceph balancer module or
> > reweight OSDs to rebalance data.
> >
> > Scrubbing will continue once the PGs are "active+clean".
> >
> > Kind regards,
> >
> > Wout
> > 42on
> >
> > ________________________________________
> > From: Matt Larson <larsonmattr@xxxxxxxxx>
> > Sent: Monday, September 21, 2020 6:22 PM
> > To: ceph-users@xxxxxxx
> > Subject: Troubleshooting stuck unclean PGs?
> >
> > Hi,
> >
> > Our Ceph cluster is reporting several PGs that have not been scrubbed
> > or deep scrubbed in time. It has been over a week since these PGs
> > were last scrubbed. When I checked `ceph health detail`, it reported
> > 29 pgs not deep-scrubbed in time and 22 pgs not scrubbed in time. I
> > tried to manually start a scrub on the PGs, but it appears that they
> > are actually in an unclean state that needs to be resolved first.
> >
> > This is a cluster running:
> > ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)
> >
> > Following the information at [Troubleshooting
> > PGs](https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/),
> > I checked for PGs that are stuck stale | inactive | unclean. There
> > were no PGs that are stale or inactive, but there are several that
> > are stuck unclean:
> >
> > ```
> > PG_STAT STATE                         UP                               UP_PRIMARY ACTING                           ACTING_PRIMARY
> > 8.3c    active+remapped+backfill_wait [124,41,108,8,87,16,79,157,49]   124        [139,57,16,125,154,65,109,86,45] 139
> > 8.3e    active+remapped+backfill_wait [108,2,58,146,130,29,37,66,118]  108        [127,92,24,50,33,6,130,66,149]   127
> > 8.3f    active+remapped+backfill_wait [19,34,86,132,59,78,153,99,6]    19         [90,45,147,4,105,61,30,66,125]   90
> > 8.40    active+remapped+backfill_wait [19,131,80,76,42,101,61,3,144]   19         [28,106,132,3,151,36,65,60,83]   28
> > 8.3a    active+remapped+backfilling   [32,72,151,30,103,131,62,84,120] 32         [91,60,7,133,101,117,78,20,158]  91
> > 8.7e    active+remapped+backfill_wait [108,2,58,146,130,29,37,66,118]  108        [127,92,24,50,33,6,130,66,149]   127
> > 8.3b    active+remapped+backfill_wait [34,113,148,63,18,95,70,129,13]  34         [66,17,132,90,14,52,101,47,115]  66
> > 8.7f    active+remapped+backfill_wait [19,34,86,132,59,78,153,99,6]    19         [90,45,147,4,105,61,30,66,125]   90
> > 8.78    active+remapped+backfill_wait [96,113,159,63,29,133,73,8,89]   96         [138,121,15,103,55,41,146,69,18] 138
> > 8.7d    active+remapped+backfilling   [0,90,60,124,159,19,71,101,135]  0          [150,72,124,129,63,10,94,29,41]  150
> > 8.7c    active+remapped+backfill_wait [124,41,108,8,87,16,79,157,49]   124        [139,57,16,125,154,65,109,86,45] 139
> > 8.79    active+remapped+backfill_wait [59,15,41,82,131,20,73,156,113]  59         [13,51,120,102,29,149,42,79,132] 13
> > ```
> >
> > If I query one of the PGs that is backfilling, 8.3a, it shows its state as:
> >
> >     "recovery_state": [
> >         {
> >             "name": "Started/Primary/Active",
> >             "enter_time": "2020-09-19T20:45:44.027759+0000",
> >             "might_have_unfound": [],
> >             "recovery_progress": {
> >                 "backfill_targets": [
> >                     "30(3)",
> >                     "32(0)",
> >                     "62(6)",
> >                     "72(1)",
> >                     "84(7)",
> >                     "103(4)",
> >                     "120(8)",
> >                     "131(5)",
> >                     "151(2)"
> >                 ],
> >
> > Q1: Is there anything that I should check/fix to enable the PGs to
> > resolve from the `unclean` state?
> >
> > Q2: I have also seen that the podman containers on one of our OSD
> > servers are taking large amounts of disk space. Is there a way to
> > limit the growth of disk space for podman containers when
> > administering a Ceph cluster using the `cephadm` tools? At last
> > check, a server running 16 OSDs and 1 MON is using 39G of disk space
> > for its running containers. Can restarting the containers help to
> > start with a fresh slate or reduce the disk use?
> >
> > Thanks,
> > Matt
> >
> > ------------------------
> >
> > Matt Larson
> > Associate Scientist
> > Computer Scientist/System Administrator
> > UW-Madison Cryo-EM Research Center
> > 433 Babcock Drive, Madison, WI 53706
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
>
> --
> Matt Larson, PhD
> Madison, WI 53705 U.S.A.

--
Matt Larson, PhD
Madison, WI 53705 U.S.A.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx