Hi Matt,

The mon data can grow when PGs are stuck unclean. Don't restart the mons.

You need to find out why your placement groups are in "backfill_wait". Likely some of your OSDs are (near)full. If you have space elsewhere, you can use the ceph balancer module or reweight OSDs to rebalance the data.

Scrubbing will continue once the PGs are "active+clean".
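For example, something along these lines (untested here, so please double-check the syntax against the docs for your release):

```
# See how full each OSD is
ceph osd df tree

# Option 1: let the balancer module move data off the fullest OSDs
# (upmap mode requires "ceph osd set-require-min-compat-client luminous" or newer)
ceph balancer mode upmap
ceph balancer on
ceph balancer status

# Option 2: reweight the fullest OSDs by hand (dry run first, then apply)
ceph osd test-reweight-by-utilization
ceph osd reweight-by-utilization
```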
Kind regards,

Wout
42on

________________________________________
From: Matt Larson <larsonmattr@xxxxxxxxx>
Sent: Monday, September 21, 2020 6:22 PM
To: ceph-users@xxxxxxx
Subject: Troubleshooting stuck unclean PGs?

Hi,

Our Ceph cluster is reporting several PGs that have not been scrubbed or deep-scrubbed in time. It has been over a week since these PGs were scrubbed. `ceph health detail` reports 29 pgs not deep-scrubbed in time and 22 pgs not scrubbed in time.

I tried to manually start a scrub on the PGs, but it appears that they are actually in an unclean state that needs to be resolved first.

This is a cluster running:

ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)

Following the information at [Troubleshooting PGs](https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/), I checked for PGs that are stuck stale | inactive | unclean. There were no PGs that are stale or inactive, but there are several that are stuck unclean:

```
PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY
8.3c active+remapped+backfill_wait [124,41,108,8,87,16,79,157,49] 124 [139,57,16,125,154,65,109,86,45] 139
8.3e active+remapped+backfill_wait [108,2,58,146,130,29,37,66,118] 108 [127,92,24,50,33,6,130,66,149] 127
8.3f active+remapped+backfill_wait [19,34,86,132,59,78,153,99,6] 19 [90,45,147,4,105,61,30,66,125] 90
8.40 active+remapped+backfill_wait [19,131,80,76,42,101,61,3,144] 19 [28,106,132,3,151,36,65,60,83] 28
8.3a active+remapped+backfilling [32,72,151,30,103,131,62,84,120] 32 [91,60,7,133,101,117,78,20,158] 91
8.7e active+remapped+backfill_wait [108,2,58,146,130,29,37,66,118] 108 [127,92,24,50,33,6,130,66,149] 127
8.3b active+remapped+backfill_wait [34,113,148,63,18,95,70,129,13] 34 [66,17,132,90,14,52,101,47,115] 66
8.7f active+remapped+backfill_wait [19,34,86,132,59,78,153,99,6] 19 [90,45,147,4,105,61,30,66,125] 90
8.78 active+remapped+backfill_wait [96,113,159,63,29,133,73,8,89] 96 [138,121,15,103,55,41,146,69,18] 138
8.7d active+remapped+backfilling [0,90,60,124,159,19,71,101,135] 0 [150,72,124,129,63,10,94,29,41] 150
8.7c active+remapped+backfill_wait [124,41,108,8,87,16,79,157,49] 124 [139,57,16,125,154,65,109,86,45] 139
8.79 active+remapped+backfill_wait [59,15,41,82,131,20,73,156,113] 59 [13,51,120,102,29,149,42,79,132] 13
```

If I query one of the PGs that is backfilling, 8.3a, it shows its state as:

    "recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2020-09-19T20:45:44.027759+0000",
            "might_have_unfound": [],
            "recovery_progress": {
                "backfill_targets": [ "30(3)", "32(0)", "62(6)", "72(1)", "84(7)", "103(4)", "120(8)", "131(5)", "151(2)" ],

Q1: Is there anything that I should check/fix to enable the PGs to resolve from the `unclean` state?

Q2: I have also seen that the podman containers on one of our OSD servers are taking large amounts of disk space. Is there a way to limit the growth of disk space for podman containers when administering a Ceph cluster with the `cephadm` tools? At last check, a server running 16 OSDs and 1 MON is using 39G of disk space for its running containers. Can restarting the containers help to start with a fresh slate or reduce the disk use?

Thanks,
Matt

------------------------
Matt Larson
Associate Scientist
Computer Scientist/System Administrator
UW-Madison Cryo-EM Research Center
433 Babcock Drive, Madison, WI 53706
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx