Ceph version: Mimic 13.2.4

The cluster has been running steadily for more than a year. Recently we noticed cluster usage growing faster than usual, and we tracked the problem down to garbage collection: 'radosgw-admin gc list' shows millions of objects waiting for GC. The earliest tag time is 2019-09, but 99% of the entries are from 2020-03 until now.

`ceph df`:

GLOBAL:
    SIZE        AVAIL       RAW USED     %RAW USED
    1.7 PiB     1.1 PiB     602 TiB      35.22
POOLS:
    NAME                           ID     USED        %USED     MAX AVAIL     OBJECTS
    .rgw.root                      10     1.2 KiB     0         421 TiB       4
    default.rgw.control            11     0 B         0         421 TiB       8
    default.rgw.data.root          12     0 B         0         421 TiB       0
    default.rgw.gc                 13     0 B         0         421 TiB       0
    default.rgw.log                14     4.8 GiB     0         421 TiB       6414
    default.rgw.intent-log         15     0 B         0         421 TiB       0
    default.rgw.meta               16     110 KiB     0         421 TiB       463
    default.rgw.usage              17     0 B         0         421 TiB       0
    default.rgw.users.keys         18     0 B         0         421 TiB       0
    default.rgw.users.email        19     0 B         0         421 TiB       0
    default.rgw.users.swift        20     0 B         0         421 TiB       0
    default.rgw.users.uid          21     0 B         0         421 TiB       0
    default.rgw.buckets.extra      22     0 B         0         421 TiB       0
    default.rgw.buckets.index      23     0 B         0         421 TiB       118720
    default.rgw.buckets.data       24     263 TiB     38.41     421 TiB       138902771
    default.rgw.buckets.non-ec     25     0 B         0         421 TiB       16678

However, when we add up the per-bucket usage reported by 'radosgw-admin bucket stats', the buckets should only account for about 160 TiB, so roughly 80 TiB are sitting in the GC list.
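For reference, the ~160 TiB figure came from summing the per-bucket sizes. A minimal sketch of that check, assuming jq is available on the admin node and that the size_kb_actual field name matches the bucket stats JSON on our Mimic build:

# Sum size_kb_actual over every usage category of every bucket, print TiB.
radosgw-admin bucket stats \
  | jq '[ .[] | .usage[]? | .size_kb_actual? // 0 ] | add | . / 1024 / 1024 / 1024'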
Our GC settings before we noticed the problem:

rgw_gc_max_objs = 32
rgw_gc_obj_min_wait = 3600
rgw_gc_processor_period = 3600
rgw_gc_processor_max_time = 3600

Yesterday we adjusted the settings and restarted the RGWs:

rgw_gc_max_objs = 1024
rgw_gc_obj_min_wait = 300
rgw_gc_processor_period = 600
rgw_gc_processor_max_time = 600
rgw_gc_max_concurrent_io = 40
rgw_gc_max_trim_chunk = 1024

Today we counted the entries in each GC shard with 'rados -p default.rgw.log listomapkeys gc.$i --cluster ceph -N gc | wc -l' (i from 0 to 1023); only gc.0 to gc.511 contain any data. Here are the results at several points in time, sorted by count and truncated to the largest shards (a sketch of the counting loop follows the results).

- time 14:43 result:

……
36 gc_202004111443/gc.502.tag
38 gc_202004111443/gc.501.tag
40 gc_202004111443/gc.136.tag
46 gc_202004111443/gc.511.tag
212 gc_202004111443/gc.9.tag
218 gc_202004111443/gc.24.tag
21976 gc_202004111443/gc.13.tag
42956 gc_202004111443/gc.26.tag
71772 gc_202004111443/gc.25.tag
85766 gc_202004111443/gc.6.tag
104504 gc_202004111443/gc.7.tag
105444 gc_202004111443/gc.10.tag
106114 gc_202004111443/gc.3.tag
126860 gc_202004111443/gc.31.tag
127352 gc_202004111443/gc.23.tag
147942 gc_202004111443/gc.27.tag
148046 gc_202004111443/gc.15.tag
167116 gc_202004111443/gc.28.tag
167932 gc_202004111443/gc.21.tag
187986 gc_202004111443/gc.5.tag
188312 gc_202004111443/gc.22.tag
209084 gc_202004111443/gc.30.tag
209152 gc_202004111443/gc.18.tag
209702 gc_202004111443/gc.19.tag
231100 gc_202004111443/gc.8.tag
249622 gc_202004111443/gc.14.tag
251092 gc_202004111443/gc.2.tag
251366 gc_202004111443/gc.12.tag
251802 gc_202004111443/gc.0.tag
252158 gc_202004111443/gc.11.tag
272114 gc_202004111443/gc.1.tag
291518 gc_202004111443/gc.20.tag
293646 gc_202004111443/gc.16.tag
312998 gc_202004111443/gc.17.tag
352984 gc_202004111443/gc.29.tag
488232 gc_202004111443/gc.4.tag
5935806 total

- time 16:53 result:

……
28 gc_202004111653/gc.324.tag
28 gc_202004111653/gc.414.tag
30 gc_202004111653/gc.350.tag
30 gc_202004111653/gc.456.tag
204 gc_202004111653/gc.9.tag
208 gc_202004111653/gc.24.tag
21986 gc_202004111653/gc.13.tag
42964 gc_202004111653/gc.26.tag
71780 gc_202004111653/gc.25.tag
85778 gc_202004111653/gc.6.tag
104512 gc_202004111653/gc.7.tag
105452 gc_202004111653/gc.10.tag
106122 gc_202004111653/gc.3.tag
126866 gc_202004111653/gc.31.tag
127372 gc_202004111653/gc.23.tag
147944 gc_202004111653/gc.27.tag
148058 gc_202004111653/gc.15.tag
167124 gc_202004111653/gc.28.tag
167936 gc_202004111653/gc.21.tag
187992 gc_202004111653/gc.5.tag
188320 gc_202004111653/gc.22.tag
209090 gc_202004111653/gc.30.tag
209170 gc_202004111653/gc.18.tag
209704 gc_202004111653/gc.19.tag
231108 gc_202004111653/gc.8.tag
249632 gc_202004111653/gc.14.tag
251096 gc_202004111653/gc.2.tag
251376 gc_202004111653/gc.12.tag
251806 gc_202004111653/gc.0.tag
252170 gc_202004111653/gc.11.tag
272118 gc_202004111653/gc.1.tag
291526 gc_202004111653/gc.20.tag
293656 gc_202004111653/gc.16.tag
313000 gc_202004111653/gc.17.tag
352988 gc_202004111653/gc.29.tag
488238 gc_202004111653/gc.4.tag
5932714 total

- time 18:55 result:

……
22 gc_202004111855/gc.187.tag
22 gc_202004111855/gc.331.tag
24 gc_202004111855/gc.163.tag
24 gc_202004111855/gc.83.tag
194 gc_202004111855/gc.9.tag
208 gc_202004111855/gc.24.tag
21994 gc_202004111855/gc.13.tag
42966 gc_202004111855/gc.26.tag
71796 gc_202004111855/gc.25.tag
85788 gc_202004111855/gc.6.tag
104522 gc_202004111855/gc.7.tag
105458 gc_202004111855/gc.10.tag
106126 gc_202004111855/gc.3.tag
126872 gc_202004111855/gc.31.tag
127378 gc_202004111855/gc.23.tag
147954 gc_202004111855/gc.27.tag
148066 gc_202004111855/gc.15.tag
167136 gc_202004111855/gc.28.tag
167940 gc_202004111855/gc.21.tag
187996 gc_202004111855/gc.5.tag
188326 gc_202004111855/gc.22.tag
209092 gc_202004111855/gc.30.tag
209178 gc_202004111855/gc.18.tag
209720 gc_202004111855/gc.19.tag
231126 gc_202004111855/gc.8.tag
249646 gc_202004111855/gc.14.tag
251106 gc_202004111855/gc.2.tag
251378 gc_202004111855/gc.12.tag
251816 gc_202004111855/gc.0.tag
252176 gc_202004111855/gc.11.tag
272134 gc_202004111855/gc.1.tag
291528 gc_202004111855/gc.20.tag
293662 gc_202004111855/gc.16.tag
313010 gc_202004111855/gc.17.tag
352996 gc_202004111855/gc.29.tag
488250 gc_202004111855/gc.4.tag
5931710 total

- time 22:54 result:

……
32 gc_202004112254/gc.191.tag
32 gc_202004112254/gc.196.tag
32 gc_202004112254/gc.486.tag
34 gc_202004112254/gc.382.tag
208 gc_202004112254/gc.9.tag
210 gc_202004112254/gc.24.tag
22012 gc_202004112254/gc.13.tag
43008 gc_202004112254/gc.26.tag
71828 gc_202004112254/gc.25.tag
85814 gc_202004112254/gc.6.tag
104544 gc_202004112254/gc.7.tag
105468 gc_202004112254/gc.10.tag
106146 gc_202004112254/gc.3.tag
126882 gc_202004112254/gc.31.tag
127396 gc_202004112254/gc.23.tag
147980 gc_202004112254/gc.27.tag
148088 gc_202004112254/gc.15.tag
167146 gc_202004112254/gc.28.tag
167960 gc_202004112254/gc.21.tag
188006 gc_202004112254/gc.5.tag
188352 gc_202004112254/gc.22.tag
209114 gc_202004112254/gc.30.tag
209204 gc_202004112254/gc.18.tag
209758 gc_202004112254/gc.19.tag
231148 gc_202004112254/gc.8.tag
249668 gc_202004112254/gc.14.tag
251126 gc_202004112254/gc.2.tag
251406 gc_202004112254/gc.12.tag
251850 gc_202004112254/gc.0.tag
252186 gc_202004112254/gc.11.tag
272154 gc_202004112254/gc.1.tag
291552 gc_202004112254/gc.20.tag
293684 gc_202004112254/gc.16.tag
313038 gc_202004112254/gc.17.tag
353014 gc_202004112254/gc.29.tag
488274 gc_202004112254/gc.4.tag
5935738 total
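The per-shard counts above were gathered with a loop roughly like the following (a sketch of our wrapper around the command quoted earlier; the gc_<timestamp> directory and .tag file names are just our own convention):

# Dump each GC shard's omap keys into a file, then sort the line counts.
# 'wc -l' over all files also prints the "total" line shown above.
ts=$(date +%Y%m%d%H%M)
mkdir -p "gc_${ts}"
for i in $(seq 0 1023); do
    rados -p default.rgw.log listomapkeys "gc.${i}" --cluster ceph -N gc \
        > "gc_${ts}/gc.${i}.tag"
done
wc -l "gc_${ts}"/gc.*.tag | sort -n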
It seems gc.0-gc.31 keep on increasing, while gc.32-gc.511 work fine. Maybe some objects in gc.0-gc.31 are blocking the GC process, so the GC threads on gc.0-gc.31 cannot proceed.

After setting debug_rgw=20, we found that only a few processing periods (every 10 minutes, as configured) log anything like:

2020-04-11 23:02:18.724 7f7d1e61b700 5 gc::process: removing default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.985987.4__multipart_playback/2009550000004491/6b97a8a61424129d996d7a13769e8c66.ts.2~KnoW62bPMfmWBWtH0THWWzV3urFJglW.1

Most processing periods only show:

2020-04-11 23:32:14.814 7f7d1de1a700 2 object expiration: start
2020-04-11 23:32:14.814 7f7d1de1a700 20 proceeding shard = obj_delete_at_hint.0000000000
2020-04-11 23:32:14.815 7f7d1de1a700 20 proceeding shard = obj_delete_at_hint.0000000001
……
2020-04-11 23:22:14.811 7f7d1de1a700 20 proceeding shard = obj_delete_at_hint.0000000126
2020-04-11 23:22:14.813 7f7d1de1a700 2 object expiration: stop
2020-04-11 23:22:21.364 7f7d1e61b700 10 RGWGC::process() failed to acquire lock on gc.4
2020-04-11 23:22:31.074 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:22:53.074 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:23:15.075 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:23:37.075 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:23:59.075 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:24:21.075 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:24:43.076 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:25:05.075 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:25:27.076 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:25:49.075 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:26:11.075 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:26:33.075 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:26:55.076 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:27:17.075 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:27:39.075 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:28:01.075 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:28:23.076 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:28:45.075 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:29:07.076 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:29:29.076 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:29:51.076 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:30:13.076 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:30:35.077 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:30:57.076 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:31:19.076 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:31:41.076 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:32:03.077 7f7d1f61d700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-11 23:32:14.814 7f7d1de1a700 2 object expiration: start
2020-04-11 23:32:14.814 7f7d1de1a700 20 proceeding shard = obj_delete_at_hint.0000000000
2020-04-11 23:32:14.815 7f7d1de1a700 20 proceeding shard = obj_delete_at_hint.0000000001
……

And sometimes GC removes a few objects, like:

2020-04-10 12:01:45.141 7f2acdd38700 2 object expiration: stop
2020-04-10 12:01:45.886 7f2ace539700 10 RGWGC::process() failed to acquire lock on gc.8
2020-04-10 12:01:45.911 7f2ace539700 5 gc::process: removing default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.985987.4__multipart_playback/2009510000009744/8d5cb6722d478fd2395af1bc25b73880.ts.2~rDnSwcUmF-MKFwneDyMQRiI4tXuw6K7.1
2020-04-10 12:01:45.911 7f2ace539700 5 gc::process: removing default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.985987.4__multipart_playback/2009510000009744/8d5cb6722d478fd2395af1bc25b73880.ts.2~astXMDje7aBnOLzHG7qIdofTrpHTsUB.1
2020-04-10 12:01:45.911 7f2ace539700 5 gc::process: removing default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.985987.4__multipart_playback/2009550000003154/39abc9fe1bf0f289c6e02132cbc2d96f.ts.2~YW4Uv7qhbKqZoru7wIGvJXZAuytD7s6.1
2020-04-10 12:01:45.911 7f2ace539700 5 gc::process: removing default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.985987.4__multipart_playback/2009550000004519/296b0c23da59ff9eba8a522fc5febd11.ts.2~bL7CZCB-mKP7w2RHFomEsgBzAH1rZt9.1
2020-04-10 12:01:45.912 7f2ace539700 5 gc::process: removing default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.2190746.7__shadow_.VYxHLF5VGGXMFsYy-CNiMOfDVI-NTfQ_1
2020-04-10 12:01:45.912 7f2ace539700 5 gc::process: removing default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.2190746.7__shadow_.VYxHLF5VGGXMFsYy-CNiMOfDVI-NTfQ_2
2020-04-10 12:01:45.912 7f2ace539700 5 gc::process: removing default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.2190746.7__shadow_.VYxHLF5VGGXMFsYy-CNiMOfDVI-NTfQ_3
2020-04-10 12:01:45.912 7f2ace539700 5 gc::process: removing default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.2190746.7__shadow_.VYxHLF5VGGXMFsYy-CNiMOfDVI-NTfQ_4
2020-04-10 12:01:45.912 7f2ace539700 5 gc::process: removing default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.2190746.7__shadow_.VYxHLF5VGGXMFsYy-CNiMOfDVI-NTfQ_5
2020-04-10 12:01:45.914 7f2ace539700 10 RGWGC::process() failed to acquire lock on gc.10
2020-04-10 12:01:45.915 7f2ace539700 10 RGWGC::process() failed to acquire lock on gc.11
2020-04-10 12:02:02.638 7f2acf53b700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-10 12:02:24.638 7f2acf53b700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-10 12:02:46.638 7f2acf53b700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-04-10 12:03:08.638 7f2acf53b700 2 RGWDataChangesLog::ChangesRenewThread: start
……

We scanned these logs on all 5 GC nodes (they carry no other workload such as object reads/writes or lifecycle processing; they are dedicated to GC), and they all look the same: the "removing xxx" lines appear only a few times, within a few seconds, over a whole day of work. So we suppose some objects in gc.0-gc.31 are stalling the GC process.

Running 'radosgw-admin gc process' manually does not help either; its log output is the same as above.
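In case it helps with diagnosis, the contents of one of the growing shards can also be inspected directly; a minimal sketch (both commands are read-only, and gc.4 is simply the largest shard in our counts above):

# Peek at the raw omap keys queued on one of the stuck shards.
rados -p default.rgw.log listomapkeys gc.4 -N gc | head -20

# List the GC queue including entries whose min_wait has not expired yet.
radosgw-admin gc list --include-all | head -50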
Has anyone encountered the same problem? What can I do to make the GC threads proceed on gc.0-gc.31, or otherwise get everything back to normal?

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx