Re: radosgw garbage collection seems stuck and manual gc process didn't work

An issue presenting exactly like this was fixed in the spring of last year,
certainly on Nautilus and higher.

Matt

On Sat, Apr 11, 2020, 12:04 PM <346415320@xxxxxx> wrote:

> Ceph Version : Mimic 13.2.4
>
> The cluster has been running steadily for more than a year. Recently I
> found that cluster usage was growing faster than usual, and we figured out
> that the problem is garbage collection.
> 'radosgw-admin gc list' shows millions of objects awaiting GC. The earliest
> tag time is 2019-09, but 99% of the entries are from 2020-03 onwards.
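>
> (As a rough way to count the pending entries: '--include-all' also lists
> entries whose min-wait has not yet expired; grepping on the "tag" field is
> only a sketch that assumes the JSON field names of this version's output.)
>
> # dump all pending GC entries, including not-yet-expired ones, and count them
> radosgw-admin gc list --include-all > gc_list.json
> grep -c '"tag"' gc_list.json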
>
> `ceph df`
> GLOBAL:
>     SIZE        AVAIL       RAW USED     %RAW USED
>     1.7 PiB     1.1 PiB      602 TiB         35.22
> POOLS:
>     NAME                           ID     USED        %USED     MAX AVAIL       OBJECTS
>     .rgw.root                      10     1.2 KiB         0       421 TiB             4
>     default.rgw.control            11         0 B         0       421 TiB             8
>     default.rgw.data.root          12         0 B         0       421 TiB             0
>     default.rgw.gc                 13         0 B         0       421 TiB             0
>     default.rgw.log                14     4.8 GiB         0       421 TiB          6414
>     default.rgw.intent-log         15         0 B         0       421 TiB             0
>     default.rgw.meta               16     110 KiB         0       421 TiB           463
>     default.rgw.usage              17         0 B         0       421 TiB             0
>     default.rgw.users.keys         18         0 B         0       421 TiB             0
>     default.rgw.users.email        19         0 B         0       421 TiB             0
>     default.rgw.users.swift        20         0 B         0       421 TiB             0
>     default.rgw.users.uid          21         0 B         0       421 TiB             0
>     default.rgw.buckets.extra      22         0 B         0       421 TiB             0
>     default.rgw.buckets.index      23         0 B         0       421 TiB        118720
>     default.rgw.buckets.data       24     263 TiB     38.41       421 TiB     138902771
>     default.rgw.buckets.non-ec     25         0 B         0       421 TiB         16678
>
> However, when we summed per-bucket usage from 'radosgw-admin bucket stats',
> the data should take only about 160 TiB, so roughly 80 TiB is sitting in
> the GC list.
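>
> (A sketch of how the per-bucket sum can be computed, assuming jq is
> available and that the stats JSON exposes size_kb_actual under
> usage/"rgw.main"; field names may differ between versions:)
>
> # sum actual usage over all buckets and convert KiB -> TiB
> radosgw-admin bucket stats | \
>   jq '[.[].usage["rgw.main"].size_kb_actual // 0] | add / 1024 / 1024 / 1024'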
>
> Former GC config settings, before we found the GC problem:
> rgw_gc_max_objs = 32
> rgw_gc_obj_min_wait = 3600
> rgw_gc_processor_period = 3600
> rgw_gc_processor_max_time = 3600
>
> Yesterday we adjusted the settings (a ceph.conf sketch follows the list)
> and restarted the RGWs:
> rgw_gc_max_objs = 1024
> rgw_gc_obj_min_wait = 300
> rgw_gc_processor_period = 600
> rgw_gc_processor_max_time = 600
> rgw_gc_max_concurrent_io = 40
> rgw_gc_max_trim_chunk = 1024
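>
> (For reference, a sketch of how these land in ceph.conf before restarting
> the radosgw daemons; the section and daemon names below are only examples:)
>
> [client.rgw.gc-node1]            # example section name
>     rgw_gc_max_objs = 1024
>     rgw_gc_obj_min_wait = 300
>     rgw_gc_processor_period = 600
>     rgw_gc_processor_max_time = 600
>     rgw_gc_max_concurrent_io = 40
>     rgw_gc_max_trim_chunk = 1024
>
> # verify the running values via the admin socket
> ceph daemon client.rgw.gc-node1 config show | grep rgw_gc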
>
> Today we ran 'rados -p default.rgw.log listomapkeys gc.$i --cluster ceph
> -N gc | wc -l' for i from 0 to 1023 (the loop is sketched below);
> only gc.0 through gc.511 have any data.
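>
> (The scan was essentially this loop; the listings below use the same
> "count filename" format:)
>
> ts=$(date +%Y%m%d%H%M)
> for i in $(seq 0 1023); do
>     # count omap keys (pending GC entries) on each GC shard object
>     n=$(rados -p default.rgw.log -N gc --cluster ceph listomapkeys gc.$i | wc -l)
>     echo "$n gc_${ts}/gc.$i.tag"
> done | sort -n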
>
> Here are some of the results, sorted:
>  -time 14:43 result:
>        ……
>        36 gc_202004111443/gc.502.tag
>        38 gc_202004111443/gc.501.tag
>        40 gc_202004111443/gc.136.tag
>        46 gc_202004111443/gc.511.tag
>       212 gc_202004111443/gc.9.tag
>       218 gc_202004111443/gc.24.tag
>     21976 gc_202004111443/gc.13.tag
>     42956 gc_202004111443/gc.26.tag
>     71772 gc_202004111443/gc.25.tag
>     85766 gc_202004111443/gc.6.tag
>    104504 gc_202004111443/gc.7.tag
>    105444 gc_202004111443/gc.10.tag
>    106114 gc_202004111443/gc.3.tag
>    126860 gc_202004111443/gc.31.tag
>    127352 gc_202004111443/gc.23.tag
>    147942 gc_202004111443/gc.27.tag
>    148046 gc_202004111443/gc.15.tag
>    167116 gc_202004111443/gc.28.tag
>    167932 gc_202004111443/gc.21.tag
>    187986 gc_202004111443/gc.5.tag
>    188312 gc_202004111443/gc.22.tag
>    209084 gc_202004111443/gc.30.tag
>    209152 gc_202004111443/gc.18.tag
>    209702 gc_202004111443/gc.19.tag
>    231100 gc_202004111443/gc.8.tag
>    249622 gc_202004111443/gc.14.tag
>    251092 gc_202004111443/gc.2.tag
>    251366 gc_202004111443/gc.12.tag
>    251802 gc_202004111443/gc.0.tag
>    252158 gc_202004111443/gc.11.tag
>    272114 gc_202004111443/gc.1.tag
>    291518 gc_202004111443/gc.20.tag
>    293646 gc_202004111443/gc.16.tag
>    312998 gc_202004111443/gc.17.tag
>    352984 gc_202004111443/gc.29.tag
>    488232 gc_202004111443/gc.4.tag
>   5935806 total
>
>
>  -time 16:53 result:
> ……
>        28 gc_202004111653/gc.324.tag
>        28 gc_202004111653/gc.414.tag
>        30 gc_202004111653/gc.350.tag
>        30 gc_202004111653/gc.456.tag
>       204 gc_202004111653/gc.9.tag
>       208 gc_202004111653/gc.24.tag
>     21986 gc_202004111653/gc.13.tag
>     42964 gc_202004111653/gc.26.tag
>     71780 gc_202004111653/gc.25.tag
>     85778 gc_202004111653/gc.6.tag
>    104512 gc_202004111653/gc.7.tag
>    105452 gc_202004111653/gc.10.tag
>    106122 gc_202004111653/gc.3.tag
>    126866 gc_202004111653/gc.31.tag
>    127372 gc_202004111653/gc.23.tag
>    147944 gc_202004111653/gc.27.tag
>    148058 gc_202004111653/gc.15.tag
>    167124 gc_202004111653/gc.28.tag
>    167936 gc_202004111653/gc.21.tag
>    187992 gc_202004111653/gc.5.tag
>    188320 gc_202004111653/gc.22.tag
>    209090 gc_202004111653/gc.30.tag
>    209170 gc_202004111653/gc.18.tag
>    209704 gc_202004111653/gc.19.tag
>    231108 gc_202004111653/gc.8.tag
>    249632 gc_202004111653/gc.14.tag
>    251096 gc_202004111653/gc.2.tag
>    251376 gc_202004111653/gc.12.tag
>    251806 gc_202004111653/gc.0.tag
>    252170 gc_202004111653/gc.11.tag
>    272118 gc_202004111653/gc.1.tag
>    291526 gc_202004111653/gc.20.tag
>    293656 gc_202004111653/gc.16.tag
>    313000 gc_202004111653/gc.17.tag
>    352988 gc_202004111653/gc.29.tag
>    488238 gc_202004111653/gc.4.tag
>   5932714 total
>
>
> -time 18:55 result:
> ……
>        22 gc_202004111855/gc.187.tag
>        22 gc_202004111855/gc.331.tag
>        24 gc_202004111855/gc.163.tag
>        24 gc_202004111855/gc.83.tag
>       194 gc_202004111855/gc.9.tag
>       208 gc_202004111855/gc.24.tag
>     21994 gc_202004111855/gc.13.tag
>     42966 gc_202004111855/gc.26.tag
>     71796 gc_202004111855/gc.25.tag
>     85788 gc_202004111855/gc.6.tag
>    104522 gc_202004111855/gc.7.tag
>    105458 gc_202004111855/gc.10.tag
>    106126 gc_202004111855/gc.3.tag
>    126872 gc_202004111855/gc.31.tag
>    127378 gc_202004111855/gc.23.tag
>    147954 gc_202004111855/gc.27.tag
>    148066 gc_202004111855/gc.15.tag
>    167136 gc_202004111855/gc.28.tag
>    167940 gc_202004111855/gc.21.tag
>    187996 gc_202004111855/gc.5.tag
>    188326 gc_202004111855/gc.22.tag
>    209092 gc_202004111855/gc.30.tag
>    209178 gc_202004111855/gc.18.tag
>    209720 gc_202004111855/gc.19.tag
>    231126 gc_202004111855/gc.8.tag
>    249646 gc_202004111855/gc.14.tag
>    251106 gc_202004111855/gc.2.tag
>    251378 gc_202004111855/gc.12.tag
>    251816 gc_202004111855/gc.0.tag
>    252176 gc_202004111855/gc.11.tag
>    272134 gc_202004111855/gc.1.tag
>    291528 gc_202004111855/gc.20.tag
>    293662 gc_202004111855/gc.16.tag
>    313010 gc_202004111855/gc.17.tag
>    352996 gc_202004111855/gc.29.tag
>    488250 gc_202004111855/gc.4.tag
>   5931710 total
>
> -time 22:54 result:
> ……
>        32 gc_202004112254/gc.191.tag
>        32 gc_202004112254/gc.196.tag
>        32 gc_202004112254/gc.486.tag
>        34 gc_202004112254/gc.382.tag
>       208 gc_202004112254/gc.9.tag
>       210 gc_202004112254/gc.24.tag
>     22012 gc_202004112254/gc.13.tag
>     43008 gc_202004112254/gc.26.tag
>     71828 gc_202004112254/gc.25.tag
>     85814 gc_202004112254/gc.6.tag
>    104544 gc_202004112254/gc.7.tag
>    105468 gc_202004112254/gc.10.tag
>    106146 gc_202004112254/gc.3.tag
>    126882 gc_202004112254/gc.31.tag
>    127396 gc_202004112254/gc.23.tag
>    147980 gc_202004112254/gc.27.tag
>    148088 gc_202004112254/gc.15.tag
>    167146 gc_202004112254/gc.28.tag
>    167960 gc_202004112254/gc.21.tag
>    188006 gc_202004112254/gc.5.tag
>    188352 gc_202004112254/gc.22.tag
>    209114 gc_202004112254/gc.30.tag
>    209204 gc_202004112254/gc.18.tag
>    209758 gc_202004112254/gc.19.tag
>    231148 gc_202004112254/gc.8.tag
>    249668 gc_202004112254/gc.14.tag
>    251126 gc_202004112254/gc.2.tag
>    251406 gc_202004112254/gc.12.tag
>    251850 gc_202004112254/gc.0.tag
>    252186 gc_202004112254/gc.11.tag
>    272154 gc_202004112254/gc.1.tag
>    291552 gc_202004112254/gc.20.tag
>    293684 gc_202004112254/gc.16.tag
>    313038 gc_202004112254/gc.17.tag
>    353014 gc_202004112254/gc.29.tag
>    488274 gc_202004112254/gc.4.tag
>   5935738 total
>
>
> It seems gc.0 through gc.31 keep increasing, while gc.32 through gc.511
> work fine. Perhaps some objects in gc.0-gc.31 are blocking the GC process,
> so the GC threads on gc.0-gc.31 cannot proceed.
>
> After setting debug_rgw=20, we find that only a few processing periods
> (every 10 minutes, as configured) log lines like:
> '2020-04-11 23:02:18.724 7f7d1e61b700  5 gc::process: removing
> default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.985987.4__multipart_playback/2009550000004491/6b97a8a61424129d996d7a13769e8c66.ts.2~KnoW62bPMfmWBWtH0THWWzV3urFJglW.1'
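>
> (debug_rgw was raised on the running radosgw daemons; one way to do that is
> via the admin socket, roughly like this, where the daemon name is only an
> example:)
>
> ceph daemon client.rgw.gc-node1 config set debug_rgw 20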
>
> Most processing periods only show:
> '2020-04-11 23:32:14.814 7f7d1de1a700  2 object expiration: start
> 2020-04-11 23:32:14.814 7f7d1de1a700 20 proceeding shard =
> obj_delete_at_hint.0000000000
> 2020-04-11 23:32:14.815 7f7d1de1a700 20 proceeding shard =
> obj_delete_at_hint.0000000001
> ……
> 2020-04-11 23:22:14.811 7f7d1de1a700 20 proceeding shard =
> obj_delete_at_hint.0000000126
> 2020-04-11 23:22:14.813 7f7d1de1a700  2 object expiration: stop
> 2020-04-11 23:22:21.364 7f7d1e61b700 10 RGWGC::process() failed to acquire
> lock on gc.4
> 2020-04-11 23:22:31.074 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:22:53.074 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:23:15.075 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:23:37.075 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:23:59.075 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:24:21.075 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:24:43.076 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:25:05.075 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:25:27.076 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:25:49.075 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:26:11.075 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:26:33.075 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:26:55.076 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:27:17.075 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:27:39.075 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:28:01.075 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:28:23.076 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:28:45.075 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:29:07.076 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:29:29.076 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:29:51.076 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:30:13.076 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:30:35.077 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:30:57.076 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:31:19.076 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:31:41.076 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:32:03.077 7f7d1f61d700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-11 23:32:14.814 7f7d1de1a700  2 object expiration: start
> 2020-04-11 23:32:14.814 7f7d1de1a700 20 proceeding shard =
> obj_delete_at_hint.0000000000
> 2020-04-11 23:32:14.815 7f7d1de1a700 20 proceeding shard =
> obj_delete_at_hint.0000000001
> ……'
>
> And sometimes GC removes a few objects, like:
>
> '2020-04-10 12:01:45.141 7f2acdd38700  2 object expiration: stop
> 2020-04-10 12:01:45.886 7f2ace539700 10 RGWGC::process() failed to acquire
> lock on gc.8
> 2020-04-10 12:01:45.911 7f2ace539700  5 gc::process: removing
> default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.985987.4__multipart_playback/2009510000009744/8d5cb6722d478fd2395af1bc25b73880.ts.2~rDnSwcUmF-MKFwneDyMQRiI4tXuw6K7.1
> 2020-04-10 12:01:45.911 7f2ace539700  5 gc::process: removing
> default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.985987.4__multipart_playback/2009510000009744/8d5cb6722d478fd2395af1bc25b73880.ts.2~astXMDje7aBnOLzHG7qIdofTrpHTsUB.1
> 2020-04-10 12:01:45.911 7f2ace539700  5 gc::process: removing
> default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.985987.4__multipart_playback/2009550000003154/39abc9fe1bf0f289c6e02132cbc2d96f.ts.2~YW4Uv7qhbKqZoru7wIGvJXZAuytD7s6.1
> 2020-04-10 12:01:45.911 7f2ace539700  5 gc::process: removing
> default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.985987.4__multipart_playback/2009550000004519/296b0c23da59ff9eba8a522fc5febd11.ts.2~bL7CZCB-mKP7w2RHFomEsgBzAH1rZt9.1
> 2020-04-10 12:01:45.912 7f2ace539700  5 gc::process: removing
> default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.2190746.7__shadow_.VYxHLF5VGGXMFsYy-CNiMOfDVI-NTfQ_1
> 2020-04-10 12:01:45.912 7f2ace539700  5 gc::process: removing
> default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.2190746.7__shadow_.VYxHLF5VGGXMFsYy-CNiMOfDVI-NTfQ_2
> 2020-04-10 12:01:45.912 7f2ace539700  5 gc::process: removing
> default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.2190746.7__shadow_.VYxHLF5VGGXMFsYy-CNiMOfDVI-NTfQ_3
> 2020-04-10 12:01:45.912 7f2ace539700  5 gc::process: removing
> default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.2190746.7__shadow_.VYxHLF5VGGXMFsYy-CNiMOfDVI-NTfQ_4
> 2020-04-10 12:01:45.912 7f2ace539700  5 gc::process: removing
> default.rgw.buckets.data:d9d41307-13ce-4b43-aa24-2495540b67f5.2190746.7__shadow_.VYxHLF5VGGXMFsYy-CNiMOfDVI-NTfQ_5
> 2020-04-10 12:01:45.914 7f2ace539700 10 RGWGC::process() failed to acquire
> lock on gc.10
> 2020-04-10 12:01:45.915 7f2ace539700 10 RGWGC::process() failed to acquire
> lock on gc.11
> 2020-04-10 12:02:02.638 7f2acf53b700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-10 12:02:24.638 7f2acf53b700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-10 12:02:46.638 7f2acf53b700  2
> RGWDataChangesLog::ChangesRenewThread: start
> 2020-04-10 12:03:08.638 7f2acf53b700  2
> RGWDataChangesLog::ChangesRenewThread: start
> ……'
>
> We scanned these logs on 5 GC nodes (which carry no other workload such as
> object reads/writes or lifecycle; they are dedicated to GC), and all of
> them look the same. The "removing xxx" lines appear only a few times,
> within a few seconds, over a whole day of work, so we suspect some objects
> in gc.0-gc.31 are blocking the GC process.
> A manual 'radosgw-admin gc process' does not work either; its log output is
> the same as above.
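>
> (The manual pass was run along these lines; the debug and logging overrides
> are optional sketches for capturing verbose output to a file:)
>
> radosgw-admin gc process --cluster ceph --debug-rgw=20 --log-to-stderr=true 2> gc_process.log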
>
> Has anyone encountered the same problem?
> What can I do to make the GC threads proceed on gc.0-gc.31, or to get
> everything back to normal?
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



