You could try to do this in a screen session for a while:

    while true; do radosgw-admin gc process; done

Maybe your normal RGW daemons are too busy for GC processing. We have this in our config and have started extra RGW instances for GC only:

[global]
...
# disable the garbage collector by default
rgw_enable_gc_threads = false

[client.gc-host1]
rgw_frontends = "beast endpoint=[::1]:7489"
rgw_enable_gc_threads = true

On Wed, Dec 14, 2022 at 01:14, Jakub Jaszewski <jaszewski.jakub@xxxxxxxxx> wrote:
>
> Hi Boris, many thanks for the link!
>
> I see that the GC list keeps growing on my cluster and there are some very big multipart objects on the GC list, even 138660 parts, which I calculate as >500 GB in size.
> These objects are visible on the GC list but not at the rados level when calling radosgw-admin --bucket=bucket_name bucket radoslist
> I also manually ran GC processing, radosgw-admin gc process --bucket=bucket_name --debug-rgw=20, which according to the logs did the job (no errors were raised, although the objects do not exist in rados?)
> ...
> 2022-12-13T20:21:06.635+0100 7fe0eb771080 20 garbage collection: RGWGC::process iterating over entry tag='2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD', time=2022-12-13T12:35:59.727067+0100, chain.objs.size()=138660
> 2022-12-13T20:21:06.635+0100 7fe0eb771080 5 garbage collection: RGWGC::process removing default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__multipart_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1
> 2022-12-13T20:21:06.703+0100 7fe0eb771080 5 garbage collection: RGWGC::process removing default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__shadow_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1_1
> 2022-12-13T20:21:06.859+0100 7fe0eb771080 5 garbage collection: RGWGC::process removing default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__shadow_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1_2
> ...
> but the GC queue did not shrink; the objects are still on the GC list.
>
> Do you happen to know how to remove non-existent RADOS objects from the RGW GC list?
>
> One more thing I have to check is max_secs=3600 for GC when entering a particular index_shard. As you can see in the logs, processing of multipart objects takes more than 3600 seconds. I will try to increase rgw_gc_processor_max_time.
>
> 2022-12-13T20:20:13.168+0100 7fe0eb771080 20 garbage collection: RGWGC::process entered with GC index_shard=25, max_secs=3600, expired_only=1
> 2022-12-13T20:20:13.168+0100 7fe0eb771080 20 garbage collection: RGWGC::process cls_rgw_gc_list returned with returned:0, entries.size=0, truncated=0, next_marker=''
> 2022-12-13T20:20:13.172+0100 7fe0eb771080 20 garbage collection: RGWGC::process cls_rgw_gc_list returned NO non expired entries, so setting cache entry to TRUE
> 2022-12-13T20:20:27.748+0100 7fe02ffff700 2 RGWDataChangesLog::ChangesRenewThread: start
> 2022-12-13T20:20:49.748+0100 7fe02ffff700 2 RGWDataChangesLog::ChangesRenewThread: start
> ...
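(Regarding the max_secs=3600 above: that value comes from rgw_gc_processor_max_time, so raising it should let one GC pass stay on a single index shard for longer. I have not tried it here, but when running GC by hand the override can be passed directly on the command line, roughly like this:

    radosgw-admin gc process --rgw-gc-processor-max-time=14400

Alternatively, set rgw_gc_processor_max_time in ceph.conf for the GC-only instances and restart them.)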
> 2022-12-13T20:21:05.339+0100 7fe0eb771080 20 garbage collection: RGWGC::process cls_rgw_gc_queue_list_entries returned with return value:0, entries.size=100, truncated=1, next_marker='4/20986990'
> 2022-12-13T20:21:06.635+0100 7fe0eb771080 20 garbage collection: RGWGC::process iterating over entry tag='2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD', time=2022-12-13T12:35:59.727067+0100, chain.objs.size()=138660
> 2022-12-13T20:21:06.635+0100 7fe0eb771080 5 garbage collection: RGWGC::process removing default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__multipart_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1
> 2022-12-13T20:21:06.703+0100 7fe0eb771080 5 garbage collection: RGWGC::process removing default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__shadow_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.1_1
> ...
> 2022-12-13T21:31:23.505+0100 7fe0eb771080 5 garbage collection: RGWGC::process removing default.rgw.buckets.data:b4a09486-4fb6-474a-a45a-3fc6f7778e27.6781345.2__shadow_2ib3aonh7thn59a394l06un5i9lu2fhf1r2sl2g6rrhqbqhv6pjg.2~rBQjiZ4SWUf8u9IS1BUXGEwCnFTHqfD.4622_29
> ...
> 2022-12-13T21:31:23.565+0100 7fe0eb771080 20 garbage collection: RGWGC::process entered with GC index_shard=26, max_secs=3600, expired_only=1
> ...
>
> Best Regards
> Jakub
>
> On Wed, Dec 7, 2022 at 6:10 PM Boris <bb@xxxxxxxxx> wrote:
>>
>> Hi Jakub,
>>
>> The problem in our case is that we hit this bug (https://tracker.ceph.com/issues/53585) and the GC leads to it.
>>
>> We worked around this by moving the GC to a separate disk.
>> It still goes nuts all the time, but at least it does not bring down the cluster; however, it now looks like rados objects go missing.
>>
>> Kind regards
>> - Boris Behrens
>>
>> On 07.12.2022 at 17:44, Jakub Jaszewski <jaszewski.jakub@xxxxxxxxx> wrote:
>>
>>
>> Hi Boris,
>>
>> Did you manage to solve the problem?
>> I have a similar one after upgrading from nautilus 14.2.22 to octopus 15.2.17.
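(On the "moving the GC to separate disk" part quoted above: not necessarily exactly how we set it up, but the usual way to pin a pool to dedicated devices is a device class plus its own CRUSH rule. A rough sketch only, with placeholder OSD ids and a made-up class and rule name, and assuming the GC omap objects live in default.rgw.log as the osd_op dump further down suggests (gc.20 in pool 4 = default.rgw.log):

    ceph osd crush rm-device-class osd.100 osd.101 osd.102
    ceph osd crush set-device-class gc-ssd osd.100 osd.101 osd.102
    ceph osd crush rule create-replicated rgw-log-gc default host gc-ssd
    ceph osd pool set default.rgw.log crush_rule rgw-log-gc

Changing the crush_rule triggers backfill, so it is best done outside the deletion waves.)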
>> My ssd-class OSDs shut down after being marked down more than 5 times in 10 minutes, following a series of
>>
>> 2022-12-04T06:52:30.705+0000 7fb2a8f6c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fb28aed4700' had timed out after 15
>> 2022-12-04T06:52:43.585+0000 7fb286ecc700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7fb286ecc700' had timed out after 15
>>
>> 2022-12-04T06:52:43.597+0000 7fb298cea700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.62 down, but it is still running
>> 2022-12-04T06:52:43.597+0000 7fb298cea700 0 log_channel(cluster) log [DBG] : map e57013 wrongly marked me down at e57013
>> 2022-12-04T06:56:20.388+0000 7fb298cea700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.62 down, but it is still running
>> 2022-12-04T06:56:20.388+0000 7fb298cea700 0 log_channel(cluster) log [DBG] : map e57031 wrongly marked me down at e57030
>> 2022-12-04T07:16:07.323+0000 7fb298cea700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.62 down, but it is still running
>> 2022-12-04T07:16:07.323+0000 7fb298cea700 0 log_channel(cluster) log [DBG] : map e57040 wrongly marked me down at e57039
>> 2022-12-04T07:17:21.955+0000 7fb298cea700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.62 down, but it is still running
>> 2022-12-04T07:17:21.955+0000 7fb298cea700 0 log_channel(cluster) log [DBG] : map e57048 wrongly marked me down at e57048
>> 2022-12-04T07:18:40.723+0000 7fb298cea700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.62 down, but it is still running
>> 2022-12-04T07:18:40.723+0000 7fb298cea700 0 log_channel(cluster) log [DBG] : map e57058 wrongly marked me down at e57057
>> 2022-12-04T07:19:55.463+0000 7fb298cea700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.62 down, but it is still running
>> 2022-12-04T07:19:55.463+0000 7fb298cea700 0 log_channel(cluster) log [DBG] : map e57065 wrongly marked me down at e57065
>> 2022-12-04T07:21:14.030+0000 7fb298cea700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.62 down, but it is still running
>> 2022-12-04T07:21:14.030+0000 7fb298cea700 0 log_channel(cluster) log [DBG] : map e57074 wrongly marked me down at e57073
>> 2022-12-04T07:22:34.078+0000 7fb298cea700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.62 down, but it is still running
>> 2022-12-04T07:22:34.078+0000 7fb298cea700 0 log_channel(cluster) log [DBG] : map e57081 wrongly marked me down at e57080
>> 2022-12-04T07:22:34.078+0000 7fb298cea700 -1 osd.62 57081 _committed_osd_maps marked down 6 > osd_max_markdown_count 5 in last 600.000000 seconds, shutting down
>>
>> At the same time I observe slow ops on the ssd-class OSDs; the ops look like:
>>
>> osd_op(client.575459444.0:1715813 4.7 4:e622f76a:gc::gc.20:head [call rgw_gc.rgw_gc_queue_list_entries in=31b] snapc 0=[] ondisk+read+known_if_redirected e54655)
>> osd_op(client.575459438.0:1403912 4.7 4:e09767b9:::obj_delete_at_hint.0000000086:head [call lock.lock in=8b] snapc 0=[] ondisk+write+known_if_redirected e54655)
>> osd_op(client.596382956.0:83988009 4.13 4:cea599e4:usage::usage.10:head [call rgw.user_usage_log_read in=94b] snapc 0=[] ondisk+read+known_if_redirected e57109)
>> osd_op(client.596345306.0:14545144 4.13 4:cea599e4:usage::usage.10:head [call rgw.user_usage_log_add in=173b] snapc 0=[] ondisk+write+known_if_redirected e57109)
>> osd_op(client.596345312.0:2369917 4.13 4:cedbb205:::obj_delete_at_hint.0000000017:head [call lock.lock in=50b] snapc 0=[] ondisk+write+known_if_redirected e57109)
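(If you want more detail on those ops when the next deletion wave hits, dumping them from the admin socket of the affected OSD might help; a rough sketch, using osd.62 from the log above and run on the host that carries it:

    ceph daemon osd.62 dump_ops_in_flight
    ceph daemon osd.62 dump_historic_ops

The historic dump includes per-op event timestamps, which should show whether the time is spent inside the rgw_gc/lock class calls or waiting in the queue behind them.)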
>>
>> where pool number 4 is the default.rgw.log pool; I am not sure how to decode the remaining part of the osd_op.
>>
>> The involved SSDs are shared by the RGW pools default.rgw.buckets.index and default.rgw.log and are far from being saturated:
>>
>> --- POOLS ---
>> POOL                       ID  PGS  STORED   (DATA)  (OMAP)   OBJECTS  USED     (DATA)  (OMAP)   %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
>> default.rgw.buckets.index   1   32   52 GiB     0 B   52 GiB    6.21k  157 GiB     0 B  157 GiB   0.67    7.6 TiB            N/A          N/A    N/A         0 B          0 B
>> default.rgw.log             4   32  514 MiB  11 MiB  502 MiB      238  1.5 GiB  35 MiB  1.5 GiB      0    7.6 TiB            N/A          N/A    N/A         0 B          0 B
>>
>> After analyzing the RGW traffic, it seems to be related to a wave of object deletions; the next wave should occur on Sunday. I will increase debug_bluestore this time, maybe something interesting will be logged.
>> I've also performed a manual kvstore compaction and increased osd_op_thread_timeout from 15 to 60 seconds; I hope it will help.
>>
>>
>> Best Regards
>> Jakub
>>
>>
>> On Tue, Apr 26, 2022 at 6:55 PM Boris <bb@xxxxxxxxx> wrote:
>>>
>>> I have this problem also with OSDs that use SSDs as block.db.
>>>
>>> >
>>> > On 26.04.2022 at 17:10, Konstantin Shalygin <k0ste@xxxxxxxx> wrote:
>>> >
>>> > Hi,
>>> >
>>> > After some load, HDDs will not perform well. You should move the block.db's to NVMe to avoid database vacuuming problems.
>>> >
>>> > k
>>> > Sent from my iPhone
>>> >
>>> >> On 26 Apr 2022, at 13:58, Boris Behrens <bb@xxxxxxxxx> wrote:
>>> >>
>>> >> The cluster contains 12x8TB OSDs without any SSDs as cache
>>> >

--
The self-help group "UTF-8 problems" will meet this time, as an exception, in the large hall.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx