Yeah, I suspect that regular manual compaction might be the necessary
workaround here if tombstones are slowing down iterator performance.
If it is related to tombstones, it would be similar to what we saw when
we tried to use deleterange and hit the same kind of performance issues.
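Roughly what I mean by "regular manual compaction" - a hedged sketch,
assuming the compact admin command is available on your release, with
osd.60 used purely as a placeholder id:

```
# Trigger an online RocksDB compaction on a running OSD (placeholder id);
# this can be wrapped in a cron job and looped over all OSDs.
ceph tell osd.60 compact

# Equivalent via the admin socket on the OSD's host:
ceph daemon osd.60 compact
```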
I'm a little at a loss as to why nautilus was better (other than the
ill-fated bluefs_buffered_io change). There has been a fair amount of
code churn related to this in both Ceph and RocksDB, though. Pacific is
definitely more likely to get backports for this kind of thing IMHO.
Mark
On 7/26/21 6:19 AM, Igor Fedotov wrote:
Unfortunately I'm not an expert in RGW, hence nothing to recommend from
that side.
Apparently your issues are caused by bulk data removal - RocksDB can
hardly sustain such workloads and its performance degrades. We've seen
that plenty of times before.
So far there are two known workarounds - manual DB compaction using
ceph-kvstore-tool, and setting bluefs_buffered_io to true. The latter
makes sense for the Ceph releases which got that parameter set to false
by default; v15.2.12 is one of them. And indeed that setting might cause
high RAM usage in some cases - you might want to look for relevant
recent PRs at github or ask Mark Nelson from RH for more details.
Nevertheless the current upstream recommendation/default is to have it
set to true, as it greatly improves DB performance.
So you might want to try compacting RocksDB as per the above, but please
note that's a temporary workaround - the DB is likely to start degrading
again if removals keep going on.
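For reference, a rough sketch of both workarounds; the OSD id and data
path below are placeholders, and the offline compaction requires the OSD
to be stopped first:

```
# 1) Offline compaction of the OSD's RocksDB with ceph-kvstore-tool:
systemctl stop ceph-osd@60
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-60 compact
systemctl start ceph-osd@60

# 2) Turn buffered BlueFS I/O back on (may need an OSD restart to take
#    effect, depending on the release):
ceph config set osd bluefs_buffered_io true
```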
There is also a PR to address the bulk removal issue in general:
1) https://github.com/ceph/ceph/pull/37496 (still pending review and
unlikely to be backported to Octopus).
One more question - do your HDD OSDs have additional fast (SSD/NVMe)
drives for their DB volumes, or do their DBs reside on the spinning
drives only? If the latter is true I would strongly encourage you to fix
that by adding respective fast disks - RocksDB tends to work badly when
not deployed on SSDs...
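One way to check where an OSD's DB lives - a sketch only, since the
exact metadata field names vary a bit between releases (osd.60 is a
placeholder):

```
# Look for the bluefs_dedicated_db / bluefs_db_rotational fields:
ceph osd metadata 60 | grep -i bluefs

# Or inspect the LVM layout on the OSD host:
ceph-volume lvm list
```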
Thanks,
Igor
On 7/26/2021 1:28 AM, mahnoosh shahidi wrote:
Hi Igor,
Thanks for your response. This problem happens on my osds with HDD
disks. I set bluefs_buffered_io to true just for these osds, but
it caused my bucket index disks (which are SSD) to produce slow ops.
I also tried to set bluefs_buffered_io to true on the bucket index osds,
but they filled the entire memory (256G), so I had to set
bluefs_buffered_io back to false on all osds. Is that the only way to
handle the garbage collector problem? Do you have any ideas for the
bucket index problem?
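(For anyone following along: one hedged way to scope that option to the
HDD OSDs only is a device-class config mask, assuming the OSDs carry the
usual hdd/ssd device classes - a sketch, not a recommendation:)

```
# Apply bluefs_buffered_io only to OSDs whose CRUSH device class is hdd,
# leaving the ssd-backed bucket index OSDs untouched:
ceph config set osd/class:hdd bluefs_buffered_io true
```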
On Thu, Jul 22, 2021 at 3:37 AM Igor Fedotov <ifedotov@xxxxxxx> wrote:
Hi Mahnoosh,
you might want to set bluefs_buffered_io to true for every OSD.
It looks like it's false by default in v15.2.12.
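A quick way to double-check what an OSD is actually running with - a
sketch, with osd.60 as a placeholder id:

```
# Value stored in the cluster config database:
ceph config get osd bluefs_buffered_io

# Runtime value on a specific OSD, via its admin socket:
ceph daemon osd.60 config get bluefs_buffered_io
```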
Thanks,
Igor
On 7/18/2021 11:19 PM, mahnoosh shahidi wrote:
> We have a ceph cluster with 408 osds, 3 mons and 3 rgws. We updated our
> cluster from nautilus 14.2.14 to octopus 15.2.12 a few days ago. After
> upgrading, the garbage collector process, which is run after the lifecycle
> process, causes slow ops and makes some osds restart. In each
> process the garbage collector deletes about 1 million objects. Below are
> the logs from one of the osds before it restarts.
>
> ```
> 2021-07-18T00:44:38.807+0430 7fd1cda76700 1 osd.60 1092400 is_healthy false -- internal heartbeat failed
> 2021-07-18T00:44:38.807+0430 7fd1cda76700 1 osd.60 1092400 not healthy; waiting to boot
> 2021-07-18T00:44:39.847+0430 7fd1cda76700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd1b4243700' had timed out after 15
> 2021-07-18T00:44:39.847+0430 7fd1cda76700 1 osd.60 1092400 is_healthy false -- internal heartbeat failed
> 2021-07-18T00:44:39.847+0430 7fd1cda76700 1 osd.60 1092400 not healthy; waiting to boot
> 2021-07-18T00:44:40.895+0430 7fd1cda76700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd1b4243700' had timed out after 15
> 2021-07-18T00:44:40.895+0430 7fd1cda76700 1 osd.60 1092400 is_healthy false -- internal heartbeat failed
> 2021-07-18T00:44:40.895+0430 7fd1cda76700 1 osd.60 1092400 not healthy; waiting to boot
> 2021-07-18T00:44:41.859+0430 7fd1cda76700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd1b4243700' had timed out after 15
> 2021-07-18T00:44:41.859+0430 7fd1cda76700 1 osd.60 1092400 is_healthy false -- internal heartbeat failed
> 2021-07-18T00:44:41.859+0430 7fd1cda76700 1 osd.60 1092400 not healthy; waiting to boot
> 2021-07-18T00:44:42.811+0430 7fd1cda76700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd1b4243700' had timed out after 15
> 2021-07-18T00:44:42.811+0430 7fd1cda76700 1 osd.60 1092400 is_healthy false -- internal heartbeat failed
>
> ```
> What is the suitable configuration for gc in such a heavy delete process so
> it doesn't cause slow ops? We had the same delete load in nautilus but we
> didn't have any problem with that.
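(On the quoted GC question: the knobs that usually throttle RGW garbage
collection are the rgw_gc_* options. A hedged sketch of where they could
be tuned - the gateway section name and the values are placeholders, not
recommendations:)

```
# ceph.conf snippet (hypothetical gateway name, example values only):
[client.rgw.mygw]
    rgw_gc_max_concurrent_io = 5      # concurrent GC I/Os against RADOS
    rgw_gc_max_trim_chunk = 16        # GC log entries trimmed per operation
    rgw_gc_processor_max_time = 3600  # max seconds for one GC cycle
    rgw_gc_obj_min_wait = 7200        # min delay before deleted data is GC'd
```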
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx