Re: [EXTERNAL] Re: Massive OMAP remediation

Hi Dan,

Thanks for the response. No, I have not yet told the OSDs participating in that PG to compact. It was something I had considered, but I was somewhat concerned about what it might do, what the performance impact would be, and whether the OSD would come out alive on the other side.

I think we may have found a less impactful way to trim these bilog entries: using `--start-marker` and `--end-marker` and simply looping, incrementing those marker values by 1000 each time. This is far less impactful than running the commands without those flags, which took ~45 seconds each time just to enumerate the bilog entries to trim, during which the lead OSD was nearly unresponsive. It took diving into the source code, the help of a few colleagues, and some trial and error on non-production systems to figure out what values those arguments actually wanted. Thankfully I was able to get a listing of all OMAP keys for that object a couple of weeks ago.

I’m still not sure how comfortable I would be doing this to a bucket that was actually mission critical (this one contains non-critical data), but I think we may have a way forward to dislodge this large OMAP by trimming. Thanks again!
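
For reference, the trim loop was roughly shaped like the sketch below. The bucket name, the zero-padded marker format, the loop bound, and the pause between batches are illustrative placeholders rather than our exact values; the marker format needs to match whatever `radosgw-admin bilog list --bucket=<bucket>` shows for the bucket in question.

  BUCKET="example-bucket"       # placeholder bucket name
  STEP=1000
  start=0
  while [ "$start" -lt 360000000 ]; do    # upper bound is a placeholder
      end=$((start + STEP))
      # markers are shown zero-padded purely for illustration
      radosgw-admin bilog trim --bucket="$BUCKET" \
          --start-marker="$(printf '%011d' "$start")" \
          --end-marker="$(printf '%011d' "$end")"
      start=$end
      sleep 5                   # give the lead OSD a moment between batches
  done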

-Ben

From: Dan van der Ster <dan.vanderster@xxxxxxxxx>
Date: Wednesday, April 26, 2023 at 11:11 AM
To: Ben.Zieglmeier <Ben.Zieglmeier@xxxxxxxxxx>
Cc: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: [EXTERNAL] Re:  Massive OMAP remediation
Hi Ben,

Are you compacting the relevant OSDs periodically? ceph tell osd.x
compact (for the three OSDs holding the bilog) would help reshape the
RocksDB levels so they at least perform better for a little while,
until the next round of bilog trims.
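
For example, roughly (the pool and object names below are placeholders;
substitute your actual index pool and bucket instance id):

  # find the PG and acting set for the index object
  ceph osd map default.rgw.buckets.index .dir.<bucket-instance-id>
  # then compact each OSD in the acting set, e.g. if it reports [12,45,78]:
  ceph tell osd.12 compact
  ceph tell osd.45 compact
  ceph tell osd.78 compact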

Otherwise, I have experience deleting ~50M object indices in one step
in the past, probably back in the Luminous days IIRC. It will likely
lock up the relevant OSDs while the omap is removed. If you dare take
that step, it might help to set nodown; that might prevent other OSDs
from flapping and creating more work.
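
That is, wrapping the deletion with something along the lines of:

  ceph osd set nodown     # keep OSDs from being marked down during the heavy omap delete
  # ... perform the delete / trim ...
  ceph osd unset nodown   # clear the flag again afterwards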

Cheers, Dan

______________________________
Clyso GmbH | https://www.clyso.com


On Tue, Apr 25, 2023 at 2:45 PM Ben.Zieglmeier
<Ben.Zieglmeier@xxxxxxxxxx> wrote:
>
> Hi All,
>
> We have an RGW cluster running Luminous (12.2.11) that has one object with an extremely large OMAP database in the index pool. Listomapkeys on the object returned 390 million keys to start. Through bilog trim commands, we’ve whittled that down to about 360 million. This is a bucket index for a regrettably unsharded bucket. There are only about 37K objects actually in the bucket, but through years of neglect the bilog has grown completely out of control. We’ve hit some major problems trying to deal with this particular OMAP object. We just crashed 4 OSDs when a bilog trim caused enough churn to knock one of the OSDs housing this PG out of the cluster temporarily. The OSD disks are 6.4TB NVMe, but are split into 4 partitions, each housing its own OSD daemon (collocated journal).
>
> We want to be rid of this large OMAP object, but are running out of options to deal with it. Resharding outright does not seem like a viable option, as we believe the deletion would deadlock OSDs and could cause impact. Continuing to run `bilog trim` 1000 records at a time is what we’ve been doing, but this also seems to be impacting performance/stability. We are seeking options to remove this problematic object without creating additional problems. It is quite likely this bucket is abandoned, so we could remove the data, but I fear even the deletion of such a large OMAP could bring OSDs down and risk metadata loss for the other bucket indexes on that same PG.
>
> Any insight available would be highly appreciated.
>
> Thanks.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



