Re: Large omap in index pool even if properly sharded and not "OVER"

Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> · Mon, 15 Jul 2024 15:22:41 +0200 (CEST)

Right. 

This procedure would recatalog any lost S3 objects (out of the 40TB) and would allow you to delete them using S3 afterwards if that's what you want. 
Note that I don't think it handles different versions of S3 objects, if any exist, so you might still end up with orphaned data in the RADOS pool. 

If you no longer have any interest in this bucket, you could simply purge the bucket with all its data, then use 'rados ls' to list any orphan objects whose names begin with the bucket prefix (make sure you saved this information before deleting the bucket), and finally use 'rados rm' to remove them. 
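
For that last step, a minimal sketch of the list-and-remove pass might look like the following. It assumes RGW data objects are named '<marker>_<rest>', so it matches on the marker plus a trailing underscore to avoid catching another bucket whose marker merely starts with the same characters (for example a marker ending in '.1' versus one ending in '.10'). The function names and the RADOS override variable are made up for illustration; only 'rados -p <pool> ls' and 'rados -p <pool> rm <object>' are real commands.

```sh
# Sketch only. list_bucket_objects, remove_listed_objects and the RADOS
# override are hypothetical helpers, not part of Ceph.

# Print every object in the pool whose name starts with "<marker>_".
# ASSUMPTION: RGW data objects are named "<marker>_<rest>".
list_bucket_objects() {
    pool=$1
    marker=$2
    "${RADOS:-rados}" -p "$pool" ls |
        awk -v prefix="${marker}_" 'index($0, prefix) == 1'
}

# Remove every object named in a previously saved (and reviewed) list file.
remove_listed_objects() {
    pool=$1
    list=$2
    while IFS= read -r obj; do
        # stdin is redirected so the command cannot swallow the list.
        "${RADOS:-rados}" -p "$pool" rm "$obj" < /dev/null ||
            echo "failed to remove: $obj" >&2
    done < "$list"
}
```

Usage would be something like 'list_bucket_objects <data_pool> <marker> > orphans.txt', a manual review of orphans.txt, then 'remove_listed_objects <data_pool> orphans.txt'. Reviewing the list before removing anything is strongly advisable, since 'rados rm' cannot be undone.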

Regards, 
Frédéric. 

----- On 15 Jul 24, at 5:30, Istvan Szabo, Agoda <Istvan.Szabo@xxxxxxxxx> wrote:

> Hi,

> But this does not clean anything up, right? It only restores the index if it was lost.

> Istvan Szabo
> Staff Infrastructure Engineer
> ---------------------------------------------------
> Agoda Services Co., Ltd.
> e: istvan.szabo@xxxxxxxxx
> ---------------------------------------------------

> From: Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
> Sent: Friday, July 12, 2024 6:52 PM
> To: Richard Bade <hitrich@xxxxxxxxx>; Szabo, Istvan (Agoda)
> <Istvan.Szabo@xxxxxxxxx>
> Cc: Casey Bodley <cbodley@xxxxxxxxxx>; Ceph Users <ceph-users@xxxxxxx>
> Subject: Re:  Re: Large omap in index pool even if properly sharded
> and not "OVER"
> Email received from the internet. If in doubt, don't click any link nor open any
> attachment !
> ________________________________

> ----- On 11 Jul 24, at 0:23, Richard Bade hitrich@xxxxxxxxx wrote:

> > Hi Casey,
> > Thanks for that info on the bilog. I'm in a similar situation with
> > large omap objects and we have also had to reshard buckets on
> > multisite losing the index on the secondary.
> > We also now have a lot of buckets with sync disabled so I wanted to
> > check that it's always safe to trim the bilog on buckets with sync
> > disabled?
> > I can see some stale entries with "completed" state and a timestamp of
> > a number of months ago but also some that say pending and have no
> > timestamp.

> > Istvan, I can also possibly help with your orphaned 40TB on the secondary zone.
> > Each object has the bucket marker in its name. If you do a `rados -p
> > {pool_name} ls` and find all the ones that start with the bucket
> > marker (found with `radosgw-admin bucket stats
> > --bucket={bucket_name}`) then you can do one of two things:
> > 1, `rados rm` the object
> > 2, restore the index with info from the object itself
> > - create a dummy index template (use `radosgw-admin bi get` on a
> > known good index to get the structure)
> > - grab the etag from the object xattrs and use this and the name
> > in the template (`rados -p {pool} getxattr {objname} user.rgw.etag`)
> > - use `radosgw-admin bi put` to create the index
> > - use `radosgw-admin bucket check --check-objects --fix
> > --bucket={bucket_name}` to fix up the bucket object count and object
> > sizes at the end

> One could also use `radosgw-admin object reindex --bucket {bucket_name}` to scan
> the data pool for objects that belong to a given bucket and add those objects
> back to the bucket index.

> This uses the same logic as the rgw-restore-bucket-index script [1][2], which
> has proven successful in recovering bucket indexes destroyed by resharding [3].

> Regards,
> Frédéric.

> [1] https://docs.ceph.com/en/latest/man/8/rgw-restore-bucket-index/
> [2] https://github.com/ceph/ceph/blob/main/src/rgw/rgw-restore-bucket-index
> [3] https://github.com/ceph/ceph/pull/50329

> > This process takes quite some time and I can't say if it's 100%
> > perfect but it enabled us to get to a state where we could delete the
> > buckets and clean up the objects.
> > I hope this helps.

> > Regards,
> > Richard

> > On Thu, 11 Jul 2024 at 01:25, Casey Bodley <cbodley@xxxxxxxxxx> wrote:

> >> On Tue, Jul 9, 2024 at 12:41 PM Szabo, Istvan (Agoda)
> >> <Istvan.Szabo@xxxxxxxxx> wrote:

> >> > Hi Casey,

> >> > 1.
> >> > Regarding versioning, the user doesn't use versioning, if I'm not mistaken:
> >> > https://gist.githubusercontent.com/Badb0yBadb0y/d80c1bdb8609088970413969826d2b7d/raw/baee46865178fff454c224040525b55b54e27218/gistfile1.txt

> >> > 2.
> >> > Regarding multiparts, if the bucket had multipart trash, it would be listed
> >> > as rgw.multimeta under the usage here, right?
> >> > https://gist.githubusercontent.com/Badb0yBadb0y/d80c1bdb8609088970413969826d2b7d/raw/baee46865178fff454c224040525b55b54e27218/gistfile1.txt

> >> > 3.
> >> > Regarding the multisite idea, this bucket was a multisite bucket last year,
> >> > but we had to reshard it (accepting that we would lose the replica on the 2nd
> >> > site and keep it only in the master site). At that time, as expected, it
> >> > disappeared completely from the 2nd site (I guess the 40TB of trash is still
> >> > there, but I can't figure out how to clean it up 🙁 ). Now it is a single-site
> >> > bucket.
> >> > Also, this is the index pool; multisite logs should go to the rgw.log pool,
> >> > shouldn't they?

> >> some replication logs are in the log pool, but the per-object logs are
> >> stored in the bucket index objects. you can inspect these with
> >> `radosgw-admin bilog list --bucket=X`. by default, that will only list
> >> --max-entries=1000. you can add --shard-id=Y to look at specific
> >> 'large omap' objects

> >> even if your single-site bucket doesn't exist on the secondary zone,
> >> changes on the primary zone are probably still generating these bilog
> >> entries. you would need to do something like `radosgw-admin bucket
> >> sync disable --bucket=X` to make it stop. because you don't expect
> >> these changes to replicate, it's safe to delete any of this bucket's
> >> bilog entries with `radosgw-admin bilog trim --end-marker 9
> >> --bucket=X`. depending on ceph version, you may need to run this trim
> >> command in a loop until the `bilog list` output is empty
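
A rough sketch of such a trim loop, for what it's worth. It assumes that `radosgw-admin bilog list` prints an empty JSON array (`[]`) once nothing is left, and it caps the number of rounds so it cannot spin forever. `trim_bilog`, the round cap and the `RGW_ADMIN` override are hypothetical names, not part of Ceph; the two `radosgw-admin` invocations are the ones quoted above.

```sh
# Sketch only. trim_bilog, its round cap and the RGW_ADMIN override are
# hypothetical. Only use this on a bucket whose changes are not expected
# to replicate (sync disabled).
trim_bilog() {
    bucket=$1
    max_rounds=${2:-1000}
    round=0
    while [ "$round" -lt "$max_rounds" ]; do
        # ASSUMPTION: an empty bilog is printed as an empty JSON array.
        remaining=$("${RGW_ADMIN:-radosgw-admin}" bilog list \
            --bucket="$bucket" --max-entries=1 | tr -d '[:space:]')
        if [ "$remaining" = "[]" ]; then
            return 0
        fi
        "${RGW_ADMIN:-radosgw-admin}" bilog trim --end-marker 9 \
            --bucket="$bucket" || return 1
        round=$((round + 1))
    done
    echo "bilog for $bucket still not empty after $max_rounds rounds" >&2
    return 1
}
```

Something like `trim_bilog X` would then keep trimming until the list comes back empty, and return non-zero if the cap is reached first.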

> >> radosgw does eventually trim bilogs in the background after they're
> >> processed, but the secondary zone isn't processing them in this case

> >> > Thank you

> >> > ________________________________
> >> > From: Casey Bodley <cbodley@xxxxxxxxxx>
> >> > Sent: Tuesday, July 9, 2024 10:39 PM
> >> > To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
> >> > Cc: Eugen Block <eblock@xxxxxx>; ceph-users@xxxxxxx <ceph-users@xxxxxxx>
> >> > Subject: Re:  Re: Large omap in index pool even if properly sharded
> >> > and not "OVER"

> >> > in general, these omap entries should be evenly spread over the
> >> > bucket's index shard objects. but there are two features that may
> >> > cause entries to clump on a single shard:

> >> > 1. for versioned buckets, multiple versions of the same object name
> >> > map to the same index shard. this can become an issue if an
> >> > application is repeatedly overwriting an object without cleaning up
> >> > old versions. lifecycle rules can help to manage these noncurrent
> >> > versions

> >> > 2. during a multipart upload, all of the parts are tracked on the same
> >> > index shard as the final object name. if applications are leaving a
> >> > lot of incomplete multipart uploads behind (especially if they target
> >> > the same object name) this can lead to similar clumping. the S3 api
> >> > has operations to list and abort incomplete multipart uploads, along
> >> > with lifecycle rules to automate their cleanup

> >> > separately, multisite clusters use these same index shards to store
> >> > replication logs. if sync gets far enough behind, these log entries
> >> > can also lead to large omap warnings

> >> > On Tue, Jul 9, 2024 at 10:25 AM Szabo, Istvan (Agoda)
> >> > <Istvan.Szabo@xxxxxxxxx> wrote:

> >> > > It's the same bucket:
> >> > > https://gist.github.com/Badb0yBadb0y/d80c1bdb8609088970413969826d2b7d

> >> > > ________________________________
> >> > > From: Eugen Block <eblock@xxxxxx>
> >> > > Sent: Tuesday, July 9, 2024 8:03 PM
> >> > > To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
> >> > > Cc: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
> >> > > Subject: Re:  Re: Large omap in index pool even if properly sharded
> >> > > and not "OVER"

> >> > > Are those three different buckets? Could you share the stats for each of them?

> >> > > radosgw-admin bucket stats --bucket=<BUCKET>

> >> > > Quoting "Szabo, Istvan (Agoda)" <Istvan.Szabo@xxxxxxxxx>:

> >> > > > Hello,

> >> > > > Yeah, still:

> >> > > > the .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.151 | wc -l
> >> > > > 290005

> >> > > > and the
> >> > > > .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.726 | wc -l
> >> > > > 289378

> >> > > > And just to make me even happier, I have one more:
> >> > > > .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.6 | wc -l
> >> > > > 181588

> >> > > > This is my CRUSH tree (I'm using a host-based CRUSH rule):
> >> > > > https://gist.githubusercontent.com/Badb0yBadb0y/9bea911701184a51575619bc99cca94d/raw/e5e4a918d327769bb874aaed279a8428fd7150d5/gistfile1.txt

> >> > > > I'm wondering whether the issue could be that hosts 2s13-15 have fewer NVMe
> >> > > > OSDs than the others (the same size, but the other 12 hosts have 8x
> >> > > > NVMe OSDs each)?
> >> > > > But the PGs are located like this:

> >> > > > pg26.427
> >> > > > osd.261 host8
> >> > > > osd.488 host13
> >> > > > osd.276 host4

> >> > > > pg26.606
> >> > > > osd.443 host12
> >> > > > osd.197 host8
> >> > > > osd.524 host14

> >> > > > pg26.78c
> >> > > > osd.89 host7
> >> > > > osd.406 host11
> >> > > > osd.254 host6

> >> > > > If pg26.78c weren't in this list, I'd say 100% that the per-host NVMe OSD
> >> > > > distribution is the issue; however, this PG is not located on any of
> >> > > > the 4x NVMe OSD nodes 😕

> >> > > > Ty

> >> > > > ________________________________
> >> > > > From: Eugen Block <eblock@xxxxxx>
> >> > > > Sent: Tuesday, July 9, 2024 6:02 PM
> >> > > > To: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
> >> > > > Subject:  Re: Large omap in index pool even if properly
> >> > > > sharded and not "OVER"

> >> > > > Hi,

> >> > > > the number of shards looks fine, maybe this was just a temporary
> >> > > > burst? Did you check if the rados objects in the index pool still have
> >> > > > more than 200k omap keys? I would try something like

> >> > > > rados -p <index_pool> listomapkeys .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.151 | wc -l
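
To find which shards are over the threshold rather than checking them one at a time, the same command can be looped over every shard. A minimal sketch, assuming index objects are named `.dir.<bucket_id>.<shard>` as in the object name above; `count_shard_keys` and the `RADOS` override are hypothetical helpers, not Ceph tooling.

```sh
# Sketch only. count_shard_keys and the RADOS override are hypothetical.
# Prints one "<key_count> <shard>" line per index shard object.
count_shard_keys() {
    pool=$1
    bucket_id=$2
    num_shards=$3
    shard=0
    while [ "$shard" -lt "$num_shards" ]; do
        # Index shard objects are named .dir.<bucket_id>.<shard>
        keys=$("${RADOS:-rados}" -p "$pool" listomapkeys \
            ".dir.${bucket_id}.${shard}" | wc -l | tr -d ' ')
        echo "$keys $shard"
        shard=$((shard + 1))
    done
}
```

Piping the result through `sort -rn | head` would show the fullest shards first, e.g. `count_shard_keys <index_pool> 9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1 1999 | sort -rn | head`. Expect this to take a while with 1999 shards.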

> >> > > > Quoting "Szabo, Istvan (Agoda)" <Istvan.Szabo@xxxxxxxxx>:

> >> > > >> Hi,

> >> > > >> I have a pretty big bucket which is sharded with 1999 shards, so in
> >> > > >> theory it can hold close to 200M objects (199,900,000).
> >> > > >> Currently it has 54M objects.

> >> > > >> Bucket limit check also looks good:
> >> > > >> "bucket": "xyz",
> >> > > >> "tenant": "",
> >> > > >> "num_objects": 53619489,
> >> > > >> "num_shards": 1999,
> >> > > >> "objects_per_shard": 26823,
> >> > > >> "fill_status": "OK"
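
The arithmetic behind those figures, assuming the default of 100000 objects per shard (`rgw_max_objs_per_shard`), works out as follows. Keep in mind that 26823 is only an average; it says nothing about how evenly the keys are spread across individual shards.

```sh
# Figures taken from the message above; 100000 is assumed to be the
# per-shard limit in effect (the rgw_max_objs_per_shard default).
shards=1999
objects=53619489
max_per_shard=100000

capacity=$((shards * max_per_shard))   # theoretical object capacity
avg_per_shard=$((objects / shards))    # integer average per shard

echo "capacity=$capacity avg_per_shard=$avg_per_shard"
# prints: capacity=199900000 avg_per_shard=26823
```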

> >> > > >> This is the bucket id:
> >> > > >> "id": "9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1"

> >> > > >> These are the log lines:
> >> > > >> 2024-06-27T10:41:05.679870+0700 osd.261 (osd.261) 9643 : cluster [WRN]
> >> > > >> Large omap object found. Object:
> >> > > >> 26:e433e65c:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.151:head
> >> > > >> PG: 26.3a67cc27 (26.427) Key count: 236919 Size (bytes): 89969920

> >> > > >> 2024-06-27T10:43:35.557835+0700 osd.89 (osd.89) 9000 : cluster [WRN]
> >> > > >> Large omap object found. Object:
> >> > > >> 26:31ff4df1:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.726:head
> >> > > >> PG: 26.8fb2ff8c (26.78c) Key count: 236495 Size (bytes): 95560458

> >> > > >> I tried to deep-scrub the affected PGs and the OSDs mentioned in the
> >> > > >> log, but it didn't help.
> >> > > >> Why? What am I missing?

> >> > > >> Thank you in advance for your help.

> >> > > >> ________________________________
> >> > > >> This message is confidential and is for the sole use of the intended
> >> > > >> recipient(s). It may also be privileged or otherwise protected by
> >> > > >> copyright or other legal rules. If you have received it by mistake
> >> > > >> please let us know by reply email and delete it from your system. It
> >> > > >> is prohibited to copy this message or disclose its content to
> >> > > >> anyone. Any confidentiality or privilege is not waived or lost by
> >> > > >> any mistaken delivery or unauthorized disclosure of the message. All
> >> > > >> messages sent to and from Agoda may be monitored to ensure
> >> > > >> compliance with company policies, to protect the company's interests
> >> > > >> and to remove potential malware. Electronic messages may be
> >> > > >> intercepted, amended, lost or deleted, or contain viruses.
> >> > > >> _______________________________________________
> >> > > >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> > > >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
