----- On 11 Jul 24, at 0:23, Richard Bade hitrich@xxxxxxxxx wrote:

> Hi Casey,
> Thanks for that info on the bilog. I'm in a similar situation with
> large omap objects, and we have also had to reshard buckets on
> multisite, losing the index on the secondary.
> We also now have a lot of buckets with sync disabled, so I wanted to
> check: is it always safe to trim the bilog on buckets with sync
> disabled?
> I can see some stale entries with "completed" state and a timestamp
> from a number of months ago, but also some that say pending and have
> no timestamp.
>
> Istvan, I can also possibly help with your orphaned 40TB on the
> secondary zone.
> Each object has the bucket marker in its name. If you do a `rados -p
> {pool_name} ls` and find all the objects that start with the bucket
> marker (found with `radosgw-admin bucket stats
> --bucket={bucket_name}`), then you can do one of two things:
> 1. `rados rm` the object
> 2. restore the index with info from the object itself:
>    - create a dummy index template (use `radosgw-admin bi get` on a
>      known good index to get the structure)
>    - grab the etag from the object's xattrs and use this and the name
>      in the template (`rados -p {pool} getxattr {objname} user.rgw.etag`)
>    - use `radosgw-admin bi put` to create the index
>    - use `radosgw-admin bucket check --check-objects --fix
>      --bucket={bucket_name}` to fix up the bucket object count and
>      object sizes at the end

One could also use `radosgw-admin object reindex --bucket {bucket_name}`
to scan the data pool for objects that belong to a given bucket and add
those objects back to the bucket index. It uses the same logic as the
rgw-restore-bucket-index script [1][2], which has proven successful in
recovering bucket indexes destroyed by resharding [3].

Regards,
Frédéric.

[1] https://docs.ceph.com/en/latest/man/8/rgw-restore-bucket-index/
[2] https://github.com/ceph/ceph/blob/main/src/rgw/rgw-restore-bucket-index
[3] https://github.com/ceph/ceph/pull/50329

> This process takes quite some time and I can't say if it's 100%
> perfect, but it enabled us to get to a state where we could delete
> the buckets and clean up the objects.
> I hope this helps.
>
> Regards,
> Richard
>
> On Thu, 11 Jul 2024 at 01:25, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>>
>> On Tue, Jul 9, 2024 at 12:41 PM Szabo, Istvan (Agoda)
>> <Istvan.Szabo@xxxxxxxxx> wrote:
>> >
>> > Hi Casey,
>> >
>> > 1.
>> > Regarding versioning, the user doesn't use versioning, if I'm not
>> > mistaken:
>> > https://gist.githubusercontent.com/Badb0yBadb0y/d80c1bdb8609088970413969826d2b7d/raw/baee46865178fff454c224040525b55b54e27218/gistfile1.txt
>> >
>> > 2.
>> > Regarding multiparts, if it had multipart trash, it would be listed
>> > here:
>> > https://gist.githubusercontent.com/Badb0yBadb0y/d80c1bdb8609088970413969826d2b7d/raw/baee46865178fff454c224040525b55b54e27218/gistfile1.txt
>> > as rgw.multimeta under the usage, right?
>> >
>> > 3.
>> > Regarding the multisite idea, this bucket was a multisite bucket
>> > last year, but we had to reshard (accepting the loss of the replica
>> > on the 2nd site and keeping it only on the master site), and at that
>> > time, as expected, it disappeared completely from the 2nd site (I
>> > guess the 40TB of trash is still there, but I can't really find out
>> > how to clean it up 🙁). Now it is a single-site bucket.
>> > Also, this is the index pool; multisite logs should go to the
>> > rgw.log pool, shouldn't they?
>>
>> some replication logs are in the log pool, but the per-object logs
>> are stored in the bucket index objects. you can inspect these with
>> `radosgw-admin bilog list --bucket=X`. by default, that will only
>> list --max-entries=1000. you can add --shard-id=Y to look at specific
>> 'large omap' objects
>>
>> even if your single-site bucket doesn't exist on the secondary zone,
>> changes on the primary zone are probably still generating these bilog
>> entries. you would need to do something like `radosgw-admin bucket
>> sync disable --bucket=X` to make it stop. because you don't expect
>> these changes to replicate, it's safe to delete any of this bucket's
>> bilog entries with `radosgw-admin bilog trim --end-marker 9
>> --bucket=X`. depending on ceph version, you may need to run this trim
>> command in a loop until the `bilog list` output is empty
>>
>> radosgw does eventually trim bilogs in the background after they're
>> processed, but the secondary zone isn't processing them in this case
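For reference, such a trim loop might look like the sketch below
(untested; assumes bash and jq are available, and the bucket name is a
placeholder -- only run this against buckets whose sync you have
disabled and whose logs you don't need):

    #!/usr/bin/env bash
    BUCKET=mybucket   # placeholder bucket name

    while true; do
        # count the remaining bilog entries (list is capped at 1000 by default)
        count=$(radosgw-admin bilog list --bucket="$BUCKET" --max-entries=1000 | jq 'length')
        if [ "$count" -eq 0 ]; then
            echo "bilog for $BUCKET is empty"
            break
        fi
        echo "$count entries remaining, trimming..."
        radosgw-admin bilog trim --bucket="$BUCKET" --end-marker 9
    done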
>>
>> >
>> > Thank you
>> >
>> >
>> > ________________________________
>> > From: Casey Bodley <cbodley@xxxxxxxxxx>
>> > Sent: Tuesday, July 9, 2024 10:39 PM
>> > To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>> > Cc: Eugen Block <eblock@xxxxxx>; ceph-users@xxxxxxx <ceph-users@xxxxxxx>
>> > Subject: Re: Re: Large omap in index pool even if properly sharded
>> > and not "OVER"
>> > ________________________________
>> >
>> > in general, these omap entries should be evenly spread over the
>> > bucket's index shard objects. but there are two features that may
>> > cause entries to clump on a single shard:
>> >
>> > 1. for versioned buckets, multiple versions of the same object name
>> > map to the same index shard. this can become an issue if an
>> > application is repeatedly overwriting an object without cleaning up
>> > old versions. lifecycle rules can help to manage these noncurrent
>> > versions
>> >
>> > 2. during a multipart upload, all of the parts are tracked on the
>> > same index shard as the final object name. if applications are
>> > leaving a lot of incomplete multipart uploads behind (especially if
>> > they target the same object name), this can lead to similar
>> > clumping. the S3 api has operations to list and abort incomplete
>> > multipart uploads, along with lifecycle rules to automate their
>> > cleanup
>> >
>> > separately, multisite clusters use these same index shards to store
>> > replication logs. if sync gets far enough behind, these log entries
>> > can also lead to large omap warnings
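As an illustration, a lifecycle configuration covering both of those
cases might look like the following sketch (the rule IDs and the 7-day
and 30-day windows are placeholders; apply it with any S3 client, e.g.
`aws s3api put-bucket-lifecycle-configuration --bucket mybucket
--lifecycle-configuration file://lc.json`):

    {
      "Rules": [
        {
          "ID": "abort-stale-multipart",
          "Status": "Enabled",
          "Filter": { "Prefix": "" },
          "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
        },
        {
          "ID": "expire-noncurrent-versions",
          "Status": "Enabled",
          "Filter": { "Prefix": "" },
          "NoncurrentVersionExpiration": { "NoncurrentDays": 30 }
        }
      ]
    }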
>> >
>> > On Tue, Jul 9, 2024 at 10:25 AM Szabo, Istvan (Agoda)
>> > <Istvan.Szabo@xxxxxxxxx> wrote:
>> > >
>> > > It's the same bucket:
>> > > https://gist.github.com/Badb0yBadb0y/d80c1bdb8609088970413969826d2b7d
>> > >
>> > >
>> > > ________________________________
>> > > From: Eugen Block <eblock@xxxxxx>
>> > > Sent: Tuesday, July 9, 2024 8:03 PM
>> > > To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
>> > > Cc: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
>> > > Subject: Re: Re: Large omap in index pool even if properly sharded
>> > > and not "OVER"
>> > > ________________________________
>> > >
>> > > Are those three different buckets? Could you share the stats for
>> > > each of them?
>> > >
>> > > radosgw-admin bucket stats --bucket=<BUCKET>
>> > >
>> > > Zitat von "Szabo, Istvan (Agoda)" <Istvan.Szabo@xxxxxxxxx>:
>> > >
>> > > > Hello,
>> > > >
>> > > > Yeah, still:
>> > > >
>> > > > .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.151 | wc -l
>> > > > 290005
>> > > >
>> > > > and
>> > > > .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.726 | wc -l
>> > > > 289378
>> > > >
>> > > > And to make me even happier, I have one more:
>> > > > .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.6 | wc -l
>> > > > 181588
>> > > >
>> > > > This is my crush tree (I'm using a host-based crush rule):
>> > > > https://gist.githubusercontent.com/Badb0yBadb0y/9bea911701184a51575619bc99cca94d/raw/e5e4a918d327769bb874aaed279a8428fd7150d5/gistfile1.txt
>> > > >
>> > > > Could the issue be that hosts 2s13-15 have fewer nvme osds than
>> > > > the others (though size-wise the same as the other 12 hosts,
>> > > > which have 8x nvme osds each)?
>> > > > But the pgs are located like this:
>> > > >
>> > > > pg26.427
>> > > > osd.261 host8
>> > > > osd.488 host13
>> > > > osd.276 host4
>> > > >
>> > > > pg26.606
>> > > > osd.443 host12
>> > > > osd.197 host8
>> > > > osd.524 host14
>> > > >
>> > > > pg26.78c
>> > > > osd.89 host7
>> > > > osd.406 host11
>> > > > osd.254 host6
>> > > >
>> > > > If pg26.78c weren't here, I'd say the nvme osd distribution
>> > > > across hosts is 100% the issue, but this pg is not located on
>> > > > any of the 4x nvme osd nodes 😕
>> > > >
>> > > > Ty
>> > > >
>> > > > ________________________________
>> > > > From: Eugen Block <eblock@xxxxxx>
>> > > > Sent: Tuesday, July 9, 2024 6:02 PM
>> > > > To: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
>> > > > Subject: Re: Large omap in index pool even if properly
>> > > > sharded and not "OVER"
>> > > > ________________________________
>> > > >
>> > > > Hi,
>> > > >
>> > > > the number of shards looks fine, maybe this was just a temporary
>> > > > burst? Did you check whether the rados objects in the index pool
>> > > > still have more than 200k omap keys? I would try something like
>> > > >
>> > > > rados -p <index_pool> listomapkeys \
>> > > > .dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.151 | wc -l
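To check every shard at once, a small loop over the index objects can
help (a sketch; the pool name, bucket marker, and shard count below are
placeholders taken from this thread -- substitute your own):

    POOL=default.rgw.buckets.index   # placeholder; use your index pool
    MARKER=9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1
    SHARDS=1999

    for i in $(seq 0 $((SHARDS - 1))); do
        # count the omap keys on each index shard object
        n=$(rados -p "$POOL" listomapkeys ".dir.${MARKER}.${i}" | wc -l)
        echo "shard ${i}: ${n} keys"
    done | sort -t: -k2 -rn | head   # fullest shards first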
>> > > >
>> > > > Zitat von "Szabo, Istvan (Agoda)" <Istvan.Szabo@xxxxxxxxx>:
>> > > >
>> > > >> Hi,
>> > > >>
>> > > >> I have a pretty big bucket which is sharded with 1999 shards,
>> > > >> so in theory it can hold close to 200m objects (199,900,000).
>> > > >> Currently it has 54m objects.
>> > > >>
>> > > >> The bucket limit check also looks good:
>> > > >> "bucket": "xyz",
>> > > >> "tenant": "",
>> > > >> "num_objects": 53619489,
>> > > >> "num_shards": 1999,
>> > > >> "objects_per_shard": 26823,
>> > > >> "fill_status": "OK"
>> > > >>
>> > > >> This is the bucket id:
>> > > >> "id": "9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1"
>> > > >>
>> > > >> These are the log lines:
>> > > >> 2024-06-27T10:41:05.679870+0700 osd.261 (osd.261) 9643 : cluster
>> > > >> [WRN] Large omap object found. Object:
>> > > >> 26:e433e65c:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.151:head
>> > > >> PG: 26.3a67cc27 (26.427) Key count: 236919 Size (bytes): 89969920
>> > > >>
>> > > >> 2024-06-27T10:43:35.557835+0700 osd.89 (osd.89) 9000 : cluster
>> > > >> [WRN] Large omap object found. Object:
>> > > >> 26:31ff4df1:::.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.726:head
>> > > >> PG: 26.8fb2ff8c (26.78c) Key count: 236495 Size (bytes): 95560458
>> > > >>
>> > > >> I tried to deep-scrub the affected pgs and the osds mentioned in
>> > > >> the log, but it didn't help.
>> > > >> Why? What am I missing?
>> > > >>
>> > > >> Thank you in advance for your help.
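One note on the deep scrubs: the large-omap warning is only
re-evaluated when the PG holding the flagged object is deep-scrubbed,
and it only clears once the key count has actually dropped below the
threshold (200k keys by default), so scrubbing alone won't help while
the keys remain. A minimal sketch for re-checking one object, assuming
the pool and object names from this thread:

    POOL=default.rgw.buckets.index   # placeholder; use your index pool
    OBJ=.dir.9213182a-14ba-48ad-bde9-289a1c0c0de8.2479481907.1.151

    # map the object to its pg (e.g. 26.427), then deep-scrub that pg
    ceph osd map "$POOL" "$OBJ"
    ceph pg deep-scrub 26.427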
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx