Re: millions of hex 80 0_0000 omap keys in single index shard for single bucket

On Thu, Sep 21, 2023 at 12:21 PM Christopher Durham <caduceus42@xxxxxxx> wrote:
>
>
> Hi Casey,
>
> This is indeed a multisite setup. The other side shows that for
>
> # radosgw-admin sync status
>
> the oldest incremental change not applied is about a minute old, and that is consistent over a number of minutes: the oldest unapplied change is always only a minute or two old.
>
> However:
>
> # radosgw-admin bucket sync status --bucket bucket-in-question
>
> shows a number of shards always behind, although it varies.
>
> The number of objects in that bucket is close on each side, and to this point I have attributed the difference to replication lag.
>
> One thing that came to mind is that the code that writes to, say, foo/bar/baz/objects ...
>
> will often delete those objects quickly after creating them. Perhaps replication to
> the other side doesn't occur before they are deleted? Could that contribute to this?

sync should handle object deletion just fine. it'll see '404 Not
Found' errors when it tries to replicate them, and just continue on to
the next object. that shouldn't cause bucket sync to get stuck.

>
> I'm not sure how this relates to the objects ending in '/', although they are in the same prefix hierarchy.
>
> To get out of this situation, do I need to:
>
> 1. radosgw-admin bucket sync init --bucket bucket-in-question on both sides?

'bucket sync init' clears the bucket's sync status, but nothing would
trigger rgw to restart the sync on it. you could try 'bucket sync run'
instead, though it's not especially reliable until the reef release, so
you may need to rerun the command several times before it catches up
completely. once the bucket sync catches up, the source zone's bilog
entries would be eligible for automatic trimming.
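
as a rough sketch (bucket name taken from your message; depending on your
release, 'bucket sync run' may also want an explicit --source-zone):

# radosgw-admin bucket sync init --bucket=bucket-in-question
# radosgw-admin bucket sync run --bucket=bucket-in-question

then re-check 'bucket sync status' until the shards report caught up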

> 2. manually delete the 0_0000 objects in rados? (yuk).

you can use the 'bilog trim' command on a bucket to delete its log
entries, but i'd only consider doing that if you're satisfied that all
of the objects you care about have already replicated.
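
for example, something like the following (a sketch only; on some releases
'bilog trim' also accepts --start-marker/--end-marker to bound what gets
removed):

# radosgw-admin bilog trim --bucket=bucket-in-question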

>
> I've done #1 before, when I had the other side of a multisite setup down for a while. That (a link going down between the sites) has not happened in the current situation.
>
> Thanks for anything you or others can offer.

for rgw multisite users in particular, i highly recommend trying out
the reef release. in addition to multisite resharding support, we made
a lot of improvements to multisite stability/reliability that we won't
be able to backport to pacific/quincy.

>
> -Chris
>
>
> On Wednesday, September 20, 2023 at 07:33:07 PM MDT, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>
>
> these keys starting with "<80>0_" appear to be replication log entries
> for multisite. can you confirm that this is a multisite setup? is the
> 'bucket sync status' mostly caught up on each zone? in a healthy
> multisite configuration, these log entries would eventually get
> trimmed automatically
>
> On Wed, Sep 20, 2023 at 7:08 PM Christopher Durham <caduceus42@xxxxxxx> wrote:
> >
> > I am using ceph 17.2.6 on Rocky 8.
> > I have a system that started giving me large omap object warnings.
> >
> > I tracked this down to a specific index shard for a single s3 bucket.
> >
> > rados -p <indexpool> listomapkeys .dir.<zoneid>.bucketid.nn.shardid
> > shows over 3 million keys for that shard. There are only about 2
> > million objects in the entire bucket according to a listing of the bucket
> > and radosgw-admin bucket stats --bucket bucketname. No other shard
> > has anywhere near this many index objects. Perhaps it should be noted that this
> > shard is the highest numbered shard for this bucket. For a bucket with
> > 16 shards, this is shard 15.
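> >
> > (per-shard key counts can be compared with a quick loop, a sketch using the
> > same placeholder names as above:
> > for s in $(seq 0 15); do echo -n "shard $s: "; rados -p <indexpool> listomapkeys .dir.<zoneid>.bucketid.nn.$s | wc -l; done )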
> >
> > If I look at the list of omap keys generated, there are *many*
> > beginning with "<80>0_0000", almost the entire set of the three-plus million
> > keys in the shard. These are index entries in the so-called 'ugly' namespace. The rest of the omap keys appear to be normal.
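> >
> > (the <80>-prefixed keys can be counted with something like the following,
> > assuming bash; grep -a treats the binary key stream as text:
> > rados -p <indexpool> listomapkeys .dir.<zoneid>.bucketid.nn.shardid | LC_ALL=C grep -ac $'\x80'0_ )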
> >
> > The 0_0000 after the <80> indicates some sort of 'bucket log index' according to src/cls/rgw/cls_rgw.cc.
> > However, using some sed magic previously discussed here, I ran:
> >
> > rados -p <indexpool> getomapval .dir.<zoneid>.bucketid.nn.shardid --omap-key-file /tmp/key.txt
> >
> > Where /tmp/key.txt contains only the funny <80>0_0000 key name, without a newline.
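> >
> > (one way to build such a key file, as a sketch: bash's printf understands \x
> > escapes, and <rest-of-key> is a placeholder for the remainder of the key after
> > the hex 80 byte:
> > printf '\x80%s' '<rest-of-key>' > /tmp/key.txt )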
> >
> > The output of this shows, in a hex dump, the object name to which the index
> > refers, which was at one time a valid object.
> >
> > However, that object no longer exists in the bucket, and based on expiration policy, was
> > previously deleted. Let's say, in the hex dump, that the object was:
> >
> > foo/bar/baz/object1.bin
> >
> > The prefix foo/bar/baz/ used to have 32 objects, say foo/bar/baz/{object1.bin, object2.bin, ... }
> > An s3api listing shows that those objects no longer exist (and that is OK, as they were previously deleted).
> > BUT, now, there is a weirdo object left in the bucket:
> >
> > foo/bar/baz/ <- with the slash at the end, and it is an object, not a PRE(fix).
> >
> > All objects under foo/ have a 3-day lifecycle expiration. If I wait (at most) 3 days, the weirdo object with '/'
> > at the end will be deleted, or I can delete it manually using aws s3api. But either way, the log index
> > entries, <80>0_0000.... remain.
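> >
> > (the manual delete via s3api would be something like:
> > aws s3api delete-object --bucket <bucket> --key foo/bar/baz/ )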
> >
> > The bucket in question is heavily used. But with over 3 million of these <80>0_0000 entries (and growing)
> > in a single shard, I am currently at a loss as to what to do or how to stop this from occurring.
> > I've poked around at a few other buckets and found a few others with this problem, but with only a few hundred <80>0_000.... index entries in a shard, nowhere near enough to cause the large omap warning that led me to this post.
> >
> > Any ideas?