Re: large bucket index in multisite environment (how to deal with large omap objects warning)?


 



1k is a bit rough, no? Even if you store 1k objects on SSD, the min_alloc size will still be 4k, so each of them will occupy 4k on disk.
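
For reference, this is how one could check the allocation size an OSD is actually using (a sketch; the osd id is a placeholder, and note that min_alloc_size is baked in when the OSD is created, so old OSDs may differ from the current config value):

  ceph config get osd bluestore_min_alloc_size_ssd
  ceph daemon osd.0 config get bluestore_min_alloc_size_ssd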

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------

On 2021. Nov 10., at 10:53, Boris Behrens <bb@xxxxxxxxx> wrote:


I am just creating a bucket with a lot of files to test it. Who would have thought that uploading a million 1k files would take days?
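
(In case anyone wants to reproduce: a rough sketch of how such a test load could be generated in parallel, assuming s3cmd is already configured against the cluster; the bucket and object names are made up.)

  seq 1 1000000 | xargs -P 32 -I{} sh -c \
    'head -c 1024 /dev/urandom > /tmp/obj.{} && s3cmd put /tmp/obj.{} s3://testbucket/obj-{} >/dev/null && rm /tmp/obj.{}'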

On Tue, Nov 9, 2021 at 00:50, prosergey07 <prosergey07@xxxxxxxxx> wrote:
When resharding is performed, I believe it is treated as a bucket operation and updates the bucket stats: new bucket shards are created, which can increase the object count reported in the stats.
 If it was broken during resharding, you could check the current bucket id from:
 radosgw-admin metadata get "bucket:BUCKET_NAME".

That would give you an idea of which bucket index objects to keep.

 Then you could remove the corrupted bucket shards (not the ones with the bucket id from the previous command), i.e. the .dir.corrupted_bucket_index.SHARD_NUM objects, from the bucket index pool:

rados -p bucket.index rm .dir.corrupted_bucket_index.SHARD_NUM

where SHARD_NUM is the shard number you want to delete.

 Then run "radosgw-admin bucket check --fix --bucket=BUCKET_NAME".

 That should resolve your issue with the object count.
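
(Putting those steps together as a sketch; the pool name and ids are placeholders, and in a real deployment the index pool is usually named something like <zone>.rgw.buckets.index:

  # 1. find the live bucket instance id
  radosgw-admin metadata get bucket:BUCKET_NAME
  # 2. list the index objects and spot shards whose instance id does not match it
  rados -p default.rgw.buckets.index ls | grep '^\.dir\.'
  # 3. remove ONLY the stale shards, never the ones carrying the live id
  rados -p default.rgw.buckets.index rm .dir.STALE_INSTANCE_ID.SHARD_NUM
  # 4. rebuild the index and stats
  radosgw-admin bucket check --fix --bucket=BUCKET_NAME
)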

 As for slow object deletion: do you run your RGW metadata pools, specifically the bucket.index pool, on NVMe drives? The problem is that you have a lot of objects and probably not enough shards. Radosgw retrieves the list of objects from the bucket index and, if I remember correctly, it retrieves them as an ordered list, which is a very expensive operation. Hence a fair amount of time may be spent just on getting the object list.

 We see around 1000 objects per second deleted inside our storage.
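
(For comparison, bulk deletes through the S3 API parallelize well; a sketch with rclone, which also comes up later in this thread. The remote name is a placeholder and the flag semantics vary a bit between rclone versions:

  rclone delete ceph:big-bucket --transfers 64 --checkers 64 --fast-list
)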


I would not recommend using "--inconsistent-index", to avoid further consistency issues.




Sent from my Galaxy device


-------- Original message --------
From: mhnx <morphinwithyou@xxxxxxxxx>
Date: 08.11.21 13:28 (GMT+02:00)
To: Sergey Protsun <prosergey07@xxxxxxxxx>
Cc: "Szabo, Istvan (Agoda)" <Istvan.Szabo@xxxxxxxxx>, Boris Behrens <bb@xxxxxxxxx>, Ceph Users <ceph-users@xxxxxxx>
Subject: Re:  Re: large bucket index in multisite environment (how to deal with large omap objects warning)?

(There should not be any issues using rgw for other buckets while re-sharding.)
If there are issues, then disabling access to the bucket should work, right? Sync should be disabled as well.
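
(For the sync part, radosgw-admin has per-bucket sync toggles in recent releases; a sketch, with the bucket name as a placeholder:

  radosgw-admin bucket sync disable --bucket=BUCKET_NAME
  # ... reshard ...
  radosgw-admin bucket sync enable --bucket=BUCKET_NAME
)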

Yes, after a manual reshard it should clear the leftovers, but in my situation the resharding failed and I got double entries for that bucket.
I didn't push further; instead I divided the bucket into new buckets and reduced the object count with a new bucket tree. I copied all of the objects with rclone and started removing the bucket with "radosgw-admin bucket rm --bucket=mybucket --bypass-gc --purge-objects --max-concurrent-ios=128". It has been running for a very long time (started on Sep 08) and it is still working. There were 250M objects in that bucket, and after the manual reshard failed I saw 500M objects when checking num_objects in bucket stats. Now I have:
"size_kb": 10648067645,
"num_objects": 132270190

The removal speed is 50-60 objects per second. It's not the cluster's speed; the cluster is fine.
I have space, so I let it run. When I see a stable object count I will stop the removal process and start again with the "--inconsistent-index" parameter.
I wonder: is it safe to use that parameter with referenced objects? I want to understand how "--inconsistent-index" works and what it does.

Sergey Protsun <prosergey07@xxxxxxxxx> wrote on Fri, Nov 5, 2021 at 17:46:
There should not be any issues using rgw for other buckets while re-sharding.

As for the object count doubling after the reshard: that is an interesting situation. After a manual reshard is done, there might be leftovers from the old bucket index, since new .dir.new_bucket_index objects are created during the reshard. They contain all the metadata for the objects stored in the buckets.data pool. I wonder whether the doubled object count was caused by the old bucket index; if so, it is safe to delete the old index.
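
(Newer Ceph releases can find such leftovers for you; I believe these commands exist since Nautilus:

  radosgw-admin reshard stale-instances list
  radosgw-admin reshard stale-instances rm
)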

 In a perfect world, it would be ideal to know the eventual number of objects in the bucket and set the number of shards accordingly from the start.

 In the real world, when the client re-purposes the bucket, we have to deal with reshards.

On Fri, Nov 5, 2021 at 14:43, mhnx <morphinwithyou@xxxxxxxxx> wrote:
I also use this method and I hate it.

Stopping all of the RGW clients is never an option! It shouldn't be.
Sharding is hell. I had 250M objects in a bucket, the reshard failed after 2
days, and the object count somehow doubled! 2 days of downtime is not an
option.

I wonder: if I stop reads and writes on a bucket while resharding it, is
there any problem using RGW with all the other buckets?

Nowadays I advise splitting buckets as much as you can! That means changing
your app's directory tree, but this design requires it.
You need to plan the object count at least 5 years ahead and create the
buckets accordingly.
Usually I use 101 shards, which covers 10,100,000 objects.
If I need versioning, I use 2x101 or 3x101 shards, because versions are
hard to predict. You need to estimate how many versions you will keep and
set a lifecycle policy even before using the bucket!
The max shard count I use is 1999. I'm not happy about it, but sometimes you
gotta do what you need to do.
Fighting with customers is not an option; you can only advise changing
their app's folder tree, but I've never seen anyone accept the deal without
arguing.
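
(The arithmetic behind those numbers, as a sketch: the 10,100,000 figure assumes the default rgw_max_objs_per_shard of 100000 objects per shard. To size a new bucket one could compute the shard count like this, using coreutils' factor to bump the result to a prime, as suggested further down the thread:

  objs=250000000        # expected object count over the bucket's lifetime
  per_shard=100000      # default rgw_max_objs_per_shard
  shards=$(( (objs + per_shard - 1) / per_shard ))
  # factor prints "N: p1 p2 ...", so a prime yields exactly two fields
  while [ "$(factor "$shards" | awk '{print NF}')" -ne 2 ]; do
      shards=$((shards + 1))
  done
  echo "use $shards shards"
)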

My offers usually look like this:
1- Core files bucket: no changes, or very limited changes. Calculate the
object count and multiply by 2.
2- Hot data bucket: daily changes and versioning. Calculate the object
count and multiply by 3.
3- Cold data bucket[s]: no daily changes. You should open new buckets every
year or month. This keeps things clean and steady. No need for versioning,
and multisite will not suffer, since changes are rare.
4- Temp files bucket[s]: this is very important. If you're crawling millions
upon millions of objects every day and deleting them at the end of the week
or month, then you should definitely use a temp bucket. No versioning, no
multisite, no index if possible.



Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx> wrote on Fri, Nov 5, 2021
at 12:30:

> You mean prepare or reshard?
> Prepare:
> I collect as much information for the users before onboarding so I can
> prepare for their use case in the future and set things up.
>
> Preshard:
> After creating the bucket:
> radosgw-admin bucket reshard --bucket=ex-bucket --num-shards=101
>
> Also when you shard the buckets, you need to use prime numbers.
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---------------------------------------------------
> Agoda Services Co., Ltd.
> e: istvan.szabo@xxxxxxxxx
> ---------------------------------------------------
>
> From: Boris Behrens <bb@xxxxxxxxx>
> Sent: Friday, November 5, 2021 4:22 PM
> To: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>; ceph-users@xxxxxxx
> Subject: Re:  large bucket index in multisite environment
> (how to deal with large omap objects warning)?
>
> Cheers Istvan,
>
> how do you do this?
>
> On Thu, Nov 4, 2021 at 19:45, Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx> wrote:
> This one you need to prepare for: you need to preshard any bucket which you
> know will hold millions of objects.
>
> I have a bucket where we store 1.2 billion objects with 24xxx shards.
> No omap issue.
> Istvan Szabo
> Senior Infrastructure Engineer
> ---------------------------------------------------
> Agoda Services Co., Ltd.
> e: istvan.szabo@xxxxxxxxx
> ---------------------------------------------------
>
>
>
> --
> The self-help group "UTF-8 problems" is meeting in the large hall this
> time, as an exception.
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


--
The self-help group "UTF-8 problems" is meeting in the large hall this time, as an exception.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx





