Re: Radosgw bucket check fix doesn't do anything

Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx> · Fri, 20 Sep 2024 10:44:47 +0200 (CEST)

Hi Reid, 

Only the metadata / index side. "invalid_multipart_entries" refers to multipart index entries that no longer have a corresponding .meta index entry (the entry that lists all parts of a multipart upload).
The --fix option should have removed these multipart index entries from the bucket index and updated the header object with the newly calculated stats [1], but evidently it failed to do so.

You may be facing this bug [2]. If that's the case, then upgrading your cluster and running the tool again may help. Which version of Ceph is this, btw? 

In the past, we used the following procedure to manually remove orphaned multipart entries from the bucket index when they couldn't be removed otherwise because the rados data objects (multipart parts) were missing:

# Generate list of multipart objects 
aws s3api list-multipart-uploads --endpoint=https://s3.peta.univ-lorraine.fr:9443 --bucket $bucket_name > list-multipart-uploads.txt 

# Get bucket ID 
bucket_id=$(radosgw-admin bucket stats --bucket=$bucket_name | grep '"id"' | cut -d '"' -f 4) 

# List all shards 
rados -p $index_pool_name ls | grep "$bucket_id" | sort -n -t '.' -k6 

# Get all shards with their list of omap keys 
mkdir "$bucket_id" 
for i in $(rados -p $index_pool_name ls | grep "$bucket_id"); do echo $i ; rados -p $index_pool_name listomapkeys $i > "${bucket_id}/${i}" ; done 

# Get all UploadId 
grep '"UploadId"' list-multipart-uploads.txt | cut -d '"' -f 4 > UploadIds.txt 

# Identify which UploadId belongs to which shard(s) 
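# (grep -F replaces the deprecated fgrep; -H keeps the "file:" prefix even if the bucket has a single shard)
grep -HFf UploadIds.txt ${bucket_id}/.dir.${bucket_id}* | sed -e "s/^${bucket_id}\///g" > UploadId-to-shard.txt 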

# Cleanup these entries 
while IFS=':' read -r object key ; do echo "Removing Key ${key}" ; rados -p ${index_pool_name} rmomapkey "${object}" "${key}" ; done < UploadId-to-shard.txt > rmomapkey.log 
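
Since rmomapkey is destructive, it may be worth saving each omap value before removing its key. Below is a minimal sketch of that idea, not something we ran as part of the procedure above. The function name backup_omap_values, the output directory and the index.tsv file are hypothetical; it relies only on 'rados getomapval <object> <key> <out-file>' and reads the same "object:key" lines as UploadId-to-shard.txt.

```shell
#!/usr/bin/env bash
# Sketch only: backup_omap_values and its output layout are illustrative names.

# Save the value of every "<shard object>:<omap key>" pair listed in a mapping
# file, so the entries can be inspected later if a removal turns out to be wrong.
backup_omap_values() {
    local pool="$1" mapping="$2" outdir="$3"
    local object key n=0
    mkdir -p "$outdir" || return 1
    while IFS=':' read -r object key; do
        # Skip blank or malformed lines.
        [ -n "$object" ] && [ -n "$key" ] || continue
        n=$((n + 1))
        # Keys contain '/', so number the value files and record the mapping
        # (number, shard object, key) in a tab-separated index.
        rados -p "$pool" getomapval "$object" "$key" "${outdir}/${n}.val" \
            > /dev/null < /dev/null || return 1
        printf '%s\t%s\t%s\n' "$n" "$object" "$key" >> "${outdir}/index.tsv"
    done < "$mapping"
    # Report how many values were saved.
    echo "$n"
}

# Example (assumed variable names from the procedure above):
#   backup_omap_values "$index_pool_name" UploadId-to-shard.txt omap-backup
```

The function stops at the first failed read, so a non-zero exit status means the backup is incomplete and the removal loop should not be run yet.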

The difference with your case is that we could list them with 'aws s3api list-multipart-uploads', but maybe you can identify the omap keys to remove based on the 'invalid_multipart_entries' list. 
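
A rough sketch of how that matching could look, untested against a real cluster. It assumes that the strings reported under 'invalid_multipart_entries' are identical to the omap keys in the shard dumps (worth verifying on one entry first), and that the 'bucket check' output was saved pretty-printed with one entry per line, as in your paste. The function and file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: function names and file names are illustrative, not part of Ceph.

# Print every string listed under "invalid_multipart_entries" in a saved,
# pretty-printed `radosgw-admin bucket check` output (one entry per line).
extract_invalid_entries() {
    awk '
        /"invalid_multipart_entries"[[:space:]]*:/ {
            # An empty list closes on the same line; nothing to collect then.
            if ($0 !~ /\]/) in_list = 1
            next
        }
        in_list && /^[[:space:]]*\]/ { exit }
        in_list && match($0, /"[^"]*"/) {
            print substr($0, RSTART + 1, RLENGTH - 2)
        }
    ' "$1"
}

# Print "<shard object>:<omap key>" for each entry that appears verbatim as a
# key in the per-shard dumps written by the listomapkeys loop.
map_entries_to_shards() {
    local entries="$1" dump_dir="$2" entries_abs
    # Resolve the entries file to an absolute path before changing directory.
    entries_abs="$(cd "$(dirname "$entries")" && pwd)/$(basename "$entries")"
    # -F: literal strings, -x: whole-key match only, -H: always print the shard name.
    ( cd "$dump_dir" && grep -H -F -x -f "$entries_abs" -- .dir.* )
}

# Example (assumed file names):
#   radosgw-admin bucket check --bucket "$bucket_name" > check-output.json
#   extract_invalid_entries check-output.json > invalid-entries.txt
#   map_entries_to_shards invalid-entries.txt "$bucket_id" > invalid-to-shard.txt
```

The resulting file has the same "object:key" layout as UploadId-to-shard.txt, so it could be fed to the same rmomapkey loop. The whole-line match (-x) is deliberate: a substring match could also hit keys of healthy uploads that share a prefix.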

Besides, after cleaning up the index, you may want to run rgw-orphan-list [3] to identify any orphaned multipart objects left in the data pool and remove them with a 'rados rm' command. 
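
For the removal step, something along these lines could work once you have reviewed the list by hand. It is only a sketch: remove_listed_objects is a hypothetical helper, it assumes a plain text file with one rados object name per line, and it defaults to a dry run so nothing is deleted unless "delete" is passed explicitly.

```shell
#!/usr/bin/env bash
# Sketch only: remove_listed_objects is an illustrative name, not a Ceph tool.

# Walk a file of rados object names (one per line) and either print what would
# be removed (default) or remove each object from the given pool.
remove_listed_objects() {
    local pool="$1" list="$2" mode="${3:-dry-run}"
    local obj
    while IFS= read -r obj; do
        # Skip empty lines.
        [ -n "$obj" ] || continue
        if [ "$mode" = "delete" ]; then
            # Keep going on failure, but report the object on stderr.
            rados -p "$pool" rm "$obj" < /dev/null || echo "failed: $obj" >&2
        else
            printf 'would remove: %s\n' "$obj"
        fi
    done < "$list"
}

# Example (assumed pool variable and list file name):
#   remove_listed_objects "$data_pool_name" orphans-reviewed.txt          # dry run
#   remove_listed_objects "$data_pool_name" orphans-reviewed.txt delete   # really remove
```

On a bucket that is still being written to, objects belonging to uploads in progress may show up in the list, so the dry-run output is worth checking before the second call.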

Good luck, 
Frédéric. 

[1] https://www.ibm.com/docs/en/storage-ceph/7?topic=management-managing-bucket-index-entries 
[2] https://tracker.ceph.com/issues/53874 
[3] https://access.redhat.com/solutions/4544621 

----- On 19 Sep 24, at 16:34, Reid Guyett <reid.guyett@xxxxxxxxx> wrote:

> Hi,

> I didn't notice any changes in the counts after running the check --fix | check
> --check-objects --fix. Also the bucket isn't versioned.

> I will take a look at the index vs the radoslist. Which side would cause the
> 'invalid_multipart_entries'?

> Thanks

> On Thu, Sep 19, 2024 at 5:50 AM Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
> wrote:

>> Oh, by the way, since 35470 is near two times 18k, couldn't it be that the
>> source bucket is versioned and the destination bucket only got the most recent
>> copy of each object?

>> Regards,
>> Frédéric.

>> ----- On 18 Sep 24, at 20:39, Reid Guyett <reid.guyett@xxxxxxxxx> wrote:

>>> Hi Frederic,
>>> Thanks for those notes.

>>> When I scan the list of multiparts, I do not see the items from the invalid
>>> multipart list. Example:
>>> The first entry here

>>>> radosgw-admin bucket check --bucket mimir-prod | head
>>>> {
>>>> "invalid_multipart_entries": [
>>>> "_multipart_network/01H9CFRA45MJWBHQRCHRR4JHV4/index.sJRTCoqiZvlge2cjz6gLU7DwuLI468zo.2",
>>>> ...

>>> does not appear in the abort-multipart-upload.txt I get from the
>>> list-multipart-uploads

>>>> $ grep -c 01H9CFRA45MJWBHQRCHRR4JHV4 abort-multipart-upload.txt
>>>> 0

>>> If I try to abort the invalid multipart, it says it does not exist.

>>>> $ aws --profile mimir-prod --endpoint-url https://my.objectstorage.domain
>>>> s3api abort-multipart-upload --bucket mimir-prod
>>>> --key "network/01H9CFRA45MJWBHQRCHRR4JHV4/index"
>>>> --upload-id "sJRTCoqiZvlge2cjz6gLU7DwuLI468zo.2"

>>>> An error occurred (NoSuchUpload) when calling the AbortMultipartUpload
>>>> operation: Unknown

>>> I seem to have many buckets with this type of state. I'm hoping to be able to
>>> fix them.

>>> Thanks!

>>> On Wed, Sep 18, 2024 at 4:21 AM Frédéric Nass <frederic.nass@xxxxxxxxxxxxxxxx>
>>> wrote:

>>>> Hi Reid,

>>>> The bucket check --fix will not clean up aborted multipart uploads. An S3 client
>>>> will.

>>>> You need to either set a Lifecycle policy on buckets to have these cleaned up
>>>> automatically after some time

>>>> ~/ cat /home/lifecycle.xml
>>>> <LifecycleConfiguration>
>>>> <Rule>
>>>> <AbortIncompleteMultipartUpload>
>>>> <DaysAfterInitiation>3</DaysAfterInitiation>
>>>> </AbortIncompleteMultipartUpload>
>>>> <Prefix></Prefix>
>>>> <Status>Enabled</Status>
>>>> </Rule>
>>>> </LifecycleConfiguration>

>>>> ~/ s3cmd setlifecycle lifecycle.xml s3://bucket-test

>>>> Or get rid of them manually by using an s3 client

>>>> ~/ aws s3api list-multipart-uploads --endpoint=https://my.objectstorage.domain
>>>> --bucket mimir-prod
>>>> | jq -r '.Uploads[] | "--key \"\(.Key)\" --upload-id \(.UploadId)"'
>>>> > abort-multipart-upload.txt

>>>> ~/ max=$(cat abort-multipart-upload.txt | wc -l); i=1; while read -r line; do
>>>> echo -n "$i/$max"; ((i=i+1)); eval "aws s3api abort-multipart-upload
>>>> --endpoint=https://my.objectstorage.domain --bucket mimir-prod $line"; done <
>>>> abort-multipart-upload.txt

>>>> Regards,
>>>> Frédéric.

>>>> ----- On 17 Sep 24, at 14:27, Reid Guyett <reid.guyett@xxxxxxxxx> wrote:

>>>> > Hello,

>>>> > I recently moved a bucket from 1 cluster to another cluster using rclone. I
>>>> > noticed that the source bucket had around 35k objects and the destination
>>>> > bucket only had around 18k objects after the sync was completed.

>>>> > Source bucket stats showed:

>>>> >> radosgw-admin bucket stats --bucket mimir-prod | jq .usage
>>>> >> {
>>>> >> "rgw.main": {
>>>> >> "size": 4321515978174,
>>>> >> "size_actual": 4321552605184,
>>>> >> "size_utilized": 4321515978174,
>>>> >> "size_kb": 4220230448,
>>>> >> "size_kb_actual": 4220266216,
>>>> >> "size_kb_utilized": 4220230448,
>>>> >> "num_objects": 35470
>>>> >> },
>>>> >> "rgw.multimeta": {
>>>> >> "size": 0,
>>>> >> "size_actual": 0,
>>>> >> "size_utilized": 66609,
>>>> >> "size_kb": 0,
>>>> >> "size_kb_actual": 0,
>>>> >> "size_kb_utilized": 66,
>>>> >> "num_objects": 2467
>>>> >> }
>>>> >> }

>>>> > Destination bucket stats showed:

>>>> >> radosgw-admin bucket stats --bucket mimir-prod | jq .usage
>>>> >> {
>>>> >> "rgw.main": {
>>>> >> "size": 4068176326491,
>>>> >> "size_actual": 4068212576256,
>>>> >> "size_utilized": 4068176326491,
>>>> >> "size_kb": 3972828444,
>>>> >> "size_kb_actual": 3972863844,
>>>> >> "size_kb_utilized": 3972828444,
>>>> >> "num_objects": 18525
>>>> >> },
>>>> >> "rgw.multimeta": {
>>>> >> "size": 0,
>>>> >> "size_actual": 0,
>>>> >> "size_utilized": 108,
>>>> >> "size_kb": 0,
>>>> >> "size_kb_actual": 0,
>>>> >> "size_kb_utilized": 1,
>>>> >> "num_objects": 4
>>>> >> }
>>>> >> }

>>>> > When I checked the source bucket using aws cli tool it showed around 18k
>>>> > objects. The bucket was actively being used so the 18k is slightly
>>>> > different.

>>>> >> aws --profile mimir-prod --endpoint-url https://my.objectstorage.domain
>>>> >> s3api list-objects --bucket mimir-prod > mimir_objs
>>>> >> cat mimir_objs | grep -c "Key"
>>>> >> 18090

>>>> > I did a check on the source bucket and it showed a lot of invalid
>>>> > multipart objects.

>>>> >> radosgw-admin bucket check --bucket mimir-prod | head
>>>> >> {
>>>> >> "invalid_multipart_entries": [
>>>> >> "_multipart_network/01H9CFRA45MJWBHQRCHRR4JHV4/index.sJRTCoqiZvlge2cjz6gLU7DwuLI468zo.2",
>>>> >> "_multipart_network/01HMCCRMTC5F4BFCZ56BKHTMWQ/index.6ypGbeMr6Jg3y7xAL8yrLL-v4sbFzjSA.3",
>>>> >> "_multipart_network/01HMFKR56RRZNX9VT9B4F49MMD/chunks/000001.JIC7fFA_q96nal1yGXsVSPCY8EMe5AU8.2",
>>>> >> "_multipart_network/01HMFKSND2E5BWF6QVTX8SDRRQ/index.57aSNeXn3j70H4EHfbNCD2RpoOp-P1Bv.2",
>>>> >> "_multipart_network/01HMFKTDNA3FVSWW7N8KYY2C7N/chunks/000001.2~kRjRbLWWDf1e40P40LUzdU3f_x2P46Q.2",
>>>> >> "_multipart_network/01HMFTMA8J1DEXYHKMVCXCC0GM/chunks/000001.GVajdCja0gHOLlgyFanF72A4B6ZqUpu5.2",
>>>> >> "_multipart_network/01HMFTMA8J1DEXYHKMVCXCC0GM/chunks/000001.GYaouEePvEdbQosCb5jLFCAHrSm9VoDh.2",
>>>> >> "_multipart_network/01HMFTMA8J1DEXYHKMVCXCC0GM/chunks/000001.r4HkP-JK-rBAWDoXBXKJJYEAjk39AswW.1",
>>>> >> ...

>>>> > So I tried to run `radosgw-admin bucket check --check-objects --bucket
>>>> > mimir-prod --fix` and it showed that it was cleaning things with thousands
>>>> > of lines like

>>>> >> 2024-09-17T12:19:42.212+0000 7fea25b6f9c0 0 check_disk_state(): removing
>>>> >> manifest part from index:
>>>> >> mimir-prod:_multipart_tenant_prod/01J7Q778YXJXE23SRQZM9ZA4NH/chunks/000001.2~m6EI5fHFWxI-RmWB6TeFSupu7vVrCgh.2
>>>> >> 2024-09-17T12:19:42.212+0000 7fea25b6f9c0 0 check_disk_state(): removing
>>>> >> manifest part from index:
>>>> >> mimir-prod:_multipart_tenant_prod/01J7Q778YXJXE23SRQZM9ZA4NH/chunks/000001.2~m6EI5fHFWxI-RmWB6TeFSupu7vVrCgh.3
>>>> >> 2024-09-17T12:19:42.212+0000 7fea25b6f9c0 0 check_disk_state(): removing
>>>> >> manifest part from index:
>>>> >> mimir-prod:_multipart_tenant_prod/01J7Q778YXJXE23SRQZM9ZA4NH/chunks/000001.2~m6EI5fHFWxI-RmWB6TeFSupu7vVrCgh.4
>>>> >> 2024-09-17T12:19:42.212+0000 7fea25b6f9c0 0 check_disk_state(): removing
>>>> >> manifest part from index:
>>>> >> mimir-prod:_multipart_tenant_prod/01J7Q778YXJXE23SRQZM9ZA4NH/chunks/000001.2~m6EI5fHFWxI-RmWB6TeFSupu7vVrCgh.5
>>>> >> 2024-09-17T12:19:42.213+0000 7fea25b6f9c0 0 check_disk_state(): removing
>>>> >> manifest part from index:
>>>> >> mimir-prod:_multipart_tenant_prod/01J7Q778YXJXE23SRQZM9ZA4NH/chunks/000001.2~m6EI5fHFWxI-RmWB6TeFSupu7vVrCgh.6

>>>> > but the end result shows nothing has changed.

>>>> >> "check_result": {
>>>> >> "existing_header": {
>>>> >> "usage": {
>>>> >> "rgw.main": {
>>>> >> "size": 4281119287051,
>>>> >> "size_actual": 4281159110656,
>>>> >> "size_utilized": 4281119287051,
>>>> >> "size_kb": 4180780554,
>>>> >> "size_kb_actual": 4180819444,
>>>> >> "size_kb_utilized": 4180780554,
>>>> >> "num_objects": 36429
>>>> >> },
>>>> >> "rgw.multimeta": {
>>>> >> "size": 0,
>>>> >> "size_actual": 0,
>>>> >> "size_utilized": 66636,
>>>> >> "size_kb": 0,
>>>> >> "size_kb_actual": 0,
>>>> >> "size_kb_utilized": 66,
>>>> >> "num_objects": 2468
>>>> >> }
>>>> >> }
>>>> >> },
>>>> >> "calculated_header": {
>>>> >> "usage": {
>>>> >> "rgw.main": {
>>>> >> "size": 4281119287051,
>>>> >> "size_actual": 4281159110656,
>>>> >> "size_utilized": 4281119287051,
>>>> >> "size_kb": 4180780554,
>>>> >> "size_kb_actual": 4180819444,
>>>> >> "size_kb_utilized": 4180780554,
>>>> >> "num_objects": 36429
>>>> >> },
>>>> >> "rgw.multimeta": {
>>>> >> "size": 0,
>>>> >> "size_actual": 0,
>>>> >> "size_utilized": 66636,
>>>> >> "size_kb": 0,
>>>> >> "size_kb_actual": 0,
>>>> >> "size_kb_utilized": 66,
>>>> >> "num_objects": 2468
>>>> >> }
>>>> >> }
>>>> >> }
>>>> >> }

>>>> > Does this command do anything? Is it the wrong command for this issue? How
>>>> > does one go about fixing buckets in this state?

>>>> > Thanks!

>>>> > Reid
>>>> > _______________________________________________
>>>> > ceph-users mailing list -- ceph-users@xxxxxxx
>>>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx