Hello,

I'm testing multisite sync on Reef 18.2.2 (cephadm, Ubuntu 22.04). Right now I'm testing a symmetrical sync policy that makes a backup to a read-only zone. My sync policy allows replication, and I enable replication per bucket via put-bucket-replication.
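For context, the policy and the per-bucket replication are set up roughly like this; the group/flow/pipe IDs, zone names, bucket name and endpoint below are placeholders rather than the exact values from my cluster:

  # zonegroup-level sync policy: allow replication, symmetrical flow between the two zones
  radosgw-admin sync group create --group-id=group1 --status=allowed
  radosgw-admin sync group flow create --group-id=group1 --flow-id=flow1 \
      --flow-type=symmetrical --zones=zone1,zone2
  radosgw-admin sync group pipe create --group-id=group1 --pipe-id=pipe1 \
      --source-zones='*' --source-bucket='*' --dest-zones='*' --dest-bucket='*'
  radosgw-admin period update --commit

  # per bucket: enable replication through the S3 API
  aws --endpoint-url=http://<rgw-endpoint> s3api put-bucket-replication \
      --bucket bucket6 \
      --replication-configuration '{"Role": "", "Rules": [{"ID": "rule1",
        "Status": "Enabled", "Prefix": "",
        "Destination": {"Bucket": "arn:aws:s3:::bucket6"}}]}'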
My multisite setup fails at a seemingly basic operation. My test looks like this:

1. create a bucket
2. upload some data to the bucket
3. wait for replication to copy some of the data
4. run `rclone purge` on the bucket in the master zone while replication is still in progress; all data and the bucket itself are deleted
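Roughly, the test boils down to the following commands (the rclone remote name, bucket name and timing are placeholders, not my exact script):

  # steps 1-2: create the bucket and upload some test data in the master zone
  rclone mkdir master:bucket6
  rclone copy ./testdata master:bucket6
  # step 3: let data sync start copying objects to the other zone
  sleep 60
  # step 4: delete all objects and the bucket itself while sync is still running
  rclone purge master:bucket6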
I've tested this against both a normal secondary zone and an archive zone. It seems that the bucket is deleted so quickly that replication gets stuck. The buckets are gone from both zones, but the data sync shards still try to replicate them.
Example of a recovering shard:

{
  "shard_id": 100,
  "marker": {
    "status": "full-sync",
    "marker": "",
    "next_step_marker": "",
    "total_entries": 0,
    "pos": 0,
    "timestamp": "0.000000"
  },
  "pending_buckets": [
    "bucket6:58642236-4f66-46f5-b863-1d6a8667c4c3.61059.5:9",
    "bucket6:58642236-4f66-46f5-b863-1d6a8667c4c3.61059.7:9"
  ],
  "recovering_buckets": [
    "bucket6:58642236-4f66-46f5-b863-1d6a8667c4c3.61059.7:9[0]"
  ],
  "current_time": "2024-07-17T13:23:11Z"
}

In this case there are two pending buckets because I've reused the bucket name.
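For reference, a per-shard view like the one above can be pulled with something along these lines (the source zone name is a placeholder):

  # overall picture, lists shards that are behind or recovering
  radosgw-admin sync status
  # per-shard detail as shown above
  radosgw-admin data sync status --source-zone=<source-zone> --shard-id=100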
The only semi-automatic workaround I've found is to recreate a bucket with the same name and wait for the recovering shards to disappear.
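In practice that means something like this (endpoint, bucket and zone names are placeholders):

  # recreate the deleted bucket under the same name so the stuck shard can make progress
  aws --endpoint-url=http://<rgw-endpoint> s3api create-bucket --bucket bucket6
  # then poll the shard until pending_buckets / recovering_buckets empty out
  radosgw-admin data sync status --source-zone=<source-zone> --shard-id=100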
Is there any way to make Ceph clean up these stuck shards automatically?

Best regards,
Adam Prycki