Hi there!

Suspecting a replication problem between my two clusters, I ran a "radosgw-admin data sync init" on the secondary zone. Since then, after a lot of activity, I am stuck with recovering shards; nothing moves. Incremental sync still works. Wondering whether I also had a bad state on the primary, I then ran a data sync init on the primary as well... and now it is also stuck with recovering shards!

In the sync error list, I find some "failed to sync bucket instance: (125) Operation canceled" errors. I also tried to rewrite some of the buckets shown in those errors, but nothing changed. Strangely, the object names in those errors are not real objects, for example:

    "name": "replic_cfn_rec/cfb0047:aefd4003-1866-4b16-b1b3-2f308075cd1c.20298566.4:11[0]"

I wonder what that ending ":11[0]" means. I also tried to remove stale instances, but that changed nothing either.

I have not yet retried a data sync init on the secondary. Perhaps I should, but the activity it generates is quite impactful. Is there a way to reduce the priority of that resync activity?

Ah, one more thing: my primary cluster is on Reef 18.2.4, while the secondary is still on 18.2.2 (it needs an OS upgrade first, Ubuntu 18.04).

--
Gilles
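P.S. For reference, here is roughly the sequence of commands I ran. This is only a sketch: the zone, bucket, and option values below are placeholders, not my real ones:

    # on the secondary zone: restart a full data sync from the primary
    radosgw-admin data sync init --source-zone=primary
    # (gateways restarted afterwards so the init takes effect)

    # check progress and errors
    radosgw-admin sync status
    radosgw-admin sync error list

    # attempted fixes on buckets named in the errors
    radosgw-admin bucket rewrite --bucket=mybucket
    radosgw-admin reshard stale-instances list
    radosgw-admin reshard stale-instances rm

On the priority question: the only knobs I have spotted so far are the sync "spawn window" options, which, if I understand them correctly, bound how many concurrent sync operations the gateways spawn. I have not tried them yet, so take this as an assumption on my part:

    # assumption: lowering these (default 20) should throttle sync concurrency
    ceph config set client.rgw rgw_data_sync_spawn_window 4
    ceph config set client.rgw rgw_bucket_sync_spawn_window 4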