Re: RGW multisite sync data sync shard stuck

Hi Andreas.
 
Well, we do _NOT_ need multisite in our environment, but unfortunately it is the basis for the announced "metasearch" feature, which builds on ElasticSearch... so we have been trying to implement a "multisite" config on Kraken (v11.2.0) for weeks, but have never succeeded so far. We have purged everything and started over with the multisite config about five times by now.

We have one CEPH cluster with two RadosGWs on top (so NOT two CEPH clusters!), not sure if this makes a difference!?
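
For what it's worth, this is roughly the documented sequence we have been following for our single-cluster, two-zone setup. Take it as a sketch only; the realm/zonegroup/zone names, endpoints and keys below are placeholders, not a known-good recipe:

  radosgw-admin realm create --rgw-realm=myrealm --default
  radosgw-admin zonegroup create --rgw-zonegroup=mygroup \
      --endpoints=http://rgw1:8080 --master --default
  radosgw-admin zone create --rgw-zonegroup=mygroup --rgw-zone=zone-a \
      --endpoints=http://rgw1:8080 --master --default \
      --access-key=SYSTEM_KEY --secret=SYSTEM_SECRET
  radosgw-admin user create --uid=sync-user --display-name="Sync User" \
      --system --access-key=SYSTEM_KEY --secret=SYSTEM_SECRET
  radosgw-admin period update --commit

  # second zone in the same cluster, served by the second radosgw
  radosgw-admin zone create --rgw-zonegroup=mygroup --rgw-zone=zone-b \
      --endpoints=http://rgw2:8080 \
      --access-key=SYSTEM_KEY --secret=SYSTEM_SECRET
  radosgw-admin period update --commit

Each radosgw then gets its own rgw_zone (zone-a and zone-b respectively) in its ceph.conf section and is restarted.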

Can you please share some info about your (formerly working?!?) setup? Like
- which CEPH version you are on
- old deprecated "federated" or "new from Jewel" multisite setup
- one or multiple CEPH clusters

Great to see that multisite seems to work somehow somewhere. We were really in doubt :O

Thanks & regards
 Anton

P.S.: If someone reads this, who has a working "one Kraken CEPH cluster" based multisite setup (or, let me dream, even a working ElasticSearch setup :| ) please step out of the dark and enlighten us :O
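
P.P.S.: For completeness, what we understand the metasearch part to look like on Kraken is a third zone with the "elasticsearch" tier type, roughly as below. Zone name, endpoints and the ES URL are placeholders and we have not even gotten that far yet, so corrections are very welcome:

  radosgw-admin zone create --rgw-zonegroup=mygroup --rgw-zone=zone-es \
      --endpoints=http://rgw3:8080 --tier-type=elasticsearch \
      --access-key=SYSTEM_KEY --secret=SYSTEM_SECRET
  radosgw-admin zone modify --rgw-zone=zone-es \
      --tier-config=endpoint=http://elasticsearch:9200,num_shards=10
  radosgw-admin period update --commit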

Sent: Tuesday, 30 May 2017 at 11:02
From: "Andreas Calminder" <andreas.calminder@xxxxxxxxxx>
To: ceph-users@xxxxxxxxxxxxxx
Subject: RGW multisite sync data sync shard stuck
Hello,
I've got a sync issue with my multisite setup. There are 2 zones in 1
zone group in 1 realm. Data sync in the non-master zone has been stuck
on "incremental sync is behind by 1 shard"; this wasn't noticed until
the radosgw instances in the master zone started dying from
out-of-memory issues. All radosgw instances in the non-master zone were
then shut down to keep services in the master zone available while
trying to troubleshoot the issue.

From the rgw logs in the master zone I see entries like:

2017-05-29 16:10:34.717988 7fbbc1ffb700 0 ERROR: failed to sync object: 12354/BUCKETNAME:be8fa19b-ad79-4cd8-ac7b-1e14fdc882f6.2374181.27/dirname_1/dirname_2/filename_1.ext
2017-05-29 16:10:34.718016 7fbbc1ffb700 0 ERROR: failed to sync object: 12354/BUCKETNAME:be8fa19b-ad79-4cd8-ac7b-1e14fdc882f6.2374181.27/dirname_1/dirname_2/filename_2.ext
2017-05-29 16:10:34.718504 7fbbc1ffb700 0 ERROR: failed to fetch remote data log info: ret=-5
2017-05-29 16:10:34.719443 7fbbc1ffb700 0 ERROR: a sync operation returned error
2017-05-29 16:10:34.720291 7fbc167f4700 0 store->fetch_remote_obj() returned r=-5
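
I'm assuming the same failures are also recorded in the sync error log and can be listed with the command below (I believe that subcommand exists in Jewel and later, but please correct me if not):

  radosgw-admin sync error list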

sync status in the non-master zone reports that metadata is in sync
and that data sync is behind on 1 shard, with the oldest incremental
change not applied being about 2 weeks old.

I'm not quite sure how to proceed. Is there a way to find out the id
of the stuck shard and force some kind of re-sync of its data from the
master zone? I'm unable to keep the non-master zone rgw's running
because it leaves the master zone in a bad state, with rgw dying every
now and then.
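
To make the question more concrete, this is the kind of thing I had in mind, but I don't know whether it's correct or even safe to run against a production zone (the source zone name and the shard id below are placeholders):

  # 'sync status' should list which data shards are behind;
  # 'data sync status' can then show detail for a single shard
  radosgw-admin sync status
  radosgw-admin data sync status --source-zone=zone-master --shard-id=N

  # full re-init of data sync from the master zone -- is this the right hammer?
  radosgw-admin data sync init --source-zone=zone-master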

Regards,
Andreas
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



