Andreas, did you find a solution to your multisite sync issues with the stuck shards? I'm also on 10.2.7 and having this problem. One realm has stuck shards for data sync, and another realm says it's up to date but isn't receiving new users via metadata sync. I ran metadata sync init on it, and all the metadata was up to date when it finished, but new users still weren't synced afterwards. I don't know what to do to get these working stably. There are two RGWs for each realm in each zone, in master/master, allowing data to sync in both directions.
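
In case it helps to compare notes, these are the radosgw-admin commands I've been using to inspect the sync state (ZONE is just a placeholder for the peer zone's name; I believe all of these subcommands are present in 10.2.x, but please double-check against radosgw-admin help on your build):

    # overall sync summary as seen from the zone that looks stuck
    radosgw-admin sync status

    # per-source-zone data sync state, shows which shards are behind
    radosgw-admin data sync status --source-zone=ZONE

    # errors recorded by the sync machinery
    radosgw-admin sync error list

    # metadata sync state on the non-master zone
    radosgw-admin metadata sync status

    # full re-init of data sync from the peer zone (analogous to
    # metadata sync init) -- I'm not sure whether this is safe to run
    # while the RGWs are up, so I haven't tried it
    radosgw-admin data sync init --source-zone=ZONE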
On Mon, Jun 5, 2017 at 3:05 AM Andreas Calminder <andreas.calminder@xxxxxxxxxx> wrote:
Hello,
I'm using Ceph jewel (10.2.7) and, as far as I know, the jewel
multisite setup (multiple zones) as described here
http://docs.ceph.com/docs/master/radosgw/multisite/ with two ceph
clusters, one in each site. Stretching a single cluster over multiple
sites is seldom, if ever, worth the hassle in my opinion.

The reason the replication ended up in a bad state seems to be a mix
of issues. First, if you shove a lot of objects (1M+) into a single
bucket, the bucket index starts to drag the rados gateways down.
There's also some kind of memory leak in rgw when the sync has failed
(http://tracker.ceph.com/issues/19446), causing the rgw daemons to die
left and right from out-of-memory errors, and sometimes other parts of
the system get dragged down with them.
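
For reference, this is roughly how I keep an eye on bucket index size
and what we set to limit it for new buckets. Treat it as a sketch:
rgw_override_bucket_index_max_shards only applies to buckets created
after it is set, and in a multisite setup the zonegroup's
bucket_index_max_shards setting may be the right knob instead, so
check the docs for your version before copying it.

    # check how many objects a suspect bucket holds
    radosgw-admin bucket stats --bucket=BUCKETNAME

    # ceph.conf on the rgw nodes (section name is just an example);
    # spreads the index of newly created buckets across 16 shards
    [client.rgw.gateway-node]
    rgw override bucket index max shards = 16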
On 4 June 2017 at 22:22, <ceph.novice@xxxxxxxxxxxxxxxx> wrote:
> Hi Andreas.
>
> Well, we do _NOT_ need multisite in our environment, but unfortunately it is the basis for the announced "metasearch" based on ElasticSearch... so we have been trying to implement a "multisite" config on Kraken (v11.2.0) for weeks now, but have not succeeded so far. We have purged everything and started all over with the multisite config about five times by now.
>
> We have one CEPH cluster with two RadosGWs on top (so NOT two CEPH clusters!), not sure if this makes a difference!?
>
> Can you please share some info about your (formerly working?!?) setup? Like:
> - which CEPH version you are on
> - old deprecated "federated" or "new from Jewel" multisite setup
> - one or multiple CEPH clusters
>
> Great to see that multisite seems to work somehow somewhere. We were really in doubt :O
>
> Thanks & regards
> Anton
>
> P.S.: If someone reads this, who has a working "one Kraken CEPH cluster" based multisite setup (or, let me dream, even a working ElasticSearch setup :| ) please step out of the dark and enlighten us :O
>
> Sent: Tuesday, 30 May 2017 at 11:02
> From: "Andreas Calminder" <andreas.calminder@xxxxxxxxxx>
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: RGW multisite sync data sync shard stuck
> Hello,
> I've got a sync issue with my multisite setup. There are 2 zones in 1
> zone group in 1 realm. The data sync in the non-master zone has been
> stuck on "incremental sync is behind by 1 shard". This wasn't noticed
> until the radosgw instances in the master zone started dying from
> out-of-memory issues; all radosgw instances in the non-master zone
> were then shut down to keep services in the master zone running while
> troubleshooting the issue.
>
> From the rgw logs in the master zone I see entries like:
>
> 2017-05-29 16:10:34.717988 7fbbc1ffb700 0 ERROR: failed to sync
> object: 12354/BUCKETNAME:be8fa19b-ad79-4cd8-ac7b-1e14fdc882f6.2374181.27/dirname_1/dirname_2/filename_1.ext
> 2017-05-29 16:10:34.718016 7fbbc1ffb700 0 ERROR: failed to sync
> object: 12354/BUCKETNAME:be8fa19b-ad79-4cd8-ac7b-1e14fdc882f6.2374181.27/dirname_1/dirname_2/filename_2.ext
> 2017-05-29 16:10:34.718504 7fbbc1ffb700 0 ERROR: failed to fetch
> remote data log info: ret=-5
> 2017-05-29 16:10:34.719443 7fbbc1ffb700 0 ERROR: a sync operation
> returned error
> 2017-05-29 16:10:34.720291 7fbc167f4700 0 store->fetch_remote_obj()
> returned r=-5
>
> sync status in the non-master zone reports that the metadata is up to
> date, that the data sync is behind by 1 shard, and that the oldest
> incremental change not applied is about 2 weeks old.
>
> I'm not quite sure how to proceed. Is there a way to find out the id
> of the shard and force some kind of re-sync of its data from the
> master zone? I'm unable to keep the non-master zone rgw's running
> because it leaves the master zone in a bad state, with rgw dying
> every now and then.
>
> Regards,
> Andreas
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
--
Andreas Calminder
System Administrator
IT Operations Core Services
Klarna AB (publ)
Sveavägen 46, 111 34 Stockholm
Tel: +46 8 120 120 00
Reg no: 556737-0431
klarna.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com