Re: RGW multisite sync data sync shard stuck

Hi David,
I never solved this issue as I couldn't figure out what was wrong. I just went ahead and removed the second site and will set up a new multisite whenever Luminous is out, hoping the weirdness has been sorted out by then.

Sorry I didn't have any good answers :/

/andreas

On 24 Aug 2017 20:35, "David Turner" <drakonstein@xxxxxxxxx> wrote:
Andreas, did you find a solution to your multisite sync issues with the stuck shards? I'm also on 10.2.7 and having this problem. One realm has stuck shards for data sync, and another realm says it's up to date but isn't receiving new users via metadata sync. I ran metadata sync init on it and it had all up-to-date metadata information when it finished, but then new users weren't synced again. I don't know what to do to get these working stably. There are 2 RGWs for each realm in each zone, in master/master, allowing data to sync in both directions.
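
For reference, the sync state on each side can be checked and nudged with something like the commands below (Jewel-era radosgw-admin; ZONE is a placeholder for the peer zone name, and the re-init variants are a sketch rather than a recipe, so use with care):

  # overall multisite sync state, run against the zone that looks stuck
  radosgw-admin sync status

  # metadata sync: inspect, then force a full re-init and run it again
  radosgw-admin metadata sync status
  radosgw-admin metadata sync init
  radosgw-admin metadata sync run

  # data sync against a specific source zone; a full re-init is the
  # heavy hammer if the shard markers never move
  radosgw-admin data sync status --source-zone=ZONE
  radosgw-admin data sync init --source-zone=ZONE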

On Mon, Jun 5, 2017 at 3:05 AM Andreas Calminder <andreas.calminder@xxxxxxxxxx> wrote:
Hello,
I'm using Ceph Jewel (10.2.7) and, as far as I know, the Jewel
multisite setup (multiple zones) as described here:
http://docs.ceph.com/docs/master/radosgw/multisite/ with two Ceph
clusters, one in each site. Stretching a single cluster over multiple
sites is seldom, if ever, worth the hassle in my opinion. The reason
the replication ended up in a bad state seems to be a mix of issues:
first, if you shove a lot of objects (1M+) into a bucket, the bucket
index starts to drag the rados gateways down; there is also some kind
of memory leak in rgw when the sync has failed
(http://tracker.ceph.com/issues/19446), causing the rgw daemons to die
left and right from out-of-memory errors, and sometimes other parts
of the system would be dragged down with them.
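
For reference, the per-bucket object counts that start to hurt show up
in the bucket stats, and index sharding for newly created buckets can
be turned on ahead of time, roughly like this (the bucket name and
shard count below are just examples, and in a multisite setup the
shard count may need to live in the zonegroup/zone config rather than
ceph.conf, so double-check before copying):

  # object count lives in the bucket stats output
  radosgw-admin bucket stats --bucket=BUCKETNAME

  # ceph.conf on the rgw hosts; only affects buckets created afterwards
  [global]
  rgw_override_bucket_index_max_shards = 16

Note this doesn't reshard existing buckets, so an already huge index
stays huge.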

On 4 June 2017 at 22:22,  <ceph.novice@xxxxxxxxxxxxxxxx> wrote:
> Hi Andreas.
>
> Well, we do _NOT_ need multisite in our environment, but unfortunately it is the basis for the announced "metasearch", based on ElasticSearch... so we have been trying to implement a "multisite" config on Kraken (v11.2.0) for weeks, but have never succeeded so far. We have purged everything and started all over with the multisite config about five times by now.
>
> We have one CEPH cluster with two RadosGWs on top (so NOT two CEPH clusters!), not sure if this makes a difference!?
>
> Can you please share some info about your (formerly working?!?) setup? Like
> - which CEPH version are you on
> - the old deprecated "federated" setup or the "new from Jewel" multisite setup
> - one or multiple CEPH clusters
>
> Great to see that multisite seems to work somehow somewhere. We were really in doubt :O
>
> Thanks & regards
>  Anton
>
> P.S.: If anyone reading this has a working "one Kraken CEPH cluster" based multisite setup (or, let me dream, even a working ElasticSearch setup :| ), please step out of the dark and enlighten us :O
>
> Sent: Tuesday, 30 May 2017 at 11:02
> From: "Andreas Calminder" <andreas.calminder@xxxxxxxxxx>
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: RGW multisite sync data sync shard stuck
> Hello,
> I've got a sync issue with my multisite setup. There are 2 zones in 1
> zone group in 1 realm. The data sync in the non-master zone has been
> stuck on "Incremental sync is behind by 1 shard"; this wasn't noticed
> until the radosgw instances in the master zone started dying from
> out-of-memory issues. All radosgw instances in the non-master zone
> were then shut down to protect services in the master zone while
> trying to troubleshoot the issue.
>
> From the rgw logs in the master zone I see entries like:
>
> 2017-05-29 16:10:34.717988 7fbbc1ffb700 0 ERROR: failed to sync
> object: 12354/BUCKETNAME:be8fa19b-ad79-4cd8-ac7b-1e14fdc882f6.2374181.27/dirname_1/dirname_2/filename_1.ext
> 2017-05-29 16:10:34.718016 7fbbc1ffb700 0 ERROR: failed to sync
> object: 12354/BUCKETNAME:be8fa19b-ad79-4cd8-ac7b-1e14fdc882f6.2374181.27/dirname_1/dirname_2/filename_2.ext
> 2017-05-29 16:10:34.718504 7fbbc1ffb700 0 ERROR: failed to fetch
> remote data log info: ret=-5
> 2017-05-29 16:10:34.719443 7fbbc1ffb700 0 ERROR: a sync operation
> returned error
> 2017-05-29 16:10:34.720291 7fbc167f4700 0 store->fetch_remote_obj()
> returned r=-5
>
> sync status in the non-master zone reports that the metadata is in
> sync and that the data sync is behind on 1 shard, with the oldest
> incremental change not applied being about 2 weeks back.
>
> I'm not quite sure how to proceed. Is there a way to find out the id
> of the shard and force some kind of re-sync of its data from the
> master zone? I'm unable to keep the non-master zone RGWs running
> because it leaves the master zone in a bad state, with rgw dying
> every now and then.
>
> Regards,
> Andreas
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>



--
Andreas Calminder
System Administrator
IT Operations Core Services

Klarna AB (publ)
Sveavägen 46, 111 34 Stockholm
Tel: +46 8 120 120 00
Reg no: 556737-0431
klarna.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
