Re: RGW multisite sync, data sync issues

Yehuda Sadeh-Weinraub <ysadehwe@xxxxxxxxxx> · Thu, 1 Jun 2017 09:00:52 -0700

On Wed, May 31, 2017 at 6:49 AM, Andreas Calminder
<andreas.calminder@xxxxxxxxxx> wrote:
> Hello,
> I asked on ceph-users and thought I'd post here as well, in case anyone
> knows the ins and outs of rgw.
> I've got a sync issue with my multisite setup. There are 2 zones in 1
> zone group in 1 realm. Data sync in the non-master zone has been stuck
> on "incremental sync is behind by 1 shard". This wasn't noticed until
> the radosgw instances in the master zone started dying from
> out-of-memory issues. All radosgw instances in the non-master zone were
> then shut down to keep services in the master zone running while
> troubleshooting the issue.
>
> From the rgw logs in the master zone I see entries like:
>
> 2017-05-29 16:10:34.717988 7fbbc1ffb700  0 ERROR: failed to sync object: 12354/BUCKETNAME:be8fa19b-ad79-4cd8-ac7b-1e14fdc882f6.2374181.27/dirname_1/dirname_2/filename_1.ext
> 2017-05-29 16:10:34.718016 7fbbc1ffb700  0 ERROR: failed to sync object: 12354/BUCKETNAME:be8fa19b-ad79-4cd8-ac7b-1e14fdc882f6.2374181.27/dirname_1/dirname_2/filename_2.ext
> 2017-05-29 16:10:34.718504 7fbbc1ffb700  0 ERROR: failed to fetch remote data log info: ret=-5
> 2017-05-29 16:10:34.719443 7fbbc1ffb700  0 ERROR: a sync operation returned error
> 2017-05-29 16:10:34.720291 7fbc167f4700  0 store->fetch_remote_obj() returned r=-5
>
> Sync status in the non-master zone reports that the metadata is up to
> date, that the data sync is behind on 1 shard, and that the oldest
> incremental change not applied is about 2 weeks old.
>
> I'm not quite sure how to proceed. Is there a way to find out the id
> of the shard and force some kind of re-sync of its data from the
> master zone? I can't keep the non-master zone's rgw instances running,
> because that leaves the master zone in a bad state, with rgw dying
> every now and then.
>

Maybe start by looking at the sync error log:

$ radosgw-admin sync error list

There are also radosgw-admin commands that query the status of the
different logs and of the different sync processes, e.g.:

$ radosgw-admin bilog status
$ radosgw-admin datalog status
$ radosgw-admin mdlog status

and

$ radosgw-admin bucket sync status
$ radosgw-admin data sync status
$ radosgw-admin metadata sync status

All of these commands need extra parameters that specify the resource
you're aiming at (e.g., which bucket, which data shard). You probably
don't need to deal with the metadata sync. The log status commands
should be run on the source zone, and the sync status commands on the
destination zone.
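
As a rough sketch of what the parameterized status commands might look
like: the bucket name, zone name and shard id below are hypothetical
placeholders, and the exact set of accepted options (--shard-id in
particular) may differ between releases, so check radosgw-admin --help
on your version. The helper only prints the commands unless DRY_RUN=0
is set.

```shell
#!/bin/sh
# Sketch only. BUCKET, SOURCE_ZONE and SHARD_ID are hypothetical
# placeholders; substitute the values from your own setup.
BUCKET="${BUCKET:-BUCKETNAME}"
SOURCE_ZONE="${SOURCE_ZONE:-master-zone}"
SHARD_ID="${SHARD_ID:-0}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Log status: meant to be run on the source zone.
log_status() {
    run radosgw-admin bilog status --bucket="$BUCKET"
    run radosgw-admin datalog status --shard-id="$SHARD_ID"
}

# Sync status: meant to be run on the destination zone.
sync_status() {
    run radosgw-admin bucket sync status --bucket="$BUCKET" --source-zone="$SOURCE_ZONE"
    run radosgw-admin data sync status --source-zone="$SOURCE_ZONE"
}
```

After sourcing the file, call log_status on a host in the source zone
and sync_status on a host in the destination zone, then compare the
markers the two sides report for the same bucket or shard.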

You can trigger a full resync of the various entities with the
following commands:

$ radosgw-admin bucket sync init
$ radosgw-admin data sync init
$ radosgw-admin metadata sync init
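
A hedged sketch of the init commands with their parameters filled in,
assuming a bucket called BUCKETNAME and a source zone called
master-zone (both hypothetical placeholders). Since init discards the
existing sync state, the helper defaults to printing the commands and
only executes them with DRY_RUN=0.

```shell
#!/bin/sh
# Sketch only. BUCKET and SOURCE_ZONE are hypothetical placeholders.
BUCKET="${BUCKET:-BUCKETNAME}"
SOURCE_ZONE="${SOURCE_ZONE:-master-zone}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Reset sync state for a single bucket (run on the destination zone).
resync_bucket() {
    run radosgw-admin bucket sync init --bucket="$BUCKET" --source-zone="$SOURCE_ZONE"
}

# Reset data sync state for everything coming from SOURCE_ZONE
# (run on the destination zone).
resync_data() {
    run radosgw-admin data sync init --source-zone="$SOURCE_ZONE"
}
```

Starting with resync_bucket for the affected bucket is the narrower
option; resync_data redoes data sync for the whole zone. My
understanding is that the radosgw instances in the destination zone
need to be restarted afterwards for the new sync state to be picked
up, but verify that against the documentation for your release.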

Yehuda