Re: RGW Multisite metadata sync init

Ok, I've been testing, investigating, and researching for the last week, and I don't have any problems with data syncing.  The clients on one side are creating multipart objects, while the multisite sync writes them on the other side as whole objects, and one of the datacenters is slower at cleaning up the shadow files.  That accounts for the discrepancy in object counts between the data pools in the two datacenters.  I wrote a tool that, for each bucket in a realm, does a recursive listing of all objects in both datacenters and compares the two lists for any differences.  The data is definitely in sync between the 2 datacenters, down to the modified time and byte count of each object in S3.
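Roughly, the comparison boils down to something like the sketch below.  This is a simplified re-creation, not the exact tool; the endpoints are placeholders, and it assumes the AWS CLI and jq are available with credentials that can read every bucket.

    #!/bin/bash
    # Simplified sketch of the per-bucket comparison described above.
    # ENDPOINT_A / ENDPOINT_B are placeholders for an RGW endpoint in each datacenter.
    ENDPOINT_A=http://rgw.dc1.example.com:7480
    ENDPOINT_B=http://rgw.dc2.example.com:7480

    for bucket in $(radosgw-admin bucket list | jq -r '.[]'); do
        # Recursive listing of key, size, and mtime for every object on each side.
        aws s3api list-objects-v2 --bucket "$bucket" --endpoint-url "$ENDPOINT_A" \
            --query 'Contents[].[Key,Size,LastModified]' --output text | sort > "/tmp/${bucket}.a"
        aws s3api list-objects-v2 --bucket "$bucket" --endpoint-url "$ENDPOINT_B" \
            --query 'Contents[].[Key,Size,LastModified]' --output text | sort > "/tmp/${bucket}.b"
        # Any difference in keys, sizes, or modification times shows up here.
        diff -u "/tmp/${bucket}.a" "/tmp/${bucket}.b" > /dev/null \
            || echo "bucket ${bucket} differs between datacenters"
    done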

The metadata is still not syncing for the other realm, though.  If I run `metadata sync init`, the second datacenter catches up with all of the new users, but until I do that, newly created users on the primary side don't exist on the secondary side.  `metadata sync status`, `sync status`, `metadata sync run` (only left running for 30 minutes before I ctrl+c'd it), etc. don't show any problems... but the new users just don't exist on the secondary side until I run `metadata sync init`.  I created a new bucket with the new user and the bucket shows up in the second datacenter, but with no objects, because the objects don't have a valid owner there.
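For anyone following along, a quick way to check whether a specific user's metadata ever reached the secondary zone is to query it directly on a node in each datacenter; the uid below is a placeholder.

    # Run on a node in each datacenter; 'newuser' is a placeholder uid.
    radosgw-admin metadata list user | grep newuser   # is the uid present in the metadata at all?
    radosgw-admin user info --uid=newuser             # errors on the secondary if the user never synced
    radosgw-admin metadata sync status                # still reports no problems in this case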

Thank you all for the help with the data sync issue; you pushed me in good directions.  Does anyone have any insight into what is preventing the metadata from syncing in the other realm?  I have 2 realms being synced using multisite and only 1 of them isn't getting the metadata across.  As far as I can tell, they are configured identically.
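One way to sanity-check that is to dump the current period for each realm on both sites and diff them, since the period carries the zone/zonegroup layout, endpoints, and sync flags.  The realm names below are placeholders.

    # Realm names are placeholders; run on each site and compare.
    radosgw-admin period get --rgw-realm=realm-working > /tmp/realm-working.json
    radosgw-admin period get --rgw-realm=realm-broken  > /tmp/realm-broken.json
    # Sort keys before diffing so only real configuration differences show up.
    diff <(jq -S . /tmp/realm-working.json) <(jq -S . /tmp/realm-broken.json)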

On Thu, Aug 31, 2017 at 12:46 PM David Turner <drakonstein@xxxxxxxxx> wrote:
All of the messages from `sync error list` are listed below.  The number on the left is how many times each error message appears.

   1811                     "message": "failed to sync bucket instance: (16) Device or resource busy"
      7                     "message": "failed to sync bucket instance: (5) Input\/output error"
     65                     "message": "failed to sync object"
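(The counts were produced by aggregating the JSON output along these lines; the exact pipeline is just one way to do it.)

    # Count how often each distinct error message appears in the sync error log.
    radosgw-admin sync error list | grep '"message"' | sort | uniq -c | sort -rn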

On Tue, Aug 29, 2017 at 10:00 AM Orit Wasserman <owasserm@xxxxxxxxxx> wrote:

Hi David,

On Mon, Aug 28, 2017 at 8:33 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
The vast majority of the sync error list is "failed to sync bucket instance: (16) Device or resource busy".  I can't find anything on Google about this error message in relation to Ceph.  Does anyone have any idea what it means and/or how to fix it?

Those are transient errors resulting from several radosgw instances trying to acquire the same sync log shard lease. They don't affect the sync progress.
Are there any other errors?

Orit

On Fri, Aug 25, 2017 at 2:48 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:

Hi David,

The 'data sync init' command won't touch any actual object data, no. Resetting the data sync status will just cause a zone to restart a full sync of the --source-zone's data changes log. This log only lists which buckets/shards have changes in them, which causes radosgw to consider them for bucket sync. So while the command may silence the warnings about data shards being behind, it's unlikely to resolve the issue with missing objects in those buckets.

When data sync is behind for an extended period of time, it's usually because it's stuck retrying previous bucket sync failures. The 'sync error list' may help narrow down where those failures are.

There is also a 'bucket sync init' command to clear the bucket sync status. Following that with a 'bucket sync run' should restart a full sync on the bucket, pulling in any new objects that are present on the source-zone. I'm afraid that those commands haven't seen a lot of polish or testing, however.
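Concretely, the per-bucket variant is something like the following; the bucket name is a placeholder and, as said, these commands haven't seen a lot of polish or testing.

    # Bucket name is a placeholder; run against the zone that is missing objects.
    radosgw-admin bucket sync init --bucket=mybucket --source-zone=public-atl01
    radosgw-admin bucket sync run  --bucket=mybucket --source-zone=public-atl01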

Casey


On 08/24/2017 04:15 PM, David Turner wrote:
Apparently the data shards that are behind go in both directions, but only one zone is aware of the problem.  Each cluster has objects in its data pool that the other doesn't have.  I'm thinking about running `data sync init` on both sides (one at a time) to get them back on the same page.  Does anyone know whether running `data sync init` on a zone will overwrite any local data that zone has and the other doesn't?

On Thu, Aug 24, 2017 at 1:51 PM David Turner <drakonstein@xxxxxxxxx> wrote:
After restarting the 2 RGW daemons on the second site again, everything caught up on the metadata sync.  Is there something about having 2 RGW daemons on each side of the multisite that might be causing the sync to go stale?  I have another realm set up the same way that is having a hard time with its data shards being behind.  I haven't told it to resync, but yesterday I noticed 90 shards were behind.  It has caught back up to only 17 shards behind, but the oldest change not applied is 2 months old, and no order of restarting the RGW daemons is helping to resolve this.

On Thu, Aug 24, 2017 at 10:59 AM David Turner <drakonstein@xxxxxxxxx> wrote:
I have an RGW multisite 10.2.7 setup for bi-directional syncing.  It has been operational for 5 months and working fine.  I recently created a new user on the master zone, used that user to create a bucket, and put a public-acl object in it.  The bucket was created on the second site, but the user was not, and the object errors out complaining that the access_key doesn't exist.

That led me to think that the metadata isn't syncing, while bucket and data both are.  I've also confirmed that data is syncing for other buckets in both directions. The sync status from the second site was this:

  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source

The sync status led me to think that the second site believes it is up to date, even though it is missing a freshly created user.  I restarted all of the RGW daemons for the zonegroup, but that didn't trigger anything to fix the missing user on the second site.  I did some googling, found the sync init commands mentioned in a few ML posts, ran `metadata sync init`, and now have this as the sync status:

  metadata sync preparing for full sync
                full sync: 64/64 shards
                full sync: 0 entries to sync
                incremental sync: 0/64 shards
                metadata is behind on 70 shards
                oldest incremental change not applied: 2017-03-01 21:13:43.0.126971s
      data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source

It definitely triggered a fresh sync and told it to forget what it had previously applied, since the date of the oldest change not applied is the day we initially set up multisite for this zone.  The problem is that that was over 12 hours ago and the sync status hasn't caught up on any shards yet.
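For reference, the only two ways I know of to push the full sync along after `metadata sync init` are restarting the gateways in the secondary zone or running the sync manually in the foreground (assuming a systemd install; the unit name may differ):

    # On the secondary zone's gateway hosts (systemd install assumed):
    systemctl restart ceph-radosgw.target
    # or run the metadata sync manually in the foreground:
    radosgw-admin metadata sync run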

Does anyone have any suggestions other than blowing away the second site and setting it back up from scratch (the only option I can think of at this point)?

Thank you,
David Turner


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

