Re: RGW Multisite metadata sync init

All of the messages from `sync error list` are listed below; the number on the left is how many times each message appears.

   1811                     "message": "failed to sync bucket instance: (16) Device or resource busy"
      7                     "message": "failed to sync bucket instance: (5) Input\/output error"
     65                     "message": "failed to sync object"
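For reference, counts like these can be produced by piping the error list through a small shell pipeline. The sample lines below stand in for live `radosgw-admin sync error list` output so the filter itself can be tried anywhere:

```shell
# Tally each distinct "message" line, most frequent first. Against a live
# cluster, replace the printf sample with:
#   radosgw-admin sync error list | grep -o '"message": "[^"]*"'
printf '%s\n' \
  '"message": "failed to sync bucket instance: (16) Device or resource busy"' \
  '"message": "failed to sync bucket instance: (16) Device or resource busy"' \
  '"message": "failed to sync object"' \
  | sort | uniq -c | sort -rn
```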

On Tue, Aug 29, 2017 at 10:00 AM Orit Wasserman <owasserm@xxxxxxxxxx> wrote:

Hi David,

On Mon, Aug 28, 2017 at 8:33 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
The vast majority of the sync error list is "failed to sync bucket instance: (16) Device or resource busy".  I can't find anything on Google about this error message in relation to Ceph.  Does anyone have any idea what this means and/or how to fix it?

Those are transient errors that result from several radosgw daemons trying to acquire the same sync log shard lease. They don't affect sync progress.
Are there any other errors?
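One way to check is to filter the benign lease-contention messages out, so that anything left needs a closer look. A sketch, with sample lines standing in for live `radosgw-admin sync error list` output:

```shell
# Drop the harmless lease-contention errors; whatever remains is worth
# investigating. Pipe real `sync error list` output through the same grep.
printf '%s\n' \
  '"message": "failed to sync bucket instance: (16) Device or resource busy"' \
  '"message": "failed to sync object"' \
  | grep -v 'Device or resource busy'
```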

Orit

On Fri, Aug 25, 2017 at 2:48 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:

Hi David,

The 'data sync init' command won't touch any actual object data, no. Resetting the data sync status will just cause a zone to restart a full sync of the --source-zone's data changes log. This log only lists which buckets/shards have changes in them, which causes radosgw to consider them for bucket sync. So while the command may silence the warnings about data shards being behind, it's unlikely to resolve the issue with missing objects in those buckets.

When data sync is behind for an extended period of time, it's usually because it's stuck retrying previous bucket sync failures. The 'sync error list' may help narrow down where those failures are.

There is also a 'bucket sync init' command to clear the bucket sync status. Following that with a 'bucket sync run' should restart a full sync on the bucket, pulling in any new objects that are present on the source-zone. I'm afraid that those commands haven't seen a lot of polish or testing, however.
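For the record, that sequence would look something like the following. The bucket name is a placeholder, and the leading `echo` is left in so nothing runs by accident; drop it to actually execute on a node with access to the cluster:

```shell
# Per-bucket full re-sync as described above: reset the bucket's sync status,
# then run the sync to pull in objects missing from the source zone.
BUCKET=my-bucket   # placeholder; substitute the affected bucket
echo radosgw-admin bucket sync init --bucket="$BUCKET"
echo radosgw-admin bucket sync run --bucket="$BUCKET"
```

Given Casey's caveat that these commands haven't seen much testing, trying them on a non-critical bucket first seems prudent.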

Casey


On 08/24/2017 04:15 PM, David Turner wrote:
Apparently the data shards that are behind go in both directions, but only one zone is aware of the problem.  Each cluster has objects in its data pool that the other doesn't have.  I'm thinking about initiating a `data sync init` on both sides (one at a time) to get them back on the same page.  Does anyone know whether running `data sync init` on a zone will overwrite any local data that zone has but the other doesn't?

On Thu, Aug 24, 2017 at 1:51 PM David Turner <drakonstein@xxxxxxxxx> wrote:
After restarting the 2 RGW daemons on the second site again, everything caught up on the metadata sync.  Is there something about having 2 RGW daemons on each side of the multisite that might be causing an issue with the sync getting stale?  I have another realm set up the same way that is having a hard time with its data shards being behind.  I haven't told them to resync, but yesterday I noticed 90 shards were behind.  It's caught back up to only 17 shards behind, but the oldest change not applied is 2 months old and no order of restarting RGW daemons is helping to resolve this.

On Thu, Aug 24, 2017 at 10:59 AM David Turner <drakonstein@xxxxxxxxx> wrote:
I have an RGW Multisite 10.2.7 setup for bi-directional syncing.  This has been operational for 5 months and working fine.  I recently created a new user on the master zone, used that user to create a bucket, and uploaded an object with a public ACL into it.  The bucket was created on the second site, but the user was not, and requests for the object error out complaining that the access_key doesn't exist.

That led me to think that the metadata isn't syncing, while bucket and data both are.  I've also confirmed that data is syncing for other buckets as well in both directions. The sync status from the second site was this.

  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source

Sync status leads me to think that the second site believes it is up to date, even though it is missing a freshly created user.  I restarted all of the rgw daemons for the zonegroup, but it didn't trigger anything to fix the missing user in the second site.  I did some googling, found the sync init commands mentioned in a few ML posts, and ran `metadata sync init`; now I have this as the sync status.

  metadata sync preparing for full sync
                full sync: 64/64 shards
                full sync: 0 entries to sync
                incremental sync: 0/64 shards
                metadata is behind on 70 shards
                oldest incremental change not applied: 2017-03-01 21:13:43.0.126971s
      data sync source: f4c12327-4721-47c9-a365-86332d84c227 (public-atl01)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source

It definitely triggered a fresh sync and told it to forget what it had previously applied, since the date of the oldest change not applied is the day we initially set up multisite for this zone.  The problem is that this was over 12 hours ago and the sync status hasn't caught up on any shards yet.
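A quick way to tell whether the shard count is actually moving is to pull the number out of the status text and re-check it periodically. A sketch, with a heredoc standing in for live `radosgw-admin sync status` output:

```shell
# Extract the 'behind on N shards' count from sync status text; re-run this
# over time to see whether the number is shrinking. The heredoc is a sample
# of the output above; pipe the real command's output through the same sed.
status=$(cat <<'EOF'
  metadata sync preparing for full sync
                full sync: 64/64 shards
                metadata is behind on 70 shards
EOF
)
echo "$status" | sed -n 's/.*behind on \([0-9]*\) shards.*/\1/p'
```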

Does anyone have any suggestions other than blast the second site and set it back up with a fresh start (the only option I can think of at this point)?

Thank you,
David Turner


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

