Multisite sync stopped working, 1 shards are recovering


Hi All,

 

We are running a multisite setup on Luminous with BlueStore. It has worked perfectly since installation, around the time 12.2.2 came out. A few days ago we upgraded the clusters from 12.2.5 to 12.2.7, and since then one of the clusters no longer syncs.

 

Primary site:

# radosgw-admin sync status

          realm 87f16146-0729-4b37-9462-dd5e6d97b427 (pro)

      zonegroup 9fad4a8d-9a7b-4649-a54a-856450635808 (be)

           zone 4ed07bb2-a80b-4c69-aa15-fdc17ae6f5f2 (bccm-pro)

  metadata sync no sync (zone is master)

      data sync source: ad420c46-3ef3-430a-afef-bff78e26d410 (bccl-pro)

                        syncing

                        full sync: 0/128 shards

                        incremental sync: 128/128 shards

                        data is caught up with source

 

Secondary site:

# radosgw-admin sync status

          realm 87f16146-0729-4b37-9462-dd5e6d97b427 (pro)

      zonegroup 9fad4a8d-9a7b-4649-a54a-856450635808 (be)

           zone ad420c46-3ef3-430a-afef-bff78e26d410 (bccl-pro)

  metadata sync syncing

                full sync: 0/64 shards

                incremental sync: 64/64 shards

                metadata is caught up with master

      data sync source: 4ed07bb2-a80b-4c69-aa15-fdc17ae6f5f2 (bccm-pro)

                        syncing

                        full sync: 80/128 shards

                        full sync: 4 buckets to sync

                        incremental sync: 48/128 shards

                        data is behind on 80 shards

                        behind shards: [0,6,25,44,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,115,116,117,118,119,120,121,122,123,124,125,126,127]

                        1 shards are recovering

                        recovering shards: [27]

 

When we first noticed the problem, only 3 shards were behind and shard 27 was recovering.  One of the things we tried was a radosgw-admin data sync init, in an attempt to get everything syncing again.  Since then, 48 shards appear to have completed their full sync and are now syncing incrementally, but the rest just stay like this.  Could the recovery of shard 27 be blocking the rest of the sync?
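
For reference, this is roughly what we ran on the secondary zone; I am reconstructing it from memory, so the exact invocation may differ slightly:

# radosgw-admin data sync init --source-zone bccm-pro

followed, as far as I remember, by a restart of the radosgw instances on that site.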

 

radosgw-admin sync error list shows a number of errors from the time of the upgrade, mostly “failed to sync object(5) Input/output error” and “failed to sync bucket instance: (5) Input/output error”.  Does this mean radosgw was unable to write to the pool?
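
For completeness, that list came from the command below; my assumption is that sync error trim would only clear the old log entries once things are healthy again, but we have not tried it yet:

# radosgw-admin sync error list
# radosgw-admin sync error trim   (not run yet; assumption that this only clears the error log)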

 

# radosgw-admin data sync status --shard-id 27 --source-zone bccm-pro

{

    "shard_id": 27,

    "marker": {

        "status": "incremental-sync",

        "marker": "1_1534494893.816775_131867195.1",

        "next_step_marker": "",

        "total_entries": 1,

        "pos": 0,

        "timestamp": "0.000000"

    },

    "pending_buckets": [],

    "recovering_buckets": [

        "pro-registry:4ed07bb2-a80b-4c69-aa15-fdc17ae6f5f2.314303.1:26"

    ]

}
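
The recovering entry appears to point at our pro-registry bucket. I assume its per-bucket sync state could be inspected (and possibly re-run) with something like the commands below, but I have not verified these flags on 12.2.7, so please treat this as a sketch:

# radosgw-admin bucket sync status --bucket=pro-registry --source-zone=bccm-pro
# radosgw-admin bucket sync run --bucket=pro-registry --source-zone=bccm-pro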

 

 

How can we recover shard 27?  Any ideas on how to get this multisite setup healthy again?  I wanted to open an issue in the tracker for this, but it seems a normal user no longer has permission to do so?

 

Many Thanks

 

Dieter

 

 

 

 


