On Tue, Sep 23, 2014 at 4:54 PM, Craig Lewis <clewis at centraldesktop.com> wrote:
> I've had some issues in my secondary cluster. I'd like to restart
> replication from the beginning, without destroying the data in the secondary
> cluster.
>
> Reading the radosgw-agent and Admin REST API code, I believe I just need to
> stop replication, delete the secondary zone's log_pool, recreate the
> log_pool, and restart replication.
>
> Anybody have any thoughts? I'm still setting up some VMs to test this,
> before I try it in production.
>
>
>
> Background:
> I'm on Emperor (yeah, still need to upgrade). I believe I ran into
> http://tracker.ceph.com/issues/7595 . My read of that patch is that it
> prevents the problem from occurring, but doesn't correct corrupt data. I
> tried applying some of the suggested patches, but they only ignored the
> error, rather than correcting it. I finally dropped the corrupt pool. That
> allowed the stock Emperor binaries to run without crashing. The pool I
> dropped was my secondary zone's log_pool.
>
> Before I dropped the pool, I copied all of the objects to local disk. After
> re-creating the pool, I uploaded the objects.
>
> Now replication is kind of working, but not correctly. I have a number of
> buckets that are being written to in the primary cluster, but no replication
> is occurring. radosgw-agent says a number of shards have >= 1000 log
> entries, but then it never processes the buckets in those shards.
>
> Looking back at the pool's contents on local disk, all of the files are 0
> bytes. So I'm assuming all of the important state was stored in the
> objects' metadata.
>
> I'd like to completely zero out the replication state, then exploit a
> feature in radosgw-agent 1.1 that will only replicate the first 1000 objects
> in buckets, if the bucket isn't being actively written to. Then I can
> restart radosgw-agent 1.2, and let it catch up the active buckets. That'll
> save me many weeks and TB of replication.
>
> Obviously, I'll compare bucket listings between the two clusters when I'm
> done. I'll probably try to catch up the read-only buckets' state at a later
> date.
>

I don't really understand what happened here. Maybe start with trying to
understand why the sync agent isn't replicating anymore? Maybe the
replicalog markers are off?

Yehuda
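
[Editor's aside] For reference, a minimal sketch of backing up a pool's objects together with their xattrs via the python-rados bindings, since a data-only copy of rgw log objects produces exactly the 0-byte files described above; much of the replication state lives in object metadata. This is not the procedure Craig actually used: the pool name ".us-secondary.log", the output directory, and the hex/JSON encoding are assumptions for illustration, and omap entries (which rgw also uses) are deliberately not handled here.

#!/usr/bin/env python
# Sketch only: dump each object's data *and* its xattrs from a pool.
# Assumes python-rados is installed and /etc/ceph/ceph.conf points at the
# secondary cluster. Pool and output paths below are hypothetical.
import binascii
import json
import os

import rados

POOL = '.us-secondary.log'            # hypothetical log_pool name
OUTDIR = '/var/tmp/log_pool_backup'   # hypothetical destination

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx(POOL)
    try:
        if not os.path.isdir(OUTDIR):
            os.makedirs(OUTDIR)
        for obj in ioctx.list_objects():
            size, _mtime = ioctx.stat(obj.key)
            # Object payload; legitimately 0 bytes for many rgw log objects.
            with open(os.path.join(OUTDIR, obj.key + '.data'), 'wb') as f:
                if size:
                    f.write(ioctx.read(obj.key, size))
            # Extended attributes, hex-encoded so they survive a round trip.
            xattrs = {name: binascii.hexlify(value).decode('ascii')
                      for name, value in ioctx.get_xattrs(obj.key)}
            with open(os.path.join(OUTDIR, obj.key + '.xattrs.json'), 'w') as f:
                json.dump(xattrs, f)
            # NOTE: omap key/value pairs are not captured here; rgw keeps
            # state there too (cf. `rados listomapvals`), so a complete
            # backup/restore would need to handle omap as well.
    finally:
        ioctx.close()
finally:
    cluster.shutdown()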