For the record, I have one bucket in my
slave zone that caught up to the master zone. I stopped adding
new data to my first bucket, and replication stopped. I started
tickling the bucket by uploading and deleting a 0-byte file every
5 minutes. Now the slave has all of the files in that bucket.
I didn't need to use --sync-scope=full.
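For anyone wanting to script the tickle, here's a minimal sketch. The key name and the injected upload/delete callables are hypothetical; in a real deployment they'd wrap an S3 client (e.g. boto pointed at the master zone's radosgw endpoint), with the script run from cron every 5 minutes:

```python
def tickle(upload, delete, key="_replication_tickle"):
    """Upload then delete a zero-byte object so the bucket log gains
    new entries, prompting radosgw-agent to make another pass."""
    upload(key, b"")   # zero-byte PUT bumps the bucket index log
    delete(key)        # immediately remove it again

if __name__ == "__main__":
    # Stand-in client that just records calls; a real run would use an
    # S3 client against the master zone instead.
    calls = []
    tickle(lambda k, body: calls.append(("PUT", k, len(body))),
           lambda k: calls.append(("DELETE", k)))
    print(calls)
```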
I'm still importing faster than I can replicate, but I know how to
deal with it now. The master zone has nearly completed its
import. Once that happens, replication should be able to catch up
in a couple of weeks, and stay caught up.
Thanks for all the help!
On 2/7/14 11:38, Craig Lewis wrote:
I have confirmed this in production,
with the default max-entries.
I have a bucket that I'm no longer writing to. Radosgw-agent
had stopped replicating this bucket. radosgw-admin bucket stats
shows that the slave is missing ~600k objects.
I uploaded a 1 byte file to the bucket. On the next pass,
radosgw-agent replicated 1000 entries.
I'm uploading and deleting the same file every 5 minutes. I'm
using more inter-colo bandwidth now. This bucket is catching
up, slowly.
For now, I'm going to graph the delta of the total number of
objects in both clusters. If the slave is higher, it's catching
up. If it's lower, it's falling behind.
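A sketch of that delta calculation, assuming the JSON shape of `radosgw-admin bucket stats --format=json` (a list of buckets, each with a num_objects count under its usage categories; treat the exact field names as an assumption to verify against your version):

```python
import json

def total_objects(stats_json):
    """Sum num_objects across all buckets and usage categories in the
    output of `radosgw-admin bucket stats --format=json`."""
    total = 0
    for bucket in json.loads(stats_json):
        for category in bucket.get("usage", {}).values():
            total += category.get("num_objects", 0)
    return total

# Hypothetical sample output from each zone:
master = '[{"bucket": "bucket1", "usage": {"rgw.main": {"num_objects": 600000}}}]'
slave  = '[{"bucket": "bucket1", "usage": {"rgw.main": {"num_objects": 400000}}}]'

# Graph this delta over time: shrinking means the slave is catching up,
# growing means replication is falling behind.
print(total_objects(master) - total_objects(slave))
```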
On 2/6/14 18:32, Craig Lewis wrote:
On 2/4/14 17:06, Craig Lewis wrote:
Now that I've started seeing missing objects, I'm not able to
download objects that should be on the slave if replication is
up to date. Either it's not up to date, or it's skipping
objects every pass.
Using my --max-entries fix (https://github.com/ceph/radosgw-agent/pull/8),
I think I see what's happening.
Shut down replication
Upload 6 objects to an empty bucket on the master:
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test0.jpg
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test1.jpg
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test2.jpg
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test3.jpg
2014-02-07 02:03  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test4.jpg
2014-02-07 02:03  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test5.jpg
None show on the slave, because replication is down.
Start radosgw-agent --max-entries=2 (1 doesn't seem to replicate anything).
Check contents of slave after pass #1:
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test0.jpg
Check contents of slave after pass #10:
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test0.jpg
Leave replication running
Upload 1 object, test6.jpg, to the master. Check the master:
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test0.jpg
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test1.jpg
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test2.jpg
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test3.jpg
2014-02-07 02:03  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test4.jpg
2014-02-07 02:03  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test5.jpg
2014-02-07 02:06  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test6.jpg
Check contents of slave after next pass:
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test0.jpg
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test1.jpg
Upload another file, test7.jpg, to the master:
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test0.jpg
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test1.jpg
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test2.jpg
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test3.jpg
2014-02-07 02:03  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test4.jpg
2014-02-07 02:03  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test5.jpg
2014-02-07 02:06  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test6.jpg
2014-02-07 02:08  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test7.jpg
The slave doesn't get it this time:
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test0.jpg
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test1.jpg
Upload another file, test8.jpg, to the master:
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test0.jpg
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test1.jpg
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test2.jpg
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test3.jpg
2014-02-07 02:03  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test4.jpg
2014-02-07 02:03  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test5.jpg
2014-02-07 02:06  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test6.jpg
2014-02-07 02:08  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test7.jpg
2014-02-07 02:10  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test8.jpg
The slave gets the 3rd file:
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test0.jpg
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test1.jpg
2014-02-07 02:02  10k  dc5674336e2212a0819b7abcb811e323  s3://bucket1/test2.jpg
So I think the problem is caused by the shard marker being set
to the current marker after every pass, even when bucket
replication is capped by max-entries.
Updating the shard marker by uploading a file causes another
pass on the bucket, and the bucket marker is being tracked
correctly.
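Here's my reading of the bug as a toy simulation. The marker handling below is a simplification of what radosgw-agent actually does, not its real code: advancing the shard marker to the newest log entry after a capped pass silently drops everything between max-entries and the end of the log.

```python
def sync_pass(log, shard_marker, max_entries, buggy):
    """Replicate up to max_entries log entries past shard_marker.
    Returns (replicated_entries, new_shard_marker)."""
    pending = log[shard_marker:]
    batch = pending[:max_entries]
    if buggy:
        # Bug: shard marker jumps to the current end of the log,
        # even though only max_entries entries were replicated.
        new_marker = len(log)
    else:
        # Correct: shard marker only advances past what was replicated.
        new_marker = shard_marker + len(batch)
    return batch, new_marker

log = ["test%d.jpg" % i for i in range(6)]   # 6 uploads while replication was down

replicated, marker = sync_pass(log, 0, 2, buggy=True)
print(replicated)        # ['test0.jpg', 'test1.jpg']
# Next pass sees nothing new -- test2..test5 are skipped until a new
# upload moves the log past the (wrong) marker.
print(sync_pass(log, marker, 2, buggy=True)[0])   # []

replicated, marker = sync_pass(log, 0, 2, buggy=False)
print(sync_pass(log, marker, 2, buggy=False)[0])  # ['test2.jpg', 'test3.jpg']
```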
I would prefer to track the shard marker better, but I don't see
any way to get the last shard marker given the last bucket
entry. If I track the shard marker correctly, then the stats
I'm generating are still somewhat useful (if incomplete). I'll
be able to see when replication falls behind because the graphs
keep growing.
The alternative is to change the bucket sync so that it loops
until it's replicated everything up to the shard marker. In
this case, I'll be able to see that replication is falling
behind because each pass takes longer and longer to complete.
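A sketch of that looping alternative (again a toy model with the shard log as a simple list, not radosgw-agent's real code): keep taking max-entries batches until the bucket has replicated everything up to the shard marker snapshotted at the start of the pass.

```python
def sync_until_marker(log, shard_marker, max_entries):
    """Loop over max_entries batches until everything up to the shard
    marker taken at the start of the pass has been replicated."""
    target = len(log)            # shard marker snapshot for this pass
    replicated = []
    while shard_marker < target:
        batch = log[shard_marker:shard_marker + max_entries]
        replicated.extend(batch)         # one inner replication batch
        shard_marker += len(batch)
    return replicated, shard_marker

log = ["test%d.jpg" % i for i in range(6)]
objs, marker = sync_until_marker(log, 0, 2)
print(objs)     # all six objects replicated in one (longer) pass
print(marker)   # 6
```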
What do you guys think?
Either way, I believe all my data is waiting to be replicated.
I just need to fix this issue, and upload another object to
every bucket that's behind.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com