On 2/4/14 17:06, Craig Lewis wrote:
Using my --max-entries fix (https://github.com/ceph/radosgw-agent/pull/8), I think I see what's happening.

Shut down replication.

Upload 6 objects to an empty bucket on the master:
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test0.jpg
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test1.jpg
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test2.jpg
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test3.jpg
  2014-02-07 02:03 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test4.jpg
  2014-02-07 02:03 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test5.jpg

None show on the slave, because replication is down.

Start radosgw-agent with --max-entries=2 (a value of 1 doesn't seem to replicate anything).

Check the contents of the slave after pass #1:
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test0.jpg

Check the contents of the slave after pass #10:
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test0.jpg

Leave replication running.

Upload 1 object, test6.jpg, to the master. Check the master:
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test0.jpg
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test1.jpg
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test2.jpg
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test3.jpg
  2014-02-07 02:03 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test4.jpg
  2014-02-07 02:03 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test5.jpg
  2014-02-07 02:06 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test6.jpg

Check the contents of the slave after the next pass:
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test0.jpg
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test1.jpg

Upload another file, test7.jpg, to the master:
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test0.jpg
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test1.jpg
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test2.jpg
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test3.jpg
  2014-02-07 02:03 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test4.jpg
  2014-02-07 02:03 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test5.jpg
  2014-02-07 02:06 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test6.jpg
  2014-02-07 02:08 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test7.jpg

The slave doesn't get it this time:
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test0.jpg
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test1.jpg

Upload another file, test8.jpg, to the master:
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test0.jpg
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test1.jpg
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test2.jpg
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test3.jpg
  2014-02-07 02:03 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test4.jpg
  2014-02-07 02:03 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test5.jpg
  2014-02-07 02:06 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test6.jpg
  2014-02-07 02:08 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test7.jpg
  2014-02-07 02:10 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test8.jpg

The slave gets the 3rd file:
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test0.jpg
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test1.jpg
  2014-02-07 02:02 10k dc5674336e2212a0819b7abcb811e323 s3://bucket1/test2.jpg
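In case anyone wants to repeat this, the upload/check cycle above is easy to script. Here's a minimal sketch using boto; the hostnames, credentials, and bucket name are placeholders, and it just prints mtime/etag/name from the slave rather than the s3cmd listings shown above:

import boto
import boto.s3.connection

def connect(host, access_key, secret_key):
    # Plain-path requests tend to work better against radosgw than the
    # bucket-in-hostname calling format.
    return boto.connect_s3(
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        host=host,
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

master = connect('rgw-master.example.com', 'MASTER_KEY', 'MASTER_SECRET')
slave = connect('rgw-slave.example.com', 'SLAVE_KEY', 'SLAVE_SECRET')

# Upload the initial test objects to the master zone (all identical,
# so they share an md5 like in the listings above).
bucket = master.get_bucket('bucket1')
for i in range(6):
    key = bucket.new_key('test%d.jpg' % i)
    key.set_contents_from_string('x' * 10240)

# After each radosgw-agent pass, list the slave to see what made it over.
for key in slave.get_bucket('bucket1').list():
    print('%s %s %s' % (key.last_modified, key.etag, key.name))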
So I think the problem is caused by the shard marker being set to the current marker after every pass, even when the bucket replication stops early because it hit max-entries. Updating the shard marker by uploading another file triggers another pass on the bucket, and the bucket marker itself is being tracked correctly.

I would prefer to track the shard marker more accurately, but I don't see any way to get the last shard marker given the last bucket entry. If I track the shard marker correctly, the stats I'm generating are still somewhat useful (if incomplete): I'll be able to see when replication falls behind because the graphs keep growing.

The alternative is to change the bucket sync so that it loops until it has replicated everything up to the shard marker (rough sketch below). In that case, I'll be able to see that replication is falling behind because each pass takes longer and longer to complete.

What do you guys think?

Either way, I believe all my data is waiting to be replicated. I just need to fix this issue, and upload another object to every bucket that's behind.
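To make that second option concrete, this is roughly the shape of loop I have in mind for the bucket sync. It's only a sketch: fetch_log_entries, apply_entry, and save_marker are hypothetical callables standing in for the real radosgw-agent plumbing, not its actual API.

def sync_bucket_shard(shard, fetch_log_entries, apply_entry, save_marker,
                      max_entries=1000):
    # Drain the bucket index log completely before the shard marker is
    # allowed to advance, instead of stopping after one max-entries batch.
    marker = shard.marker
    while True:
        entries = fetch_log_entries(shard, marker, max_entries)
        if not entries:
            break  # nothing left in the log; we're caught up
        for entry in entries:
            apply_entry(entry)
        marker = entries[-1].marker
        # Persist the marker only after the whole batch has applied, so a
        # crash re-processes at most one batch.
        save_marker(shard, marker)
        if len(entries) < max_entries:
            break  # short batch means we've reached the end of the log
    return marker

With that shape, a backlogged bucket keeps syncing within a single pass instead of advancing two entries per upload, which is why a growing pass time would be the signal that replication is falling behind.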