On 09/03/2018 05:02 AM, Xinying Song wrote:
Hi, cephers: We have been suffering from a problem with rgw-multisite. `radosgw-admin sync status` sometimes shows that data shards are behind their peers. If no more log entries are added to the corresponding shard of the peer zone, i.e. there are no new writes, the sync marker of this shard stays stuck on the old marker and does not advance. Restarting the rgw daemon clears the warning. The RGW log shows that syncmarker in the incremental_sync() function has been updated to the peer's newest marker. Gdb shows that the pending and finish_markers variables of marker_tracker are empty. (I forgot to check the syncmarker variable...) I guess this problem is caused by the non-atomic marker update. Since the marker update is handled by an RGWAsyncPutSystemObj op, those ops may be reordered when delivered to rados. Maybe we should add an id_tag attr to ensure this op is atomic. This problem is not easy to reproduce in a testing environment, so I would prefer to ask you for some advice first, in case I'm going the wrong way. Thanks.
I think Yehuda saw this while testing the cloud sync work, and added an RGWLastCallerWinsCR to guarantee the ordering of marker updates in commit 1034a68fd12687ac81e6afc4718dbc8045648034. Does your branch include that commit, or is it based on luminous? We won't be backporting cloud sync as a feature, but we should probably take that one commit - I opened a ticket for this backport at http://tracker.ceph.com/issues/35539.
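To illustrate the general "last caller wins" idea, here is a minimal Python sketch. It is not the code from that commit and does not reflect how RGWLastCallerWinsCR is actually implemented; the class name `LastCallerWinsWriter`, the `write_fn` callback, and the marker strings are all hypothetical. The assumption is that at most one write is in flight at a time, and that any updates arriving in the meantime collapse into the newest one, so an older marker can never land after a newer one.

```python
class LastCallerWinsWriter:
    """Hypothetical sketch: serialize writes so the newest request wins."""

    def __init__(self, write_fn):
        self._write_fn = write_fn   # assumed: performs exactly one write
        self._pending = None        # newest value not yet written
        self._has_pending = False
        self._in_flight = False     # True while a write is being issued

    def call(self, value):
        # Always remember the newest value, replacing any older pending one.
        self._pending = value
        self._has_pending = True
        if self._in_flight:
            # A write is already running; its drain loop will pick this up.
            return
        self._in_flight = True
        try:
            while self._has_pending:
                latest = self._pending
                self._has_pending = False
                self._write_fn(latest)
        finally:
            self._in_flight = False


written = []  # stands in for the marker object stored in rados


def fake_write(marker):
    written.append(marker)
    if marker == "marker-1":
        # Simulate two more updates arriving while this write is in flight.
        writer.call("marker-2")
        writer.call("marker-3")


writer = LastCallerWinsWriter(fake_write)
writer.call("marker-1")
print(written)  # ['marker-1', 'marker-3']
```

In this sketch "marker-2" is never written, because it was superseded before the in-flight write finished, and the stored value ends at the newest marker.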
Thanks, Casey