the issue of rgw sync concurrency

star fan <jfanix@xxxxxxxxx> · Thu, 15 Jul 2021 19:36:13 +0800

We observed abnormal sync status when running multiple rgw (15.2.14)
instances with multisite sync, so I dug into the rgw sync code.
If I understand it correctly, there is a problem in how rgw sync
handles concurrency.
The critical process, which we want to run only once, is currently
implemented with the following steps:
1. read shared status object
2. check status
3. lock status
4. critical process
5. store status
6. unlock

In the concurrent case, the critical process can run multiple times,
because each instance acts on the status it read before taking the
lock, which may be stale by then. The lock therefore gives no
protection. The steps should be:
1. read shared status object
2. check status
3. lock status
4. read and check status again
5. critical process
6. store status
7. unlock
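
To make the difference concrete, here is a minimal standalone sketch of
the two step sequences. It does not use any rgw code: SharedStatus,
State and run_workers are hypothetical names, a std::mutex stands in
for the lock on the status object, and a counter stands in for the
critical process. Every worker is forced to read the status before any
worker takes the lock, which is the worst case described above.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the shared sync status object.
enum class State { BuildingFullSyncMaps, Sync };

struct SharedStatus {
  std::mutex lock;  // stands in for the lock on the status object
  std::atomic<State> state{State::BuildingFullSyncMaps};
  int critical_runs = 0;  // times the critical process ran; guarded by lock
};

// Starts `workers` concurrent workers and returns how many of them
// executed the critical process.
int run_workers(int workers, bool recheck_under_lock) {
  assert(workers > 0);
  SharedStatus status;
  std::atomic<int> arrived{0};
  std::vector<std::thread> threads;
  for (int i = 0; i < workers; ++i) {
    threads.emplace_back([&] {
      // Steps 1-2: read and check the status without holding the lock.
      State seen = status.state.load();
      // Force the worst case: all workers read before anyone locks.
      arrived.fetch_add(1);
      while (arrived.load() < workers) std::this_thread::yield();
      if (seen != State::BuildingFullSyncMaps) return;

      // Step 3: lock status.
      std::lock_guard<std::mutex> guard(status.lock);
      // Step 4 (proposed): read and check the status again under the lock.
      if (recheck_under_lock) seen = status.state.load();
      if (seen != State::BuildingFullSyncMaps) return;
      // Step 5: critical process.
      ++status.critical_runs;
      // Step 6: store status.
      status.state.store(State::Sync);
      // Step 7: unlock (guard is released at scope exit).
    });
  }
  for (auto& t : threads) t.join();
  return status.critical_runs;
}
```

Without the re-check, run_workers(4, false) lets all four workers run
the critical process, each under the lock but on a stale status. With
the re-check, run_workers(4, true) runs it once.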

One example, from the metadata sync code:
  do {
    r = run(new RGWReadSyncStatusCoroutine(&sync_env, &sync_status));
    if (r < 0 && r != -ENOENT) {
      tn->log(0, SSTR("ERROR: failed to fetch sync status r=" << r));
      return r;
    }

    switch ((rgw_meta_sync_info::SyncState)sync_status.sync_info.state) {
      case rgw_meta_sync_info::StateBuildingFullSyncMaps:
        tn->log(20, "building full sync maps");
        r = run(new RGWFetchAllMetaCR(&sync_env, num_shards,
                                      sync_status.sync_markers, tn));

In addition, the omap keys are not deleted after an entry finishes
syncing in the full_sync process, so full_sync would also run multiple
times in the concurrent case.
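
As a rough illustration of why the leftover keys matter, the sketch
below models the full sync index as a plain std::map. This is not the
rgw or librados API: EntryIndex, make_index and full_sync_pass are
hypothetical names, the keys are made up, and the actual per-entry sync
is only represented by a counter.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for the omap index that lists entries to sync.
using EntryIndex = std::map<std::string, std::string>;

// Made-up sample keys, for illustration only.
EntryIndex make_index() {
  return {{"bucket:a", ""}, {"bucket:b", ""}, {"user:c", ""}};
}

// Walks the index once and returns how many entries were synced.
std::size_t full_sync_pass(EntryIndex& index, bool remove_after_sync) {
  std::size_t synced = 0;
  for (auto it = index.begin(); it != index.end();) {
    ++synced;  // the real per-entry sync would happen here
    if (remove_after_sync) {
      it = index.erase(it);  // finished entries leave the index
    } else {
      ++it;  // finished entries stay and will be seen again
    }
  }
  // If entries are removed as they finish, nothing is left afterwards.
  assert(!remove_after_sync || index.empty());
  return synced;
}
```

If the keys stay in place, a second pass over the same index syncs
every entry again. If each key is removed once its entry is done, the
second pass finds nothing to do.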

This has no significant impact on data sync, because bucket syncing is
idempotent, but metadata sync is not.