Re: questions about multisite sync QoS

Hi, (and cc ceph-devel)

On 01/18/2018 12:32 AM, Tianshan Qu wrote:
> Hi,
>
> Do we have any design for controlling the sync process?
>
> For now we have 128 data log shards, with at most 20 bucket shards
> syncing concurrently and 20 objects each. My concerns:

> 1. The oldest data may lag: if the 20 bucket shards all have a continuous
> stream of new entries, they will keep syncing forever. A simple fix would
> be to sync 1k objects at a time and then put the bucket shard back into a
> list-type waiting queue.

> 2. We may need a bucket-level priority sync: some data may be hot and some
> cold, so the hot buckets should sync first, perhaps under a configurable
> policy. Reusing the same queue and varying the number of objects synced
> at each priority would largely achieve that goal.

The 'datalog notify' feature offers some help to prioritize the hot bucket shards. RGWDataNotifier will periodically broadcast a datalog notify message to other zones telling them which shards have changed recently. Then RGWDataSyncShardCR::incremental_sync() will start sync for those shards under /* process out of band updates */. These out-of-band updates (along with the error retries) are allowed to exceed the 20-shard spawn_window, though we do wait until we're under that window again before checking for new notifications.
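To make that flow concrete, here's a rough standalone sketch of how the notified shard keys jump ahead of the normal datalog cursor. This is not the actual coroutine code in RGWDataSyncShardCR; apart from the 20-shard spawn_window, all of the names here are made up for illustration:

#include <iostream>
#include <set>
#include <string>
#include <utility>

struct ShardSyncSketch {
  std::set<std::string> modified_shards;  // filled by the notify handler
  std::set<std::string> inflight;         // bucket shards currently syncing
  static constexpr size_t spawn_window = 20;

  // called when a 'datalog notify' arrives from another zone
  void handle_notify(std::set<std::string> keys) {
    modified_shards.merge(std::move(keys));
  }

  // one pass of the incremental sync loop
  void incremental_sync_step() {
    // we only check for new notifications once we're back under the
    // window, but the notified shards themselves may push us past it
    if (inflight.size() < spawn_window) {
      for (const auto& key : modified_shards) {
        spawn_bucket_sync(key);  /* process out of band updates */
      }
      modified_shards.clear();
      // ... then read new datalog entries, staying under spawn_window ...
    }
  }

  void spawn_bucket_sync(const std::string& key) {
    inflight.insert(key);
    std::cout << "sync bucket shard " << key << "\n";
  }
};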

So you're right that there's still potential for starvation if we get 20 bucket shards that are continuously busy, and I think it's a good idea to add a ~1000 object limit to bucket sync. We could use something similar to the error_repo to queue these unfinished buckets for later, but there are still some problems we'd have to solve there (a rough sketch of one possible queue follows the list below):

- How to schedule/prioritize between the different datalog sources (datalog notify, error_repo, datalog, and unfinished buckets)?
- If this is a queue with a timestamp as the index, can we prevent duplicate entries?
- Can we remove a bucket shard from this queue if we processed it through another source (like 'datalog notify') already?
- Can we make any guarantees about bounds on its size?
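As a strawman for the dedup questions, here's a minimal sketch of a timestamp-indexed queue with a secondary index by bucket shard key. This is not existing RGW code (the real error_repo is an omap object), PendingBucketQueue is a made-up name, timestamp collisions are glossed over, and the size-bound question is left open:

#include <chrono>
#include <map>
#include <string>
#include <unordered_map>

// pending-bucket queue indexed by timestamp, with a secondary index by
// key so we can drop entries already handled through another source
class PendingBucketQueue {
  using clock = std::chrono::system_clock;
  std::map<clock::time_point, std::string> by_time;           // resume order
  std::unordered_map<std::string, clock::time_point> by_key;  // dedup index

 public:
  // returns false if the bucket shard is already queued (no duplicates)
  bool push(const std::string& bucket_shard) {
    if (by_key.count(bucket_shard)) {
      return false;
    }
    // note: a real version would handle two pushes landing on the same
    // timestamp, e.g. by appending the shard key to the index
    auto t = clock::now();
    by_time.emplace(t, bucket_shard);
    by_key.emplace(bucket_shard, t);
    return true;
  }

  // remove a shard that was synced through another source (like
  // 'datalog notify') before we got to it
  void erase(const std::string& bucket_shard) {
    auto i = by_key.find(bucket_shard);
    if (i != by_key.end()) {
      by_time.erase(i->second);
      by_key.erase(i);
    }
  }

  // pop the oldest entry first, so unfinished buckets can't starve
  bool pop(std::string& bucket_shard) {
    if (by_time.empty()) {
      return false;
    }
    auto i = by_time.begin();
    bucket_shard = i->second;
    by_key.erase(bucket_shard);
    by_time.erase(i);
    return true;
  }
};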

> 3. Another observation: listing bucket bilogs can eat all of the
> connections on the other side in a small cluster; since rocksdb is busy
> with the listing, each connection gets stuck for a long time. Adjusting
> the config eases the problem, but some global control that a command can
> adjust would be better, and I think RGWDataSyncEnv is a good place to do
> that work.

Are you referring to the frontend threads being saturated with sync requests? That is definitely an issue, and it's one of the things we're trying to solve with the beast frontend. It also gives us the ability to do some QoS, which could provide fairness between sync and normal client requests.

Outside of the frontend, there's also more that we could do in multisite to share/reuse/throttle these sync connections instead of generating a new one for each request.
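For illustration only, a shared limit on in-flight sync requests that an admin command could adjust at runtime might look like the sketch below. The real sync code is coroutine-based rather than thread-based, and SyncRequestThrottle is a made-up name, so this is just the shape of the idea:

#include <condition_variable>
#include <cstddef>
#include <mutex>

class SyncRequestThrottle {
  std::mutex mtx;
  std::condition_variable cv;
  size_t max_inflight;  // tunable at runtime
  size_t inflight = 0;

 public:
  explicit SyncRequestThrottle(size_t limit) : max_inflight(limit) {}

  // an admin command could call this to ease pressure on a small cluster
  void set_limit(size_t limit) {
    std::lock_guard<std::mutex> lock(mtx);
    max_inflight = limit;
    cv.notify_all();
  }

  // block until we're allowed to send another sync request
  void acquire() {
    std::unique_lock<std::mutex> lock(mtx);
    cv.wait(lock, [this] { return inflight < max_inflight; });
    ++inflight;
  }

  void release() {
    std::lock_guard<std::mutex> lock(mtx);
    --inflight;
    cv.notify_one();
  }
};

Hanging something like this off RGWDataSyncEnv, as you suggest, would let every connection in the data sync path share one limit.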

> Any advice will be appreciated.
> Thank you!



