Sync Info: Information about the sync entities (for example: meta, data, bucket).

* Rationale

To simplify the sync code and to separate the sync logic from the info
layout. To provide a framework that could enable alternative sync
entity providers.

* Overview

The rgw sync code handles sync of both metadata and bucket data. It
pulls information about the sync entities from 3 separate types of
providers: meta, data (which buckets need to sync), and bucket. Sync
for each of these types is split into two stages: full sync and
incremental sync. Full sync means that all the relevant data is
iterated over: listing of all meta keys, listing of all bucket
instances, listing of all keys in every bucket. Incremental sync means
that the different changes logs are read and the changes are applied.

It follows that we have 6 separate implementations of the core sync
process. They vary in the way sync information is fetched, but repeat
the same general logic: fetch information, apply changes, update
markers, handle errors. There is therefore an opportunity to
consolidate the sync logic into common generic code. At the minimum we
could combine the full sync and incremental sync stages of each entity
type. Other features, like multi-stage bucket sync that supports
resharding, could also benefit from this.

* Details

New module: sync info provider

The sync info provider will be responsible for providing
meta-information about the sync entity (e.g., how many shards in the
different sync stages, the current state of each stage), and will
provide serial sync info by marker. The marker itself will be opaque
to the target and will reflect the sync stage. The sync info provider
will return the list of source entities that need to be synced, and
the target should be able to apply those changes without specific
knowledge of each stage.
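To illustrate the consolidation idea, here is a minimal sketch (names and
interfaces are hypothetical, not the actual rgw classes) of a single generic
sync loop parameterized only by how entries are fetched and applied:

```python
# Hypothetical sketch of a consolidated sync core: the same loop would
# serve full sync and incremental sync for any entity type, differing
# only in the fetch/apply implementations.

class SyncSource:
    """Abstracts where entries come from (meta keys, datalog, bucket index)."""
    def fetch(self, marker, max_entries):
        # returns (entries, next_marker, done)
        raise NotImplementedError

class SyncTarget:
    """Abstracts how a fetched entry is applied locally."""
    def apply(self, entry):
        raise NotImplementedError

def sync_shard(source, target, marker, save_marker):
    """Generic core: fetch entries, apply changes, checkpoint the marker."""
    done = False
    while not done:
        entries, marker, done = source.fetch(marker, max_entries=100)
        for entry in entries:
            try:
                target.apply(entry)
            except Exception:
                # real code would record the entry in an error repo and retry
                continue
        save_marker(marker)  # persist progress so sync can resume here
    return marker
```

The same `sync_shard` body would replace the six near-duplicate loops; only
`SyncSource`/`SyncTarget` subclasses would differ per entity type and stage.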
The info provided will start with the 'full sync' data (e.g., for
metadata sync it will include all the metadata keys), and when that is
exhausted it will continue with the metadata log entries. The
structure of each info entry will be the same whichever stage it
belongs to.

Each stage can have different sharding properties. For example, the
first stage (full sync) can have a single shard, while the second
stage (incremental sync) can have many more shards. When starting the
sync, the target will send an init request that returns a list of
initial markers for each shard in each of the existing stages. The
shard id and the stage id will be embedded within the marker.

Trimming:

We can leverage this system to simplify the trim logic. The sync info
provider could keep a list of all the targets and their corresponding
current synced marker positions (which the targets will provide). The
trimmers could then use that information instead of polling the
targets for their current state. We could get rid of the trimmers'
polling scheme altogether: the sync info providers would maintain a
central in-memory list of trimming targets that would periodically be
trimmed. (Need to consider backward compatibility.)

The sync info providers should run at the source zone, and a RESTful
API should be created to access their functionality. They should run
at the source so that sync info about targets can be aggregated (for
trimming purposes). However, we should create a target-side interface
that provides a functional interface for the sync code. We can create
alternative target-side implementations for managing sources that do
not support this API (e.g., for backward compatibility and other
non-rgw sources). A sync info provider client wrapper can be created
to enable cases like full sync of metadata, where the source info is
not sharded.
The generic wrapper would fetch the data and store it in temporary
queues (as the current full metadata sync does), so that the sync
process itself could run concurrently on multiple shards. It should be
transparent to the caller; from the caller's point of view it would be
fetching data from a sharded source.

sync_info_init:
  input: {
    my_id
    entity_id
  }
  output: {
    sync_info_id
    stages[] = {
      stage_id
      num_shards
      markers[] = {
        string marker
      }
    }
  }

sync_info_fetch:
  input: {
    sync_info_id
    marker
    max_entries
    optional: sync_position (for trimming)
  }
  output: {
    status = { have_more | done | stage_done | who_are_you }
    entries[] = {
      marker_id
      info (depending on entity type)
    }
  }

* The entry info will depend on the entity type. It will usually
  include a timestamp and other type-specific fields. For example, the
  bucket entity type would include an op field that data entities
  would not have.
* When the data of a specific sync stage is exhausted, the status will
  reflect it.
* The source might decide to remove a target from its list of targets
  if that target hasn't contacted it for a while. This can happen if a
  target went down, and is needed to allow the source to trim its
  logs. If the target returns, the source will send an error message
  that requires the target to re-initialize its sync process.
* If there is no more data in the current stage, the stage_done status
  will be returned. The target will only start working on the next
  stage after all the current stage's shards are complete. When all
  shards are complete, the sync process should initiate sync on all
  the shards of the new stage.
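The status handling above can be sketched as a per-shard client loop. This
is a hypothetical rendering (the provider object and its `fetch` signature
are assumptions for illustration), but the four statuses and their handling
follow the description above:

```python
# Hypothetical target-side loop driving sync_info_fetch for one shard.
# Statuses mirror the documented ones: have_more, done, stage_done,
# who_are_you.
HAVE_MORE, DONE, STAGE_DONE, WHO_ARE_YOU = range(4)

def run_shard(provider, sync_info_id, marker, apply_entry):
    while True:
        status, entries = provider.fetch(sync_info_id, marker, max_entries=100)
        for e in entries:
            apply_entry(e["info"])
            marker = e["marker_id"]  # advance past each applied entry
        if status == WHO_ARE_YOU:
            # the source dropped us as a target; caller must re-init sync
            return "reinit", marker
        if status == STAGE_DONE:
            # this shard is exhausted; wait for the other shards, then
            # the caller transitions to the next stage
            return "stage_done", marker
        if status == DONE:
            return "done", marker
        # HAVE_MORE: keep fetching with the advanced marker
```

The caller would run one such loop per shard and only move to the next
stage once every shard has returned "stage_done".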
If the target does not have any information about the next stage
(e.g., after a bucket reshard), it will query the source for that
information:

sync_info_update_position:
  input: {
    sync_info_id
    sync_position_marker
  }
  output: {
    status
  }

sync_info_next_stage_info:
  input: {
    cur_stage_id
  }
  output: {
    status
    next_stage_id
    num_shards
    markers[] = {
      string marker
    }
  }

Note that the returned markers are the markers needed for
transitioning to the next stage. The markers returned when
transitioning from full sync to incremental sync reflect the maximum
log positions at the time the sync started. The markers returned when
transitioning between different incremental sync stages (e.g.,
different reshard generations) are the minimum log positions (or even
an empty position) of the next generation.

* Development

Initial work & Metadata Sync

  * Source-side sync info provider core
    * define functional interfaces
      * abstract SyncInfoProvider
      * abstract SIPEntity
    * Control
      * store/read target state
    * marker tools
  * First implementation
    * create provider for metadata
      * SyncInfoProvider_Meta
      * SIPEntity_Meta
    * radosgw-admin to control SyncInfoProvider hooks
    * REST api
  * Target-side core
    * define abstract SIPClient
    * implement SIPClient_REST
    * Coroutines implementation
    * radosgw-admin to control SIPClient hooks
  * Meta sync
    * modify sync init
    * convert incremental sync to use SIP
    * stage transition
    * remove full sync
  * Source-side trimming module
    * generic core
    * implement mdlog trimmer

TODO:
  - backward compatibility plan
  - trimming in mixed versions environment?
  - radosgw-admin sync status
  - testing

Data & Bucket Sync

  * Data Sync
    * source-side SyncInfoProvider_Data
    * convert sync code similar to meta sync
  * Bucket Sync
    * SyncInfoProvider_BucketInstance
    * bucket instance sync code
  * Optional
    * common sync core
    * modify all implementations to use common core

Yehuda