We discussed revamping the rgw multi-site feature. This includes both simplifying the whole configuration process and reimplementing the underlying sync process. It will also support active-active zone configuration, so multiple zones within the same zonegroup (formerly known as 'region') will be writable. The following covers the configuration changes and the implementation details that we discussed recently.

1. Configuration Changes

1.1. Zonegroup map

The zonegroup map holds a map of the entire system, and certain configurables for the different zonegroups and zones. It holds the relationships between the different zonegroups and other configuration:

For the entire map
 - which zonegroup is the master

For each zonegroup
 - access url[s]
 - existing storage policies

For each zone
 - id, name
 - access url[s]
 - peers

In the new configuration scheme, the master zonegroup will be in control of the zonegroup map. In order to make a change to the system configuration, a command will be sent to the url of the master zonegroup, and the new configuration will propagate to the rest of the system. rgw will be able to handle dynamic changes to the zonegroup and zone configuration.

One zone in the master zonegroup will be designated as the master, and will manage all user and bucket creation.

The zonegroup map will have a version epoch that will increment after every change.

1.2. Defining a new zonegroup

Currently, in order to define a new zonegroup, we need to inject a json that holds the zonegroup configuration, then update the zonegroupmap, and then distribute that zonegroupmap to all existing zonegroups and restart all rgws for it to take effect. I don't think this is a good scheme.

A zonegroup will have a zonegroup id and a zonegroup name. For backward compatibility, older zonegroups will have their zonegroup_id equal to their name.

When setting up a new zonegroup, we'll need to specify an entry point for the 'master' zonegroup.
That zonegroup will be in control of the zonegroupmap, and it will distribute the zonegroupmap updates to all zones. If the zonegroup that we set up is the first zonegroup, we'll need to specify that on the command line. We won't be able to set up a secondary zonegroup if the master has not been specified.

1.3. Defining a new zone

Currently, when running an rgw it does the following: it reads the rgw_zone configurable and checks the root pool for the configuration of this zone. If rgw_zone is not defined it will read the default zone name out of the [...] it will create the 'default' zone, and assign it as the default.

Once a zone name has been set, it cannot really be changed. The zone names are embedded in the rados object names that are created to hold the actual rgw objects. In order to support zone renaming and more dynamic configuration, we should create a logical 'zone id' that the zone name will point at. The zone id will be a string. When creating a new zone it will be auto-generated, and will not be modified afterwards. For backward compatibility, older zones will have a zone_id that matches their zone name.

To set up a new zone, the rgw command will include the url of the master zonegroup, and keys to access it. It will also include the name of the zonegroup this zone should reside in. If this zonegroup does not exist, it will be created (if the appropriate param was passed in). The master zonegroup will create a new system user for this specific zone, and will send it back.

When a new zone starts up, we'll auto-create all the rados pools that it will use. It will first need to determine whether the pools already exist and are already assigned to a different zone. The naming scheme for the pools would be something like:

  .{zone_id}-x-{pool-name}

1.4. Dynamic zonegroup and zone changes

rgw will be able to identify changes to the zonegroupmap, and to the zone configuration.
This will be done by the following:

 - rgw will be able to restart itself with a new rados backend handler (RGWRados) after detecting that a configuration change has been made. It will finish handling existing requests, but restart all the frontend handlers with the new RGWRados config.
 - rgw will set a specific watch/notify handler that will be used to get updates about the zonegroupmap configuration.
 - Upon receiving a change, the master zonegroup zone will send a message to all the different zonegroups about the new configuration change.
 - Any synchronization activity will be dynamically reset according to the new configuration.

1.5. New RESTful apis

1.5.1. Initialize new zone

Will be sent by the config utility (probably radosgw-admin) to the master zonegroup.

  POST /admin/zonegroup?init-zone

Input: a JSON representation of the following:
 - zonegroup name
 - zone name
 - zone id
 - list of peers (zone ids)

Output: a JSON representation of the following:
 - metadata of user to be used by zone
 - new zonegroup map

1.5.2. Notify of zonegroup map change

  POST /admin/zonegroup?reconfigure

Input:
 - new zonegroup map

1.6. New radosgw-admin, radosgw interfaces

1.6.1. Init new zonegroup

  $ radosgw-admin zonegroup init --zonegroup=<name> [--master | --master-url=<url>]

When doing a remote command that contacts the master zonegroup, we'll also need to provide a uid and access key. This can be done by specifying --uid and --access-key on the command line (which is a bit of a security problem), or by setting them in ceph.conf (which is a bit of a pain).

1.6.2. Init a new zone

  $ radosgw-admin zone init --rgw-zone=<zone_name> --zonegroup=<zonegroup_name> --url=<zone url> [--master | --master-url=<url>]

This command will either set the initial master zone for the system, or will create a new zone.

Optionally we can create a new zone implicitly by running radosgw against a non-existing zone, and specifying either --master or --master-url=...
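
To make the zone init flow a bit more concrete, here is a rough Python sketch of the request that a `zone init` run against a remote master might send to the init-zone api from 1.5.1. The JSON key names, the helper name build_init_zone_request and the sample zone id are all made up for illustration; the proposal only lists which pieces of information the request carries, not the wire format.

```python
import json
from urllib.parse import urljoin


def build_init_zone_request(master_url, zonegroup, zone_name, zone_id, peers):
    """Return (method, url, body) for the proposed init-zone call.

    The key names below are assumptions, not part of the proposal.
    """
    body = {
        "zonegroup": zonegroup,   # zonegroup name
        "zone_name": zone_name,   # zone name
        "zone_id": zone_id,       # zone id (auto-generated at zone creation)
        "peers": list(peers),     # list of peer zone ids
    }
    base = master_url.rstrip("/") + "/"
    url = urljoin(base, "admin/zonegroup") + "?init-zone"
    return "POST", url, json.dumps(body, sort_keys=True)


# Hypothetical values, loosely following the us-west example in 1.7.
method, url, body = build_init_zone_request(
    "http://us-west-1.example.com",
    zonegroup="us-west",
    zone_name="us-west-2",
    zone_id="zone-id-2",
    peers=["zone-id-1"],
)
print(method, url)
print(body)
```

The response (user metadata plus the new zonegroup map) is not modelled here, since its layout is still open.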
1.6.3. Modifying zone configuration

 - Connect zone to another peer

     $ radosgw-admin zone modify [--rgw-zone=<zone name>] --connect=<peer name>

 - Disconnect zone from another peer

     $ radosgw-admin zone modify [--rgw-zone=<zone name>] --disconnect=<peer name>

 - Configure a zone placement target (storage policy)

     $ radosgw-admin placement modify --placement-target=<name> ... (TBD what exactly)

 - Check zone sync status

     $ radosgw-admin sync status [--rgw-zone=<zone name>]

   Will provide current markers and timestamps for the specified zone.

1.7. A usage example

Setting up two zonegroups, with two zones in each:

  Zonegroup: us-west
    Zone: us-west-1 (ceph cluster 1)
     - url: http://us-west-1.example.com
    Zone: us-west-2 (ceph cluster 2)
     - url: http://us-west-2.example.com

  Zonegroup: us-east
    Zone: us-east-1 (ceph cluster 2)
     - url: http://us-east-1.example.com
    Zone: us-east-2 (ceph cluster 3)
     - url: http://us-east-2.example.com

 - In ceph cluster 1:

     $ radosgw-admin zonegroup init --zonegroup=us-west --master --url=http://us-west-1.example.com
     $ radosgw-admin zone init --rgw-zone=us-west-1 --zonegroup=us-west --url=http://us-west-1.example.com
     $ radosgw --rgw-zone=us-west-1

 - In ceph cluster 2:

     $ radosgw-admin zone init --rgw-zone=us-west-2 --zonegroup=us-west --url=http://us-west-2.example.com --master-url=http://us-west-1.example.com
     $ radosgw --rgw-zone=us-west-2
     $ radosgw-admin zonegroup init --zonegroup=us-east --url=http://us-east-1.example.com --master-url=http://us-west-1.example.com
     $ radosgw-admin zone init --rgw-zone=us-east-1 --zonegroup=us-east --url=http://us-east-1.example.com --master-url=http://us-west-1.example.com
     $ radosgw --rgw-zone=us-east-1

 - In ceph cluster 3:

     $ radosgw-admin zone init --rgw-zone=us-east-2 --zonegroup=us-east --url=http://us-east-2.example.com --master-url=http://us-west-1.example.com
     $ radosgw --rgw-zone=us-east-2

Note that these commands don't include the access keys to access the master zone.
This will also need to be set, either through the command line, or via ceph.conf.

1.8. Optional simplification

Instead of creating a zone and running radosgw, we can do it in one step via radosgw itself, e.g.:

  $ radosgw --rgw-zone=us-west-1 --zonegroup=us-west --init-zone --url=http://us-west-1.example.com

We can do the same for the zonegroup creation, so that every zone + zonegroup creation can be squashed to a single radosgw command.

2. New multizone implementation details

Here's the new sync scheme that we discussed. Note that it's very similar to the old scheme, but it adds a push notification. It does not specify how concurrency between multiple workers will be achieved, but there are a few ways to implement that:
 - the same as with the old sync agent (lock shards)
 - have a single elected worker per zone (use watch/notify for election)
 - use watch/notify to sync work
 - specify workers manually
 - potentially other solutions

Note that this is going to be implemented as part of the gateway, which gives us more flexibility in how to leverage rados to store the sync state. Cross-zone communication will still be done using a RESTful api.

The idea is to work on roughly the same premise as before. We'll have 3 logs: metadata log, data log, bucket index log. We'll add push notifications to make changes appear quicker on the destination. The design supports active-active zones and a federated architecture.

2.1. Multi-zonegroup, multi-zone architecture

There is still only a single zone that is responsible for metadata updates. This zone is called the 'master' zone, and every other zone needs to make metadata changes against it. Each zonegroup can have multiple zones. Each zone can have multiple peer zones, but not necessarily all the zones within that zonegroup. It is required, however, that there is a path between all the zones in the zonegroup (a connected graph).

  zonegroup:
    name
    is_master?
    master zone
    list of zones

  zone:
    containing zonegroup
    list of peers
    zone endpoints

Each bucket instance within each zone has a unique incrementing version id that is used to keep track of changes on that specific bucket.

A zone keeps a sync state of where it is synced with regard to all its peers. A zone keeps a metadata sync state against the master zone.

  zone_data_sync_status:
    state: init, full_sync, incremental
    list of bucket instance states

  bucket_instance_state:
    full_sync (keep start_marker+position) | incremental (keep position)
    list of object retries

The idea is that if we're doing a full sync of the bucket, we need to keep the source zone bucket index position, so that later on we'll catch all changes that went in since we started full syncing this bucket. We also keep the position of where we are in the full sync (what object we last synced). Also, before starting the full sync, we need to keep the state in the data (changed buckets) log.

When we're at the incremental stage, we need to keep the bucket index position. We follow the data log and sync each bucket instance that changed there. Also, for every failed object sync we need to keep a retry entry.

zone data sync stages:

  init:
    Fetch the list of all the bucket instances and keep them in a sharded sorted list

  sync:
    for each bucket
      if bucket does not exist, fetch bucket, bucket.instance metadata from master zone
      sync bucket

Also, we need to keep a list of all the buckets that have objects that need to be resent.

Metadata sync:

Similar to the data sync:

  metadata_sync_status:
    state: init, full_sync, incremental

  At the init state: keep the position of the metadata log. List all the metadata entries that exist and keep them in a sharded sorted list.
  Full sync: for each entry in list, sync (fetch and store).
  Incremental: follow changes in metadata log, store changes

Status inspection:

Provide the status of each zone, as a difference with regard to its peers (e.g., mtime of oldest non-synced change).

Push notifications:

A zone will send changes as they happen to all its connected peers. It will either send each change individually, or accumulate a few changes over a period of time and then send them together. These are just hints for the peers so that they can get the changes quicker; if they are missed, the changes will be picked up by the zones later through their regular sync process. The notifications will be done using a POST request between the source zone and the destination zone.

2.2. Active-active considerations

Each change has a 'source zone' assigned to it. A change will not be applied if the dest zone's version mtime is greater or equal.
 - we should keep a higher-precision mtime as an object attribute; the stat() mtime only uses seconds, which is problematic
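
As an illustration of the active-active rule (skip a change when the destination's mtime is greater or equal), here is a small Python sketch. The ObjectVersion type, the mtime_ns field and the should_apply helper are hypothetical names invented for this example; the proposal only says that a higher-precision mtime should be kept as an object attribute.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObjectVersion:
    source_zone: str  # zone where the change originated
    mtime_ns: int     # assumed high-precision mtime, in nanoseconds


def should_apply(incoming: ObjectVersion, local: Optional[ObjectVersion]) -> bool:
    """Apply the incoming change only if the local copy is strictly older."""
    if local is None:
        return True  # nothing stored yet on the destination
    return local.mtime_ns < incoming.mtime_ns


# Two writes to the same object, 400ms apart, from different zones.
first = ObjectVersion("us-west-1", 1_000_000_000)
second = ObjectVersion("us-west-2", 1_400_000_000)

print(should_apply(second, first))  # True: local copy is older
print(should_apply(first, second))  # False: local copy is newer
print(should_apply(first, first))   # False: equal mtime, change is skipped

# With second-granularity mtime both writes look identical, so the later
# write would be skipped as well.
print(second.mtime_ns // 10**9 > first.mtime_ns // 10**9)  # False
```

The last line shows why second-granularity mtime is a problem: both writes fall into the same second, so the rule can no longer tell which one is newer.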