Currently it is assumed that Ceph clusters are concentrated in a single geographical location. Thus the RADOS gateway, which leverages the Ceph object store (RADOS) as its backend, is limited to a single location. There are two main issues that we would like to solve:

- Disaster Recovery
  The ability to have a solution for disaster scenarios.

- Different Geographical Locations
  This has two advantages: data can be located in distinct locations, which reduces the risk of losing all data due to unfortunate events, and users can store data closer to where they are located, which can mean faster and better service.

The following describes these features and their implementation. Comments and questions are welcome as always.

Yehuda

Terminology
===========

- zone
  A unique set of pools within a single Ceph cluster.

- region
  A set of one or more zones.

- master region
  The root region, the one where user metadata modifications take place.

- master zone
  The primary zone in a region, where all writes go.

- backup/slave zone
  A zone which asynchronously receives copies of buckets from the master zone.

Description
===========

In a geographically dispersed object storage system, data may be put in buckets that are located in different regions. A bucket is created in a specific region, and all data created under this bucket (that is, its objects) will reside in that region. Users can create buckets in any of the regions, and a user's list of buckets will not be tied to a specific region. User metadata (which also includes the list of buckets) will be maintained in a single region and replicated to all the other regions.

Different zones within a specific logical region will be used for disaster recovery. Data will be replicated from the master zone to the backup zones asynchronously.

Each zone contains the user metadata pool(s) and one or more data pools. Each rgw instance controls a single zone instance, and there can be multiple rgw instances per zone. Multi-tenancy can be achieved within a single Ceph cluster by defining different sets of metadata and data pools.

A region map will describe the different regions and the zones that form them:

region:
 - name
 - list of zones within the region
 - master zone name
 - status (master / regular) [whether it controls user metadata]

zone:
 - name
 - endpoint URL(s)
 - status (master / slave)

The region map will be configured by the admin and will be installed on all zones. Role changes (e.g., moving a zone from a master to a slave role) will be achieved by updating the region map.

Bucket metadata now also includes the region where the bucket resides. When a bucket is created in any region other than the master region, the creation request will be forwarded to the master region. Metadata will be replicated from the master region (not zone) to the other regions.

New radosgw-admin commands will allow configuring and retrieving the region map.
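To make the above concrete, here is a minimal sketch of what an admin-configured region map might look like, written as a Python structure. The field names (regions, is_master, master_zone, endpoints, status) are illustrative assumptions, not the final radosgw format::

  # Illustrative region map: two regions, the first being the master
  # region that owns user metadata. Field names are assumptions and do
  # not reflect the actual radosgw-admin format.
  region_map = {
      "regions": [
          {
              "name": "us-east",
              "is_master": True,            # master region: user metadata lives here
              "master_zone": "us-east-1",
              "zones": [
                  {"name": "us-east-1", "status": "master",
                   "endpoints": ["http://rgw1.example.com"]},
                  {"name": "us-east-2", "status": "slave",
                   "endpoints": ["http://rgw2.example.com"]},
              ],
          },
          {
              "name": "eu-west",
              "is_master": False,
              "master_zone": "eu-west-1",
              "zones": [
                  {"name": "eu-west-1", "status": "master",
                   "endpoints": ["http://rgw3.example.com"]},
              ],
          },
      ],
  }

  def master_region(rmap):
      """Return the region that controls user metadata."""
      return next(r for r in rmap["regions"] if r["is_master"])

Since the same map is installed on every zone, promoting a backup zone (or moving the master region) amounts to editing this map and pushing it out to all zones.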
RESTful APIs
============

A set of RESTful APIs will be defined to implement the geo-replication and disaster recovery features. The APIs will expose the newly developed functionality. Zones will interact with each other only through the RESTful API, via their defined endpoints.

Operations like object copy will now need to handle the case of reading/writing from/to a different region.

A new 'service' user role/cap will be added that will allow one rgw instance to interact with another.

User Metadata Replication
-------------------------

User metadata is naturally updated less frequently than user data. User metadata will be modified only on the designated master region.

Any user metadata change will be synchronously written to a metadata-changes log. The log will contain a list of changes indexed by a change id that increases monotonically. The log will be striped across multiple log objects, each with its own unique id.

A remote site that fetches changes will specify the different log versions it has last seen (if any). The returned response will be a list of changes and the corresponding log ids and log versions for each change.

Change records will consist of:

- log id
- log version
- resource type (user / bucket)
- resource operation + specific operation data (e.g., bucket created, user key removed, etc.)

Log operations should be idempotent.

A set of RESTful admin API requests will allow requesting the complete list of users and their corresponding metadata. The new API will consist of the following:

- a request to list all existing users
- a request to retrieve user info + list of buckets for requested users / all users
- a request to retrieve the list of changes since a specific logs version
- a request to retrieve the current logs version

First synchronization of user metadata to a remote region will be done as follows:

- read the logs versions
- retrieve user info for all users, apply locally
- read any new changes and apply them

A metadata replication agent will run on the remote master zones (the master zones of the other regions) and on the slave zones. The remote master zones will follow the master zone of the master region; the slave zones will follow their own master zones. The replication agent will periodically query for new changes and apply them locally.

Disaster Recovery Implementation
================================

The following discusses a disaster recovery implementation.

* Master, backup zones

A backup (slave) zone follows a master zone. It is intended that clients will only access the master zone. Read-only access to the slave will be possible; however, data in the slave may be outdated. It is possible to switch the master and backup settings of a zone.

The idea is to have a primary zone and one or more backup zones following it. Updates will be logged per bucket on the primary, and the list of modified buckets will also be logged. Backup zones will poll the list of modified buckets and will then retrieve the changes per bucket. The changes will then be applied on the backup zone.

As stated, this is a disaster recovery solution intended to provide a safety net against complete data loss. It does not guarantee zero data loss, as the latest data that has not yet been transferred to the slaves may be lost.

Master
------

Bucket index log
^^^^^^^^^^^^^^^^

The 'bucket index log' will keep track of modifications made in the bucket (objects uploaded, deleted, modified). The following modifications will be made to the bucket index:

* bucket index version

  The bucket index will keep an index version that will increase monotonically.

* log every modify operation

  The bucket index will now log every modify operation. An additional index version entry will be added to each object entry in the bucket index.

* list objects objclass operation returns more info

  The list objects objclass operation will also return the last index version, as well as the version of each object (the bucket index version when it was created). This will allow radosgw to retrieve the entire list of objects in a bucket in parts, and then retrieve all the changes that happened since starting the operation. It is required so that we can do a full sync of the bucket index.

* operation to retrieve bucket index log entries

  A new objclass operation will retrieve bucket index log entries. It will get a starting event number and a max number of entries.

* operation to trim bucket index log entries

  A new objclass operation will remove bucket index log entries. It will get a starting event number (0 - start from the beginning) and a last event number. It will not be able to remove more than a predefined number of entries.

A new set of RESTful API calls will be implemented within rgw and will export the new bucket index log functionality.
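As a rough illustration of how a remote zone might consume the versioned listing together with the bucket index log, here is a hedged Python sketch. The client calls (list_objects_versioned, get_bucket_index_log, copy_object, apply_log_entry, set_last_master_version) merely stand in for the RESTful calls described above; their names and return shapes are assumptions::

  def sync_bucket(master, slave, bucket):
      """Full sync of one bucket, then catch up from the index log."""
      marker = None
      start_ver = None
      while True:
          page = master.list_objects_versioned(bucket, marker=marker)
          if start_ver is None:
              # bucket index version at the time the listing started
              start_ver = page["index_version"]
          for obj in page["objects"]:
              # copy object data master -> slave, preserving its tag
              slave.copy_object(bucket, obj["name"], tag=obj["tag"])
          if not page["is_truncated"]:
              break
          marker = page["next_marker"]

      # Replay index log entries recorded since the listing started, so
      # writes that raced with the listing are not lost. Log operations
      # are idempotent, so replaying an already-applied change is harmless.
      last_ver = start_ver
      for entry in master.get_bucket_index_log(bucket, start=start_ver):
          slave.apply_log_entry(bucket, entry)
          last_ver = entry["ver"]

      # Remember the master index version we reached; the next incremental
      # pass resumes from here.
      slave.set_last_master_version(bucket, last_ver)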
Updated Buckets Log
^^^^^^^^^^^^^^^^^^^

A log that contains the list of buckets modified within a specific period. The log info may be spread across multiple objects; this will be done similarly to what we did with the usage info and with the garbage collection. Each bucket's data will go to a specific log object (by hashing the bucket name, modulo the number of objects). The log will use omap to index entries by timestamp and by a log id (monotonically increasing), and will be implemented as an objclass.

* log resolution

  We'll define a log resolution period. In order to avoid sending an extra write to update this log for every modification, we'll define a time length for which a log entry is valid. Any update that completes within that period will only be reported once in the log.

  A bucket modification operation will not be allowed to complete before a log entry (with the bucket name) has been appended to the updated buckets log within the past ttl (the cycle in which the operation completes). That means that the first write/modification to the bucket will have to send an append request, and all bucket modification operations that start before its completion (and within the same log cycle) will have to wait for it. The radosgw will hold a list of all the buckets that were updated in the past two cycles (within the specific instance), and every cycle it will log these entries in the updated buckets log.

Backup (Slave)
--------------

* Bucket index log

  The bucket index log will also hold the last master version.

* Processing state

  The sync processing state will be kept in a log. This will include the latest updated buckets log id that was processed successfully.

* Full sync info

  A list that contains the names of the buckets that require a full sync. It will also be spread across multiple objects.

Full Sync of System
^^^^^^^^^^^^^^^^^^^

Does the following::

  - retrieve the list of all buckets, update the full sync buckets list,
    start processing

Processing a single 'full sync list' object (see the sketch below)::

  - if successfully locked object then:
    - (periodically, potentially in a different thread) renew lock
    - for each bucket:
      - list objects (keep the bucket index version retrieved on the
        first request to list objects)
      - for each object we get object name, version (bucket index
        version), tag
      - read objects from master (*), write them to local (slave)
        cluster (keep object tag)
      - when done, update local bucket index with the bucket index
        version retrieved from master
    - unlock object

  (*) we should decide what to do in the case of tag mismatch
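The lock/renew pattern above could look roughly like the following sketch, which reuses the hypothetical sync_bucket() from the earlier example; the shard object's lock(), renew_lock(), unlock(), buckets() and remove() methods are assumptions standing in for the objclass lock and omap operations::

  import threading

  def process_full_sync_shard(master, slave, shard):
      """Process one 'full sync list' object (shard)."""
      if not shard.lock():              # another worker owns this shard
          return

      stop = threading.Event()

      def keep_lock_alive():
          # Renew the lock periodically so a long-running sync keeps it.
          while not stop.wait(shard.lock_ttl / 2):
              shard.renew_lock()

      renewer = threading.Thread(target=keep_lock_alive, daemon=True)
      renewer.start()
      try:
          for bucket in shard.buckets():
              sync_bucket(master, slave, bucket)   # per-bucket full sync, as above
              shard.remove(bucket)                 # bucket no longer needs a full sync
      finally:
          stop.set()
          renewer.join()
          shard.unlock()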
Continuous Update
^^^^^^^^^^^^^^^^^

A process that does the following::

  - try to set a lock on an updated buckets log object
  - if succeeded then:
    - read next log entries (but never read entries newer than
      current time - updated buckets log ttl)
    - for each bucket in the log:
      - fetch bucket index version from local (secondary) bucket index
      - request a list of changes from remote (primary) bucket index,
        starting at the local bucket index version
      - if successful (remote had the requested data):
        - update local data
      - if not successful:
        - add bucket to list of buckets requiring full sync
    - renew lock until done, then release lock
    - continue with the next log entry

We still need to be able to fully sync buckets that need to catch up, so we also do the following (in parallel)::

  - for each object in the full sync list:
    - periodically check the list of buckets requiring full sync
    - if not empty:
      - for each bucket: full sync bucket (as specified above), remove
        bucket from list

Replication Agent
^^^^^^^^^^^^^^^^^

The replication agent will be a new piece of software that will manage bucket synchronization from the master zone to the slave zones. The replication agent will read the changes on the master zone and apply them on the slave zones. It is possible that the replication agent will not read/write any actual user data, but rather send a copy command to the slave rgw instance.

Note that the replication agent will use the new RESTful API calls and will not use rados calls directly.
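To summarize how these pieces might fit together in the agent's continuous-update loop, here is a hedged Python sketch against one updated-buckets-log shard. All of the master/slave client calls are placeholders for the RESTful admin APIs described above, and their exact names and semantics (e.g. get_bucket_index_log() returning None when the log has been trimmed past our position) are assumptions::

  import time

  def continuous_update(master, slave, shard_id, log_ttl, poll_interval=30):
      """Follow one updated-buckets-log shard and apply changes locally."""
      while True:
          if slave.lock_log_shard(shard_id):        # one worker per shard
              try:
                  # Never read entries newer than (now - log ttl): the current
                  # cycle may still be accumulating modifications.
                  horizon = time.time() - log_ttl
                  pos = slave.get_processed_position(shard_id)
                  for entry in master.read_updated_buckets_log(shard_id,
                                                               start=pos,
                                                               before=horizon):
                      bucket = entry["bucket"]
                      local_ver = slave.get_last_master_version(bucket)
                      changes = master.get_bucket_index_log(bucket, start=local_ver)
                      if changes is not None:
                          # The master still has the entries we need.
                          for change in changes:
                              slave.apply_log_entry(bucket, change)
                      else:
                          # Log trimmed past our position: schedule a full sync.
                          slave.add_to_full_sync_list(bucket)
                      slave.set_processed_position(shard_id, entry["id"])
                      slave.renew_log_shard_lock(shard_id)
              finally:
                  slave.unlock_log_shard(shard_id)
          time.sleep(poll_interval)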