Motivation:
When a bucket is removed on the master zone in a multisite
configuration, its bucket instance and index objects are never removed.
Only the bucket entrypoint is removed, which makes the bucket
unreachable. We keep the instance and indices because the peer zones may
not have finished processing all of the object removal entries for that
bucket, and they can't make progress if the master zone is unable to
serve the bucket instance metadata or bucket index logs. Removing them
too early would leave objects/data leaked on the peer zones.
Tracker issue: http://tracker.ceph.com/issues/20802
Design:
In short, the master zone will actually delete its bucket instance and
index objects in RGWRados::delete_bucket(), and peer zones will learn to
deal with it during sync.
metadata sync:
When the master zone deletes a bucket instance, it writes that bucket
instance to the metadata log. When another zone sees this entry during
metadata sync, it will:
1) set a new 'remove-only' flag on each of its bucket index objects,
2) write an entry for each bucket index shard to a new 'bucket gc log'
(described below), and
3) remove its local copy of the bucket instance.
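
The three steps above can be sketched with a small in-memory model. This
is only an illustration of the ordering, not RGW code: PeerZone,
IndexShard and handle_instance_removal() are hypothetical names, and
skipping duplicate log entries on replay is an assumption of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class IndexShard:
    # hypothetical stand-in for one bucket index object
    remove_only: bool = False
    objects: set = field(default_factory=set)


@dataclass
class PeerZone:
    # hypothetical in-memory model of a non-master zone
    instances: set = field(default_factory=set)        # bucket instance ids
    index_shards: dict = field(default_factory=dict)   # (instance, shard) -> IndexShard
    bucket_gc_log: list = field(default_factory=list)  # (instance, shard) entries


def handle_instance_removal(zone, instance_id):
    """Apply a 'bucket instance removed' mdlog entry on a peer zone."""
    keys = sorted(k for k in zone.index_shards if k[0] == instance_id)
    # 1) flag every index shard as remove-only
    for key in keys:
        zone.index_shards[key].remove_only = True
    # 2) queue every shard for bucket gc (no duplicates if the entry is replayed)
    for key in keys:
        if key not in zone.bucket_gc_log:
            zone.bucket_gc_log.append(key)
    # 3) drop the local copy of the bucket instance
    zone.instances.discard(instance_id)
    return keys
```

Removing the instance last means that a crash partway through leaves the
instance in place, so the same log entry can simply be applied again.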
bucket sync:
When bucket sync discovers that its bucket has been removed (either from
getting ENOENT when trying to fetch the bucket instance metadata from
the master zone, or from seeing the 'remove-only' flag on the local
bucket index when trying to sync an object), it deletes its sync status
object and stops processing that bucket (using similar logic to the
'bucket sync disable' feature in development at
https://github.com/ceph/ceph/pull/15801). If the data changes log
triggers another attempt to sync this bucket shard, it will try to fetch
its bucket instance metadata from the master zone and fail with ENOENT.
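
A rough sketch of that decision, again as a toy model rather than the
actual sync coroutines. sync_bucket_shard() and its arguments are
hypothetical, and the sync status is reduced to a dict entry:

```python
def sync_bucket_shard(master_instances, remove_only_shards, sync_status,
                      instance_id, shard_id):
    """Return 'stopped' if the bucket was removed, else 'synced'.

    master_instances:   set of instance ids the master zone can still serve
    remove_only_shards: set of local (instance, shard) keys flagged remove-only
    sync_status:        dict standing in for the per-shard sync status objects
    """
    key = (instance_id, shard_id)
    # fetching the instance metadata from the master would fail with ENOENT
    removed_on_master = instance_id not in master_instances
    # or the local bucket index already carries the 'remove-only' flag
    removed_locally = key in remove_only_shards
    if removed_on_master or removed_locally:
        sync_status.pop(key, None)  # delete the sync status object
        return "stopped"            # and stop processing this bucket shard
    sync_status.setdefault(key, "incremental")
    return "synced"
```

A later retry triggered by the data changes log takes the same path: the
instance is still missing on the master, so the call returns 'stopped'
again without recreating the sync status.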
bucket gc:
Since we stop trying to sync deleted buckets, they may still contain
objects. All objects need to be deleted before we can remove the bucket
index objects themselves. This process is deferred to a background
thread using a new 'bucket gc log'. Object deletion will keep the bucket
index consistent so that the gc worker can resume its progress across
radosgw restarts. The 'remove-only' flag on the bucket index will allow this
deletion, while preventing bucket sync from adding new objects. Once all
objects are removed from each bucket index, the bucket index object can
be safely deleted and its entry trimmed from the bucket gc log. The
bucket gc worker thread will process the first N entries in parallel,
where N is configurable with a default of ~16. A rados lock on the log
object will prevent other gateways in the zone from duplicating the work
(similar to DataLogTrimPollCR for datalog trimming).
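
One pass of the gc worker could look roughly like the following sketch.
It is a toy model under stated assumptions: a threading.Lock stands in
for the rados lock on the log object, a thread pool stands in for the
parallel processing, and gc_shard() and gc_pass() are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def gc_shard(index_shards, key):
    """Delete every object in one index shard, then the shard itself."""
    shard = index_shards.get(key)
    if shard is not None:  # may already be gone if a previous pass was interrupted
        while shard["objects"]:
            shard["objects"].pop()  # one object at a time; the index stays consistent
        del index_shards[key]       # index is empty, so it can be removed
    return key


def gc_pass(index_shards, gc_log, log_lock, n=16):
    """Process the first n entries of the bucket gc log; return how many were trimmed."""
    if not log_lock.acquire(blocking=False):
        return 0  # another gateway holds the lock and is doing this work
    try:
        batch = gc_log[:n]
        with ThreadPoolExecutor(max_workers=n) as pool:
            done = list(pool.map(lambda k: gc_shard(index_shards, k), batch))
        for key in done:
            gc_log.remove(key)  # trim the entry only after the index is gone
        return len(done)
    finally:
        log_lock.release()
```

Trimming the log entry only after the index object is deleted means an
interrupted pass is simply repeated for that entry on the next run.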
Casey