Hi rgw folks, this is a rough design for cleanup of deleted buckets in
multisite. I would love some review/feedback.
Motivation:
- Bucket deletion in a multisite configuration does not delete
bucket instance metadata, bucket sync status, or bucket index objects on
any zone. This allows bucket sync on each zone to finish processing
object deletions and (hopefully) converge on empty, but it leaves those
objects behind after the bucket is gone.
Requirements:
- Remove all objects associated with deleted buckets in a timely
manner:
  - bucket instance metadata, bucket index shards, and bucket
    sync status
  - all object data
- Must not rely on bucket sync to delete all objects [zone A may
delete an empty bucket that hasn't yet synced objects from zone B, so
the zones would otherwise converge on zone B's objects]
- Provide a strategy to clean up already-deleted buckets, i.e. the
'radosgw-admin bucket stale-instances rm' command
Summary:
- Add a process for 'deferred bucket deletion', where local bucket
instance metadata is removed and the bucket index/data are scheduled for
later 'bucket gc'. A new 'bucket gc list' is stored in omap and
processed by a worker similar to existing gc.
- For metadata sync, the metadata log format needs to be extended
to distinguish between normal writes and deletion events on bucket
instances. When metadata sync encounters a bucket instance deletion, it
runs 'deferred bucket deletion'.
- Data sync on the bucket needs to avoid creating new objects while
bucket gc is running.
mdlog:
- entries must distinguish between Write, Remove, and Delete (where
Delete implies gc of associated data)
- a 'bucket rm' Deletes its bucket instance metadata
- a 'bucket reshard' Removes the old bucket instance because the
new bucket instance still owns the data
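To make the three entry types concrete, here is a rough sketch in Python rather than rgw's C++. The names MdlogOp, MdlogEntry, and implies_gc are made up for illustration and are not existing rgw types; only the Write/Remove/Delete distinction comes from the design above.

```python
from dataclasses import dataclass
from enum import Enum


class MdlogOp(Enum):
    WRITE = "write"    # metadata was created or updated
    REMOVE = "remove"  # metadata removed, data is still owned elsewhere
    DELETE = "delete"  # metadata removed, associated data must be gc'd


@dataclass(frozen=True)
class MdlogEntry:
    section: str  # metadata section, e.g. "bucket.instance"
    key: str      # metadata key within that section
    op: MdlogOp


def op_for_bucket_rm() -> MdlogOp:
    # 'bucket rm' gives up the data, so it logs a Delete
    return MdlogOp.DELETE


def op_for_reshard_old_instance() -> MdlogOp:
    # the new instance still owns the data, so the old one is only Removed
    return MdlogOp.REMOVE


def implies_gc(entry: MdlogEntry) -> bool:
    # only a Delete on a bucket instance should trigger deferred bucket delete
    return entry.section == "bucket.instance" and entry.op is MdlogOp.DELETE
```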
Bucket gc list:
- stored in omap in the log pool
- sharded over multiple objects
- each entry encodes RGWBucketInfo (needed to delete objects after
bucket instance is deleted)
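A minimal in-memory sketch of what a sharded list could look like, assuming a fixed shard count and a hash of the bucket instance key to pick the shard. The object name prefix "bucket.gc.", the shard count, and the JSON encoding are all placeholders; the real list would live in omap and store an encoded RGWBucketInfo.

```python
import json
import zlib

NUM_SHARDS = 8  # hypothetical shard count


def shard_oid(instance_key: str, num_shards: int = NUM_SHARDS) -> str:
    # a stable hash of the key selects which log-pool object holds the entry
    shard = zlib.crc32(instance_key.encode()) % num_shards
    return f"bucket.gc.{shard}"


class BucketGcList:
    """Dicts stand in for the omap of each shard object."""

    def __init__(self):
        self.objects = {}  # oid -> {instance_key: encoded bucket info}

    def add(self, instance_key, bucket_info):
        omap = self.objects.setdefault(shard_oid(instance_key), {})
        # JSON stands in for the encoded RGWBucketInfo
        omap[instance_key] = json.dumps(bucket_info, sort_keys=True)

    def entries(self, oid):
        # decode every entry stored in one shard object
        return {k: json.loads(v) for k, v in self.objects.get(oid, {}).items()}
```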
Bucket index:
- add REMOVE_ONLY flag to bucket index to prevent object creation
from racing with bucket gc
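The intended behaviour of the flag, sketched against a toy index shard. BucketIndexShard and RemoveOnlyError are hypothetical names; the real check would live in the bucket index ops on the osd side.

```python
class RemoveOnlyError(Exception):
    """Raised when an object creation hits a REMOVE_ONLY shard."""


class BucketIndexShard:
    def __init__(self):
        self.remove_only = False
        self.entries = set()

    def set_remove_only(self):
        # once set, the shard only ever shrinks
        self.remove_only = True

    def add(self, name):
        if self.remove_only:
            raise RemoveOnlyError(name)
        self.entries.add(name)

    def remove(self, name):
        # removals stay allowed so bucket gc can drain the shard
        self.entries.discard(name)
```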
Deferred bucket delete:
- flag bucket index shards as REMOVE_ONLY
- add to 'bucket gc' list (entry includes encoded RGWBucketInfo)
*requires access to existing bucket instance metadata*
- delete local bucket instance (add Delete entry to mdlog)
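The ordering of these three steps matters, so here it is as a sketch over a toy Zone object. Zone and its fields are invented for the example; the point is only that the bucket info is read first, and the instance is deleted last.

```python
class Zone:
    """Toy stand-in for one zone's local state."""

    def __init__(self):
        self.instances = {}       # instance key -> bucket info dict
        self.remove_only = set()  # (instance key, shard id) pairs
        self.gc_list = {}         # instance key -> copy of bucket info
        self.mdlog = []           # (instance key, op) tuples


def deferred_bucket_delete(zone, key):
    # requires the existing bucket instance metadata
    info = zone.instances.get(key)
    if info is None:
        raise KeyError(f"no bucket instance metadata for {key}")
    # 1. flag every index shard so no new objects can be created
    for shard in range(info["num_shards"]):
        zone.remove_only.add((key, shard))
    # 2. schedule the bucket for gc, carrying the bucket info along
    zone.gc_list[key] = dict(info)
    # 3. delete the local instance and log a Delete in the mdlog
    del zone.instances[key]
    zone.mdlog.append((key, "delete"))
```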
Metadata sync:
- must serialize sync of mdlog entries with the same metadata key,
to preserve order of Writes vs Removes/Deletes
- can skip Writes if they're followed by Removes/Deletes
- on Delete of bucket instance, run deferred bucket delete
- backward compatibility: what to do with mdlog entries that don't
specify Write/Remove/Delete?
  - for bucket instance: assume write (because we never deleted
    them before upgrade), and just try to fetch
  - for other metadata: use existing strategy to fetch remote
    metadata, and remove local metadata on 404/ENOENT
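One way the per-key ordering and the backward compatibility rules could fit together, as a sketch. Entries are plain (section, key, op) tuples, an op of None models a pre-upgrade entry, and "fetch_or_remove" is a made-up label for the existing fetch-then-remove-on-ENOENT strategy.

```python
from collections import defaultdict


def resolve_legacy(section, op):
    if op is not None:
        return op
    if section == "bucket.instance":
        # bucket instances were never deleted before upgrade
        return "write"
    # other metadata: fetch, and remove locally on 404/ENOENT
    return "fetch_or_remove"


def plan_sync(entries):
    """Group entries by metadata key, keeping log order within each key."""
    per_key = defaultdict(list)
    for section, key, op in entries:
        per_key[(section, key)].append(resolve_legacy(section, op))

    plan = {}
    for mkey, ops in per_key.items():
        kept = []
        later_removal = False
        # walk backwards so we know whether a Remove/Delete follows
        for op in reversed(ops):
            if op in ("remove", "delete"):
                later_removal = True
                kept.append(op)
            elif op == "write" and later_removal:
                continue  # this Write is superseded, skip it
            else:
                kept.append(op)
        plan[mkey] = kept[::-1]
    return plan
```

Each key's list would then be applied in order by a single worker, while different keys can still sync in parallel.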
Bucket sync:
- bucket sync first fetches bucket instance - on ENOENT, exit
bucket sync with success
- if sync_object() returns REMOVE_ONLY error from bucket index,
exit bucket sync with success
- read/fetch bucket instance metadata before taking lease to avoid
recreating bucket sync status objects
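The control flow for bucket sync, sketched with callbacks in place of the real coroutines. run_bucket_sync, its parameters, and the return strings are hypothetical; a fetch_instance() result of None models ENOENT.

```python
class RemoveOnlyError(Exception):
    """Stand-in for the REMOVE_ONLY error from the bucket index."""


def run_bucket_sync(fetch_instance, take_lease, sync_object, objects):
    # fetch the bucket instance before taking the lease, so a deleted
    # bucket never gets its sync status objects recreated
    info = fetch_instance()
    if info is None:
        return "gone"  # ENOENT: nothing to sync, treat as success

    take_lease()
    for obj in objects:
        try:
            sync_object(obj)
        except RemoveOnlyError:
            # bucket gc owns this bucket now, stop without an error
            return "remove_only"
    return "synced"
```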
Bucket gc worker:
- for each bucket in gc list:
  - decode RGWBucketInfo
  - delete each object in bucket [should we GC tail objects or
    delete inline?]
  - delete incomplete multiparts
  - delete bucket index objects
  - delete bucket sync status objects
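A toy version of the worker loop. The store layout and the "bucket_id" field are invented for the example, JSON stands in for the encoded RGWBucketInfo, and removing the gc entry once the bucket is fully processed is my assumption; it is not spelled out above.

```python
import json


def run_bucket_gc(gc_list, store):
    """gc_list: instance key -> encoded bucket info.
    store: category name -> {bucket_id: list of rados objects}."""
    for key in list(gc_list):
        info = json.loads(gc_list[key])  # stand-in for decoding RGWBucketInfo
        bucket_id = info["bucket_id"]
        # same order as the steps above: data first, index and status last
        store["objects"].pop(bucket_id, None)
        store["multiparts"].pop(bucket_id, None)
        store["index"].pop(bucket_id, None)
        store["sync_status"].pop(bucket_id, None)
        # assumption: drop the gc entry only after everything else is gone
        del gc_list[key]
```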
radosgw-admin bucket stale-instances rm:
- run deferred bucket delete on each bucket instance that:
  - does not have an associated bucket entrypoint
  - has a bucket id matching its bucket marker? (has not been
    resharded)
- must be safe to run on any zone after upgrade
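The selection criteria as a predicate, under one possible reading: "associated entrypoint" is taken to mean that some bucket entrypoint currently points at this instance's bucket id. The dict fields and the function name are made up, and the marker check is still an open question above.

```python
def is_stale(instance, entrypoint_ids):
    """instance: dict with 'bucket_id' and 'marker'.
    entrypoint_ids: set of bucket ids that some entrypoint points at."""
    has_entrypoint = instance["bucket_id"] in entrypoint_ids
    # after a reshard the new instance gets a new id but keeps the old marker
    resharded = instance["bucket_id"] != instance["marker"]
    return not has_entrypoint and not resharded


def stale_instances(instances, entrypoint_ids):
    # candidates for deferred bucket delete
    return [i for i in instances if is_stale(i, entrypoint_ids)]
```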