"Not involve bi." yes, albeit, we're trying to update that process and its assumptions, in parallel. Matt On Tue, Apr 30, 2019 at 9:17 PM Yehuda Sadeh-Weinraub <ysadehwe@xxxxxxxxxx> wrote: > > See my comments below. In general the plan looks good to me. > > Yehuda > > On Tue, Apr 30, 2019 at 1:42 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote: > > > > Hi rgw folks, this is a rough design for cleanup of deleted buckets in > > multisite. I would love some review/feedback. > > > > Motivation: > > - Bucket deletion in a multisite configuration does not delete > > bucket instance metadata, bucket sync status, or bucket index objects on > > any zone. This allows bucket sync on each zone to finish processing > > object deletions and (hopefully) converge on empty. > > > > Requirements: > > - Remove all objects associated with deleted buckets in a timely > > manner: > > - bucket instance metadata, bucket index shards, and bucket > > sync status > > - all object data > > - Does not rely on bucket sync to delete all objects [zone A may > > delete an empty bucket that hasn't yet synced objects from zone B, so > > the zones would converge on zone B's objects] > > - Strategy to clean up already-deleted buckets, ie 'radosgw-admin > > bucket stale-instances rm' command > > > > Summary: > > - Add a process for 'deferred bucket deletion', where local bucket > > instance metadata is removed and the bucket index/data are scheduled for > > later 'bucket gc'. A new 'bucket gc list' is stored in omap and > > processed by a worker similar to existing gc. > > - For metadata sync, the metadata log format needs to be extended > > to distinguish between normal writes and deletion events on bucket > > instances. When metadata sync encounters a bucket instance deletion, it > > runs 'deferred bucket deletion'. > > - Data sync on the bucket needs to avoid creating new objects while > > bucket gc is running. 
> >
> > mdlog:
> > - entries must distinguish between Write, Remove, and Delete (where
> >   Delete implies gc of associated data)
> > - a 'bucket rm' Deletes its bucket instance metadata
> > - a 'bucket reshard' Removes the old bucket instance because the
> >   new bucket instance still owns the data
> >
> > Bucket gc list:
> > - stored in omap in the log pool
> > - sharded over multiple objects
> > - each entry encodes RGWBucketInfo (needed to delete objects after
> >   bucket instance is deleted)
> >
> > Bucket index:
> > - add REMOVE_ONLY flag to bucket index to prevent object creation
> >   from racing with bucket gc
> >
> > Deferred bucket delete:
> > - flag bucket index shards as REMOVE_ONLY
> > - add to 'bucket gc' list (entry includes encoded RGWBucketInfo)
> >   *requires access to existing bucket instance metadata*
> > - delete local bucket instance (add Delete entry to mdlog)
> >
> > Metadata sync:
> > - must serialize sync of mdlog entries with the same metadata key,
> >   to preserve order of Writes vs Removes/Deletes
> > - can skip Writes if they're followed by Removes/Deletes
> > - on Delete of bucket instance, run deferred bucket delete
> > - backward compatibility: what to do with mdlog entries that don't
> >   specify Write/Remove/Delete?
> >   - for bucket instance: assume write (because we never deleted
> >     them before upgrade), and just try to fetch
> >   - for other metadata: use existing strategy to fetch remote
> >     metadata, and remove local metadata on 404/ENOENT
> >
> > Bucket sync:
> > - bucket sync first fetches bucket instance - on ENOENT, exit
> >   bucket sync with success
> > - if sync_object() returns REMOVE_ONLY error from bucket index,
> >   exit bucket sync with success
> > - read/fetch bucket instance metadata before taking lease to avoid
> >   recreating bucket sync status objects
> >
> > Bucket gc worker:
> > - for each bucket in gc list:
> >   - decode RGWBucketInfo
> >   - delete each object in bucket [should we GC tail objects or
> >     delete inline?]
>
> Do you store progress anywhere? Object removal should probably avoid
> touching the bucket indexes. What if there are a zillion objects in
> the bucket? You don't want it to start from the beginning if the
> process was stopped in the middle. I think not involving the gc would
> be more efficient and less risky as otherwise you might be risking
> flooding the gc omaps, but you'll need to keep a marker somewhere.
> Also, will need to do this asynchronously with a configurable number
> of concurrent operations.
>
> >   - delete incomplete multiparts
> >   - delete bucket index objects
> >   - delete bucket sync status objects
> >
> > radosgw-admin bucket stale-instances rm:
> > - run deferred bucket delete on each bucket instance that:
> >   - does not have an associated bucket entrypoint
>
> You need to be careful not to remove newly created bucket instances.
>
> >   - has a bucket id matching its bucket marker? (has not been
> >     resharded)
>
> Why? You can check for any bucket instance whether it's current by
> going to the corresponding bucket meta. In any case all of this is
> racy so need to put appropriate guards.
>
> > - must be safe to run on any zone after upgrade

--
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel. 734-821-5101
fax. 734-769-8938
cel. 734-216-5309
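
For illustration, the marker and bounded-concurrency points raised in
the thread (keep a resume marker, delete inline rather than through the
existing gc, cap the number of concurrent operations) could be sketched
roughly as below. This is plain Python, not rgw code.
`gc_bucket_objects`, `list_objects`, `delete_object` and `state` are
hypothetical names; how objects are enumerated and where the marker is
persisted are left abstract, and the sketch assumes deletes are
idempotent so a batch can safely be retried after a restart.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_bucket_objects(list_objects, delete_object, state,
                      batch_size=1000, max_concurrent=8):
    """Delete every object in a bucket, persisting a resume marker per batch.

    list_objects(marker, limit) -> sorted names strictly greater than marker
    delete_object(name)         -> removes one object; must tolerate repeats
    state                       -> dict-like store holding the 'marker' key
    """
    marker = state.get("marker", "")
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        while True:
            batch = list_objects(marker, batch_size)
            if not batch:
                break
            # list() drains the result iterator, so a failed delete raises
            # here, before the marker is moved past this batch.
            list(pool.map(delete_object, batch))
            marker = batch[-1]
            state["marker"] = marker
    state["done"] = True
```

The marker only advances after a whole batch has been deleted, so a
worker that stops in the middle redoes at most one batch instead of
starting from the beginning; `max_concurrent` is the configurable cap on
in-flight deletes.
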