Re: rgw: bucket deletion in multisite

Matt Benjamin <mbenjami@xxxxxxxxxx> · Tue, 30 Apr 2019 21:47:40 -0400

"Not involve bi" (the bucket index): yes, although we're trying to
update that process and its assumptions in parallel.

Matt

On Tue, Apr 30, 2019 at 9:17 PM Yehuda Sadeh-Weinraub
<ysadehwe@xxxxxxxxxx> wrote:
>
> See my comments below. In general the plan looks good to me.
>
> Yehuda
>
> On Tue, Apr 30, 2019 at 1:42 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
> >
> > Hi rgw folks, this is a rough design for cleanup of deleted buckets in
> > multisite. I would love some review/feedback.
> >
> > Motivation:
> >      - Bucket deletion in a multisite configuration does not delete
> > bucket instance metadata, bucket sync status, or bucket index objects on
> > any zone. This allows bucket sync on each zone to finish processing
> > object deletions and (hopefully) converge on empty.
> >
> > Requirements:
> >      - Remove all objects associated with deleted buckets in a timely
> > manner:
> >          - bucket instance metadata, bucket index shards, and bucket
> > sync status
> >          - all object data
> >      - Does not rely on bucket sync to delete all objects [zone A may
> > delete an empty bucket that hasn't yet synced objects from zone B, so
> > the zones would converge on zone B's objects]
> >      - Strategy to clean up already-deleted buckets, i.e. the
> > 'radosgw-admin bucket stale-instances rm' command
> >
> > Summary:
> >      - Add a process for 'deferred bucket deletion', where local bucket
> > instance metadata is removed and the bucket index/data are scheduled for
> > later 'bucket gc'. A new 'bucket gc list' is stored in omap and
> > processed by a worker similar to existing gc.
> >      - For metadata sync, the metadata log format needs to be extended
> > to distinguish between normal writes and deletion events on bucket
> > instances. When metadata sync encounters a bucket instance deletion, it
> > runs 'deferred bucket deletion'.
> >      - Data sync on the bucket needs to avoid creating new objects while
> > bucket gc is running.
> >
> > mdlog:
> >      - entries must distinguish between Write, Remove, and Delete (where
> > Delete implies gc of associated data)
> >      - a 'bucket rm' Deletes its bucket instance metadata
> >      - a 'bucket reshard' Removes the old bucket instance because the
> > new bucket instance still owns the data
> >
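
A minimal sketch of the three mdlog entry types described above, in Python for brevity. The names `MdlogOp`, `EVENT_TO_OP`, and the event strings are illustrative assumptions, not the actual RGW log format:

```python
from enum import Enum


class MdlogOp(Enum):
    WRITE = "write"    # metadata created or updated
    REMOVE = "remove"  # metadata dropped, data still owned elsewhere
    DELETE = "delete"  # metadata dropped and associated data must be gc'd


# Hypothetical mapping from admin-level events to mdlog entry types.
EVENT_TO_OP = {
    "bucket create": MdlogOp.WRITE,
    "bucket rm": MdlogOp.DELETE,
    # after a reshard the new instance owns the data, so the old
    # instance is only Removed
    "bucket reshard (old instance)": MdlogOp.REMOVE,
}


def implies_data_gc(op: MdlogOp) -> bool:
    """Only Delete implies gc of the bucket's index and data."""
    return op is MdlogOp.DELETE
```

The point of the split is that a peer zone replaying a Remove never schedules data gc, while a Delete always does.
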
> > Bucket gc list:
> >      - stored in omap in the log pool
> >      - sharded over multiple objects
> >      - each entry encodes RGWBucketInfo (needed to delete objects after
> > bucket instance is deleted)
> >
> > Bucket index:
> >      - add REMOVE_ONLY flag to bucket index to prevent object creation
> > from racing with bucket gc
> >
> > Deferred bucket delete:
> >      - flag bucket index shards as REMOVE_ONLY
> >      - add to 'bucket gc' list (entry includes encoded RGWBucketInfo)
> > *requires access to existing bucket instance metadata*
> >      - delete local bucket instance (add Delete entry to mdlog)
> >
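
The ordering of the deferred-delete steps above can be sketched against in-memory stand-ins. Everything here (`Zone`, its fields, the `num_shards` key, the `bucket.instance:` key prefix) is a hypothetical model for illustration, not RGW code:

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    instances: dict = field(default_factory=dict)    # instance id -> bucket info
    index_flags: dict = field(default_factory=dict)  # (instance id, shard) -> flag
    gc_list: list = field(default_factory=list)      # pending 'bucket gc' entries
    mdlog: list = field(default_factory=list)        # (metadata key, op) entries


def deferred_bucket_delete(zone: Zone, instance_id: str) -> bool:
    info = zone.instances.get(instance_id)
    if info is None:
        return False  # instance already gone, nothing to schedule

    # 1. flag every index shard so object creation cannot race with gc
    for shard in range(info["num_shards"]):
        zone.index_flags[(instance_id, shard)] = "REMOVE_ONLY"

    # 2. queue for gc, carrying a copy of the bucket info, because the
    #    instance metadata will not exist when the gc worker runs
    zone.gc_list.append({"instance": instance_id, "info": dict(info)})

    # 3. only now drop the local instance and log a Delete
    del zone.instances[instance_id]
    zone.mdlog.append(("bucket.instance:" + instance_id, "delete"))
    return True
```

Step 2 has to precede step 3 because the gc entry is built from the instance metadata that step 3 removes.
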
> > Metadata sync:
> >      - must serialize sync of mdlog entries with the same metadata key,
> > to preserve order of Writes vs Removes/Deletes
> >          - can skip Writes if they're followed by Removes/Deletes
> >      - on Delete of bucket instance, run deferred bucket delete
> >      - backward compatibility: what to do with mdlog entries that don't
> > specify Write/Remove/Delete?
> >          - for bucket instance: assume write (because we never deleted
> > them before upgrade), and just try to fetch
> >          - for other metadata: use existing strategy to fetch remote
> > metadata, and remove local metadata on 404/ENOENT
> >
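
A rough sketch of the metadata sync rules above: keep log order per key, skip Writes that are later superseded by a Remove/Delete, and resolve legacy entries that carry no op. The `(key, op)` tuple layout, the `bucket.instance:` prefix, and the `"fetch"` pseudo-op (fetch remote, remove local on 404/ENOENT) are assumptions made for the example:

```python
def resolve_op(key, op):
    """Fill in an op for pre-upgrade entries that don't specify one."""
    if op is not None:
        return op
    if key.startswith("bucket.instance:"):
        return "write"  # instances were never deleted before the upgrade
    return "fetch"      # existing strategy for all other metadata


def plan_sync(entries):
    """entries: list of (key, op) in log order; returns the ops to apply.

    Log order is preserved, so entries sharing a key stay serialized.
    """
    resolved = [(key, resolve_op(key, op)) for key, op in entries]
    plan = []
    for i, (key, op) in enumerate(resolved):
        superseded = op == "write" and any(
            k == key and o in ("remove", "delete") for k, o in resolved[i + 1:]
        )
        if not superseded:
            plan.append((key, op))
    return plan
```

For example, a Write on `bucket.instance:a` followed later by a Delete on the same key collapses to just the Delete, while unrelated keys pass through untouched.
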
> > Bucket sync:
> >      - bucket sync first fetches bucket instance - on ENOENT, exit
> > bucket sync with success
> >      - if sync_object() returns REMOVE_ONLY error from bucket index,
> > exit bucket sync with success
> >      - read/fetch bucket instance metadata before taking lease to avoid
> > recreating bucket sync status objects
> >
> > Bucket gc worker:
> >      - for each bucket in gc list:
> >          - decode RGWBucketInfo
> >          - delete each object in bucket [should we GC tail objects or
> > delete inline?]
>
> Do you store progress anywhere? Object removal should probably avoid
> touching the bucket indexes. What if there are a zillion objects in
> the bucket? You don't want it to start from the beginning if the
> process was stopped in the middle. I think not involving the gc would
> be more efficient and less risky, as otherwise you risk flooding the
> gc omaps, but you'll need to keep a marker somewhere. Also, this will
> need to run asynchronously, with a configurable number of concurrent
> operations.
>
> >          - delete incomplete multiparts
> >          - delete bucket index objects
> >          - delete bucket sync status objects
> >
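
One way to picture the marker and bounded concurrency raised in the review comment is the sketch below. It assumes object names can be listed in sorted order and that deleting an already-deleted object is harmless; `gc_bucket`, `state`, and `delete_fn` are hypothetical names, and a real worker would persist the marker alongside the gc list entry rather than in a dict:

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def gc_bucket(keys, delete_fn, state, batch_size=2, concurrency=2):
    """Delete sorted `keys` in batches, resuming after state['marker']."""
    marker = state.get("marker")
    start = bisect_right(keys, marker) if marker is not None else 0
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for i in range(start, len(keys), batch_size):
            batch = keys[i:i + batch_size]
            # list() re-raises the first failure from any worker
            list(pool.map(delete_fn, batch))
            # advance the marker only once the whole batch succeeded
            state["marker"] = batch[-1]
```

If the worker stops mid-batch, the marker still points at the last fully deleted batch, so a restart redoes at most one batch instead of starting from the beginning.
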
> > radosgw-admin bucket stale-instances rm:
> >      - run deferred bucket delete on each bucket instance that:
> >          - does not have an associated bucket entrypoint
>
> You need to be careful not to remove newly created bucket instances.
>
> >          - has a bucket id matching its bucket marker? (has not been
> > resharded)
>
> Why? For any bucket instance, you can check whether it's current by
> going to the corresponding bucket meta. In any case, all of this is
> racy, so we need to put appropriate guards in place.
>
> >      - must be safe to run on any zone after upgrade
