Re: rgw: bucket deletion in multisite

See my comments below. In general the plan looks good to me.

Yehuda

On Tue, Apr 30, 2019 at 1:42 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>
> Hi rgw folks, this is a rough design for cleanup of deleted buckets in
> multisite. I would love some review/feedback.
>
> Motivation:
>      - Bucket deletion in a multisite configuration does not delete
> bucket instance metadata, bucket sync status, or bucket index objects on
> any zone. This allows bucket sync on each zone to finish processing
> object deletions and (hopefully) converge on empty.
>
> Requirements:
>      - Remove all objects associated with deleted buckets in a timely
> manner:
>          - bucket instance metadata, bucket index shards, and bucket
> sync status
>          - all object data
>      - Does not rely on bucket sync to delete all objects [zone A may
> delete an empty bucket that hasn't yet synced objects from zone B, so
> the zones would converge on zone B's objects]
>      - Strategy to clean up already-deleted buckets, ie 'radosgw-admin
> bucket stale-instances rm' command
>
> Summary:
>      - Add a process for 'deferred bucket deletion', where local bucket
> instance metadata is removed and the bucket index/data are scheduled for
> later 'bucket gc'. A new 'bucket gc list' is stored in omap and
> processed by a worker similar to existing gc.
>      - For metadata sync, the metadata log format needs to be extended
> to distinguish between normal writes and deletion events on bucket
> instances. When metadata sync encounters a bucket instance deletion, it
> runs 'deferred bucket deletion'.
>      - Data sync on the bucket needs to avoid creating new objects while
> bucket gc is running.
>
> mdlog:
>      - entries must distinguish between Write, Remove, and Delete (where
> Delete implies gc of associated data)
>      - a 'bucket rm' Deletes its bucket instance metadata
>      - a 'bucket reshard' Removes the old bucket instance because the
> new bucket instance still owns the data
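
For illustration, a minimal sketch of what the extended mdlog entry could
carry; the names below are placeholders, not the existing RGW types:

    #include <cstdint>
    #include <string>

    // hypothetical sketch: give each mdlog entry an explicit op type
    enum class MdlogOp : uint8_t {
      Unknown = 0,  // entries written before the upgrade carry no op
      Write   = 1,  // normal metadata write
      Remove  = 2,  // drop metadata only (e.g. old instance after reshard)
      Delete  = 3,  // drop metadata and gc the bucket's index/data
    };

    struct MdlogEntrySketch {
      std::string section;  // e.g. "bucket.instance"
      std::string key;      // metadata key within that section
      MdlogOp op = MdlogOp::Unknown;
    };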
>
> Bucket gc list:
>      - stored in omap in the log pool
>      - sharded over multiple objects
>      - each entry encodes RGWBucketInfo (needed to delete objects after
> bucket instance is deleted)
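
As a sketch of the sharding, something like this could map a gc entry to one
of N omap objects in the log pool (the shard count and object naming are
assumptions, and std::hash stands in for whatever hash RGW actually uses):

    #include <functional>
    #include <string>

    // hypothetical sketch: choose the omap shard object for a bucket gc entry
    std::string bucket_gc_shard_oid(const std::string& bucket_instance_key,
                                    unsigned num_shards = 32) {
      const unsigned shard =
          std::hash<std::string>{}(bucket_instance_key) % num_shards;
      return "bucket_gc." + std::to_string(shard);
    }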
>
> Bucket index:
>      - add REMOVE_ONLY flag to bucket index to prevent object creation
> from racing with bucket gc
>
> Deferred bucket delete:
>      - flag bucket index shards as REMOVE_ONLY
>      - add to 'bucket gc' list (entry includes encoded RGWBucketInfo)
> *requires access to existing bucket instance metadata*
>      - delete local bucket instance (add Delete entry to mdlog)
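
A rough sketch of those three steps in order, assuming placeholder types and
helpers (none of these are existing RGW functions):

    #include <cstdint>
    #include <string>

    // stand-in for the RGWBucketInfo fields needed here
    struct BucketInfoSketch {
      std::string bucket_instance_key;  // "<tenant>/<name>:<instance-id>"
      uint32_t num_index_shards = 0;
    };

    // stand-ins for the actual index/omap/metadata operations
    int set_index_remove_only(const BucketInfoSketch& info);
    int bucket_gc_enqueue(const BucketInfoSketch& info);       // stores encoded info
    int delete_bucket_instance(const BucketInfoSketch& info);  // logs a Delete mdlog entry

    int deferred_bucket_delete(const BucketInfoSketch& info) {
      // 1. flag every index shard REMOVE_ONLY so no new objects land in it
      int r = set_index_remove_only(info);
      if (r < 0) return r;
      // 2. enqueue on the bucket gc list while the metadata is still available
      r = bucket_gc_enqueue(info);
      if (r < 0) return r;
      // 3. drop the local bucket instance, which adds the Delete mdlog entry
      return delete_bucket_instance(info);
    }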
>
> Metadata sync:
>      - must serialize sync of mdlog entries with the same metadata key,
> to preserve order of Writes vs Removes/Deletes
>          - can skip Writes if they're followed by Removes/Deletes
>      - on Delete of bucket instance, run deferred bucket delete
>      - backward compatibility: what to do with mdlog entries that don't
> specify Write/Remove/Delete?
>          - for bucket instance: assume write (because we never deleted
> them before upgrade), and just try to fetch
>          - for other metadata: use existing strategy to fetch remote
> metadata, and remove local metadata on 404/ENOENT
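
To illustrate the point about skipping Writes, a sketch that reuses the
MdlogOp/MdlogEntrySketch placeholders from above and keeps only the last op
seen for each key within a batch (presumably this would happen per shard):

    #include <map>
    #include <string>
    #include <vector>

    // hypothetical sketch: only the last op per key needs to be applied,
    // so a Write followed by a Remove/Delete of the same key can be skipped
    std::map<std::string, MdlogOp>
    coalesce_by_key(const std::vector<MdlogEntrySketch>& entries) {
      std::map<std::string, MdlogOp> last_op;
      for (const auto& e : entries) {
        last_op[e.key] = e.op;  // later entries supersede earlier ones
      }
      return last_op;
    }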
>
> Bucket sync:
>      - bucket sync first fetches bucket instance - on ENOENT, exit
> bucket sync with success
>      - if sync_object() returns REMOVE_ONLY error from bucket index,
> exit bucket sync with success
>      - read/fetch bucket instance metadata before taking lease to avoid
> recreating bucket sync status objects
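
A sketch of the two early exits, with placeholder helpers and a made-up error
code standing in for however the REMOVE_ONLY rejection actually surfaces:

    #include <cerrno>
    #include <string>

    // placeholder types/helpers, not existing RGW functions
    struct BucketInfoSketch { std::string bucket_instance_key; };
    int fetch_bucket_instance(const std::string& key, BucketInfoSketch* info);
    int sync_objects(const BucketInfoSketch& info);

    constexpr int ERR_REMOVE_ONLY = 2900;  // hypothetical error code

    int run_bucket_sync(const std::string& bucket_instance_key) {
      BucketInfoSketch info;
      int r = fetch_bucket_instance(bucket_instance_key, &info);
      if (r == -ENOENT) {
        return 0;  // instance already deleted: bucket sync is done
      }
      if (r < 0) {
        return r;
      }
      r = sync_objects(info);
      if (r == -ERR_REMOVE_ONLY) {
        return 0;  // index is draining for gc: stop creating objects
      }
      return r;
    }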
>
> Bucket gc worker:
>      - for each bucket in gc list:
>          - decode RGWBucketInfo
>          - delete each object in bucket [should we GC tail objects or
> delete inline?]

Do you store progress anywhere? Object removal should probably avoid
touching the bucket indexes. What if there are a zillion objects in
the bucket? You don't want it to start from the beginning if the
process was stopped in the middle. I think not involving the gc would
be more efficient and less risky, since otherwise you might end up
flooding the gc omaps, but you'll need to keep a marker somewhere.
Also, this will need to be done asynchronously, with a configurable
number of concurrent operations.
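
For the marker idea, something along these lines might work: list a batch of
objects starting from a persisted marker, delete them with bounded
concurrency, then advance the marker, so a restarted worker resumes where it
left off. All the helper names below are placeholders, not RGW APIs:

    #include <string>
    #include <vector>

    // stand-ins for the actual listing/removal/omap operations
    int list_objects(const std::string& bucket_id, const std::string& marker,
                     int max, std::vector<std::string>* names, bool* truncated);
    int delete_objects_async(const std::string& bucket_id,
                             const std::vector<std::string>& names,
                             unsigned max_concurrent);
    int save_marker(const std::string& bucket_id, const std::string& marker);

    // hypothetical sketch of a resumable purge loop for one gc'ed bucket
    int purge_bucket_objects(const std::string& bucket_id,
                             std::string marker,  // resume point, "" to start
                             unsigned max_concurrent = 16) {
      for (;;) {
        std::vector<std::string> batch;
        bool truncated = false;
        int r = list_objects(bucket_id, marker, 1000, &batch, &truncated);
        if (r < 0) return r;
        if (batch.empty()) return 0;
        // issue up to max_concurrent deletes at a time
        r = delete_objects_async(bucket_id, batch, max_concurrent);
        if (r < 0) return r;
        marker = batch.back();
        r = save_marker(bucket_id, marker);  // persist progress
        if (r < 0) return r;
        if (!truncated) return 0;
      }
    }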

>          - delete incomplete multiparts
>          - delete bucket index objects
>          - delete bucket sync status objects
>
> radosgw-admin bucket stale-instances rm:
>      - run deferred bucket delete on each bucket instance that:
>          - does not have an associated bucket entrypoint

You need to be careful not to remove newly created bucket instances.

>          - has a bucket id matching its bucket marker? (has not been
> resharded)

Why? You can check whether any bucket instance is current by looking
at the corresponding bucket metadata. In any case, all of this is
racy, so appropriate guards will be needed.
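
To make that concrete, the check might look roughly like this, erring on the
side of keeping an instance whenever the answer is unclear (placeholder names;
it would still need a guard for instances that were just created and not yet
linked to an entrypoint):

    #include <cerrno>
    #include <string>

    // placeholder types/helpers, not existing RGW functions
    struct EntrypointSketch { std::string current_instance_id; };
    int get_bucket_entrypoint(const std::string& tenant,
                              const std::string& bucket_name,
                              EntrypointSketch* ep);

    bool is_stale_instance(const std::string& tenant,
                           const std::string& bucket_name,
                           const std::string& instance_id) {
      EntrypointSketch ep;
      int r = get_bucket_entrypoint(tenant, bucket_name, &ep);
      if (r == -ENOENT) {
        return true;   // no entrypoint: the bucket itself was deleted
      }
      if (r < 0) {
        return false;  // unknown state: keep the instance
      }
      // an instance superseded by a reshard is also stale
      return ep.current_instance_id != instance_id;
    }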

>      - must be safe to run on any zone after upgrade


