Re: rgw: bucket deletion in multisite

On 4/30/19 9:16 PM, Yehuda Sadeh-Weinraub wrote:
See my comments below. In general the plan looks good to me.

Yehuda

On Tue, Apr 30, 2019 at 1:42 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
Hi rgw folks, this is a rough design for cleanup of deleted buckets in
multisite. I would love some review/feedback.

Motivation:
      - Bucket deletion in a multisite configuration does not delete
bucket instance metadata, bucket sync status, or bucket index objects on
any zone. This allows bucket sync on each zone to finish processing
object deletions and (hopefully) converge on empty.

Requirements:
      - Remove all objects associated with deleted buckets in a timely
manner:
          - bucket instance metadata, bucket index shards, and bucket
sync status
          - all object data
      - Does not rely on bucket sync to delete all objects [zone A may
delete an empty bucket that hasn't yet synced objects from zone B, so
the zones would converge on zone B's objects]
      - Strategy to clean up already-deleted buckets, ie 'radosgw-admin
bucket stale-instances rm' command

Summary:
      - Add a process for 'deferred bucket deletion', where local bucket
instance metadata is removed and the bucket index/data are scheduled for
later 'bucket gc'. A new 'bucket gc list' is stored in omap and
processed by a worker similar to existing gc.
      - For metadata sync, the metadata log format needs to be extended
to distinguish between normal writes and deletion events on bucket
instances. When metadata sync encounters a bucket instance deletion, it
runs 'deferred bucket deletion'.
      - Data sync on the bucket needs to avoid creating new objects while
bucket gc is running.

mdlog:
      - entries must distinguish between Write, Remove, and Delete (where
Delete implies gc of associated data)
      - a 'bucket rm' Deletes its bucket instance metadata
      - a 'bucket reshard' Removes the old bucket instance because the
new bucket instance still owns the data
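
A rough sketch of what the extended mdlog entry might look like (the names and types here are illustrative stand-ins, not the actual RGW structures):

```python
from dataclasses import dataclass
from enum import Enum

class MdlogOp(Enum):
    WRITE = 1    # normal metadata write
    REMOVE = 2   # remove metadata only (no data gc)
    DELETE = 3   # remove metadata and gc the associated data

@dataclass
class MdlogEntry:
    section: str   # e.g. 'bucket.instance'
    key: str       # metadata key, e.g. '<bucket>:<instance-id>'
    op: MdlogOp

def entry_for_bucket_rm(bucket, instance_id):
    # 'bucket rm' Deletes the bucket instance, implying gc of its data
    return MdlogEntry('bucket.instance', f'{bucket}:{instance_id}', MdlogOp.DELETE)

def entry_for_reshard(bucket, old_instance_id):
    # 'bucket reshard' only Removes the old instance, because the new
    # instance still owns the data
    return MdlogEntry('bucket.instance', f'{bucket}:{old_instance_id}', MdlogOp.REMOVE)
```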

Bucket gc list:
      - stored in omap in the log pool
      - sharded over multiple objects
      - each entry encodes RGWBucketInfo (needed to delete objects after
bucket instance is deleted)
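
To make the sharding concrete, here's a toy model of the gc list (plain dicts standing in for omap on log pool objects; the shard count and hashing scheme are assumptions, not part of the design above):

```python
import hashlib
import json

NUM_GC_SHARDS = 16  # assumed shard count; would be configurable in practice

# in-memory stand-in for the sharded omap objects in the log pool
gc_shards = [dict() for _ in range(NUM_GC_SHARDS)]

def gc_shard_for(bucket_key):
    # distribute entries over shards by hashing the bucket key
    h = int(hashlib.md5(bucket_key.encode()).hexdigest(), 16)
    return gc_shards[h % NUM_GC_SHARDS]

def gc_list_add(bucket_info):
    # each entry encodes enough of RGWBucketInfo to delete the bucket's
    # objects after the bucket instance metadata itself is gone
    key = f"{bucket_info['bucket']}:{bucket_info['instance_id']}"
    gc_shard_for(key)[key] = json.dumps(bucket_info)
```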

Bucket index:
      - add REMOVE_ONLY flag to bucket index to prevent object creation
from racing with bucket gc
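
The intended semantics of the flag, sketched on a toy index shard (this models the behavior described above, not the actual cls_rgw implementation):

```python
import errno

class BucketIndexShard:
    """Toy model of one bucket index shard with a REMOVE_ONLY flag."""
    def __init__(self):
        self.remove_only = False
        self.entries = {}

    def create(self, key, meta):
        # once the shard is flagged REMOVE_ONLY, object creation fails,
        # so new writes can't race with bucket gc deleting everything
        if self.remove_only:
            raise OSError(errno.EROFS, 'bucket index is remove-only')
        self.entries[key] = meta

    def remove(self, key):
        # deletions stay allowed so gc and pending removals can finish
        self.entries.pop(key, None)
```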

Deferred bucket delete:
      - flag bucket index shards as REMOVE_ONLY
      - add to 'bucket gc' list (entry includes encoded RGWBucketInfo)
*requires access to existing bucket instance metadata*
      - delete local bucket instance (add Delete entry to mdlog)
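
The three steps above could be orchestrated roughly like this (stub containers stand in for the index, gc list, mdlog, and metadata store; the ordering is the important part, since the gc entry must be recorded while the bucket instance metadata still exists):

```python
class Shard:
    remove_only = False

def deferred_bucket_delete(bucket_info, index_shards, gc_list, mdlog, metadata):
    for shard in index_shards:
        shard.remove_only = True                       # 1. block new object writes
    gc_list.append(dict(bucket_info))                  # 2. schedule bucket gc
                                                       #    (needs RGWBucketInfo)
    key = f"{bucket_info['bucket']}:{bucket_info['instance_id']}"
    metadata.pop(key, None)                            # 3. delete local instance...
    mdlog.append(('bucket.instance', key, 'DELETE'))   #    ...and log a Delete entry
```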

Metadata sync:
      - must serialize sync of mdlog entries with the same metadata key,
to preserve order of Writes vs Removes/Deletes
          - can skip Writes if they're followed by Removes/Deletes
      - on Delete of bucket instance, run deferred bucket delete
      - backward compatibility: what to do with mdlog entries that don't
specify Write/Remove/Delete?
          - for bucket instance: assume write (because we never deleted
them before upgrade), and just try to fetch
          - for other metadata: use existing strategy to fetch remote
metadata, and remove local metadata on 404/ENOENT
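
The "skip Writes followed by Removes/Deletes" rule amounts to coalescing mdlog entries per key, keeping only the last op for each key while preserving first-seen order across keys, e.g.:

```python
def coalesce_mdlog(entries):
    # entries: list of (key, op) pairs in log order. Entries with the
    # same key must be processed serially to preserve Write vs
    # Remove/Delete ordering; a Write can be skipped entirely when a
    # later entry for the same key Removes or Deletes it.
    last_op = {}
    order = []
    for key, op in entries:
        if key not in last_op:
            order.append(key)
        last_op[key] = op
    return [(k, last_op[k]) for k in order]
```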

Bucket sync:
      - bucket sync first fetches bucket instance - on ENOENT, exit
bucket sync with success
      - if sync_object() returns REMOVE_ONLY error from bucket index,
exit bucket sync with success
      - read/fetch bucket instance metadata before taking lease to avoid
recreating bucket sync status objects
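
The early-exit control flow might look like this (the design above doesn't pin down which error code the REMOVE_ONLY index returns, so EROFS is a stand-in here, as are the callback names):

```python
import errno

def run_bucket_sync(fetch_bucket_instance, sync_objects):
    # fetch the bucket instance before taking the sync lease, so a
    # deleted bucket doesn't get its sync status objects recreated
    try:
        info = fetch_bucket_instance()
    except OSError as e:
        if e.errno == errno.ENOENT:
            return 0   # bucket was deleted: exit bucket sync with success
        raise
    try:
        sync_objects(info)
    except OSError as e:
        if e.errno == errno.EROFS:
            return 0   # index flagged REMOVE_ONLY: bucket gc is running
        raise
    return 0
```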

Bucket gc worker:
      - for each bucket in gc list:
          - decode RGWBucketInfo
          - delete each object in bucket [should we GC tail objects or
delete inline?]
Do you store progress anywhere? Object removal should probably avoid
touching the bucket indexes. What if there are a zillion objects in
the bucket? You don't want it to start from the beginning if the
process was stopped in the middle. I think not involving the gc would
be more efficient and less risky as otherwise you might be risking
flooding the gc omaps, but you'll need to keep a marker somewhere.
Also, will need to do this asynchronously with a configurable number
of concurrent operations.

Okay. I was planning to rely on the bucket index to track progress. The assumption was that the buckets would have very few objects in general, because they had to be empty on one zone in order to delete them. Similarly, I was hoping to avoid the complexity of concurrency within a bucket index shard.

But because a) zillion-object cases are possible when sync is far enough behind, and b) large single-sharded buckets are more likely in multisite, I agree that we do need these optimizations here.

And by avoiding bucket index ops during bucket gc, the proposed REMOVE_ONLY flag on the bucket index could be more general (ie READONLY) and easier to implement.

          - delete incomplete multiparts
          - delete bucket index objects
          - delete bucket sync status objects
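
Putting together the loop with the marker-based progress and bounded concurrency discussed above, a sketch of the per-bucket deletion pass (callback names and the batch/concurrency knobs are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def gc_bucket_objects(list_objects, delete_object, load_marker, save_marker,
                      max_concurrent=8, batch=100):
    # resume from the saved marker so a restart doesn't rescan the whole
    # bucket (important for the zillion-object case), and issue deletes
    # with a bounded number of concurrent operations
    marker = load_marker()
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        while True:
            keys = list_objects(marker, batch)
            if not keys:
                break
            list(pool.map(delete_object, keys))  # bounded concurrency
            marker = keys[-1]
            save_marker(marker)                  # persist progress
```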

radosgw-admin bucket stale-instances rm:
      - run deferred bucket delete on each bucket instance that:
          - does not have an associated bucket entrypoint
You need to be careful not to remove newly created bucket instances.

          - has a bucket id matching its bucket marker? (has not been
resharded)
Why? You can check for any bucket instance whether it's current by
going to the corresponding bucket meta. In any case all of this is
racy so need to put appropriate guards.

Yeah, this is going to be tricky - I don't think either of my two bullet points is correct here. The important part is to distinguish between a non-current bucket instance that was resharded and one that was removed. It looks like we can rely on RGWBucketInfo::reshard_status for this (status=DONE for reshard, and NONE for remove).
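
A sketch of the resulting check, with the guard against racing with bucket creation folded in as a minimum-age heuristic (the age guard and the numeric status values are assumptions for illustration; the real values live in RGWBucketInfo):

```python
RESHARD_NONE, RESHARD_IN_PROGRESS, RESHARD_DONE = 0, 1, 2  # assumed values

def should_deferred_delete(instance_info, entrypoint_exists, age_seconds,
                           min_age=3600):
    # guard against racing with bucket creation: a brand-new instance may
    # not be linked to its entrypoint yet, so require a minimum age
    if entrypoint_exists or age_seconds < min_age:
        return False
    # a reshard leaves the old instance with reshard_status DONE, while a
    # genuine bucket removal leaves it at NONE; only the latter should go
    # through deferred bucket delete (which gc's the data)
    return instance_info['reshard_status'] == RESHARD_NONE
```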



