Re: rgw multisite: mdlog transactions for metadata sync

Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> · Mon, 15 Apr 2019 10:25:09 -0700

On Mon, Apr 15, 2019 at 10:09 AM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>
> Hi Yehuda,
>
> I'm working on a design for the cleanup of deleted buckets in multisite.
> To do this, I'd like to trigger some actions on secondary zones when
> metadata sync sees a bucket instance get deleted. The first obstacle
> here is that metadata sync can't differentiate between writes and
> deletes due to how the mdlog transactions are structured.
>
> RGWMetadataManager::pre_modify() writes an mdlog entry with the status
> of MDLOG_STATUS_WRITE/REMOVE, and post_modify() completes the
> transaction with an MDLOG_STATUS_COMPLETE entry. So only the 'prepare'
> step knows what kind of op it was, and sync can't reliably associate a
> COMPLETE with its prepare because mdlog trimming may have deleted the
> prepare.
>
> In RGWMetaSyncSingleEntryCR, metadata sync filters out any entries that
> aren't MDLOG_STATUS_COMPLETE, and tries to infer the deletes based on
> whether RGWReadRemoteMetadataCR returns ENOENT. This part should be
> explicit if it's going to trigger further object deletion, so I'd like
> to add a separate 'op' field to the mdlog for this.

Yeah, makes sense.
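
For the sake of discussion, here is a rough C++ sketch of how an
explicit op field could let sync classify a COMPLETE entry directly
instead of inferring a delete from ENOENT. Every name in it
(MdlogEntrySketch, classify_entry, the enums) is made up for
illustration and is not the actual RGW definition; the fallback for
entries that carry no op is also just an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for the mdlog status values named in this thread.
enum class MdlogStatusSketch { Write, Remove, Complete };

// Hypothetical explicit op field; Unknown models entries written before
// such a field existed.
enum class MdlogOpSketch { Unknown, Write, Remove };

// Hypothetical, simplified log entry (not the real RGW structure).
struct MdlogEntrySketch {
  std::string section;
  std::string key;
  MdlogStatusSketch status;
  MdlogOpSketch op;
};

enum class SyncActionSketch { Skip, FetchAndStore, RemoveLocal, InferFromRemote };

// Decide what sync would do with one entry.
SyncActionSketch classify_entry(const MdlogEntrySketch& e) {
  // Sync only acts on completed transactions, as it does today.
  if (e.status != MdlogStatusSketch::Complete) {
    return SyncActionSketch::Skip;
  }
  switch (e.op) {
    case MdlogOpSketch::Write:
      return SyncActionSketch::FetchAndStore;
    case MdlogOpSketch::Remove:
      // Explicit delete: safe to trigger further cleanup.
      return SyncActionSketch::RemoveLocal;
    case MdlogOpSketch::Unknown:
      break;
  }
  // Assumed fallback: no op recorded, so infer from the remote read
  // (ENOENT means deleted), as sync does now.
  return SyncActionSketch::InferFromRemote;
}
```

The point is only that the delete decision comes from the log entry
itself, so it no longer depends on the prepare surviving trimming.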

>
> I'm also wondering if this separate 'prepare' entry is worth writing,
> given that we ignore it during sync - I'd like to remove it if we can,
> the same way I proposed for the bucket index log in
> https://github.com/ceph/ceph/pull/26755. Do you see a reason to keep
> either of those?
>

The prepare entry was meant to cover the case where the metadata entry
has already been written but we then fail to write the final complete
log entry (e.g., due to a crash). In the original design we'd detect a
prepare with no matching complete and recover. However, that recovery
was never implemented, so dropping the prepare doesn't make things any
worse than they already are. The open question is how to solve the
failure case I described. For data it's not an issue: the bucket
index has the dir_suggest mechanism, which handles it implicitly.
For metadata there is no such mechanism. One option might be to
maintain a journal of metadata changes. In any case, I do think the
prepare can and needs to go.
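
To make the failure case concrete, here is a small C++ sketch of the
kind of scan the original design implied: walk a log window and report
keys that have a prepare with no later complete. This was never
implemented, so the code is purely illustrative; LogEntrySketch,
LogStatusSketch and find_dangling_prepares are hypothetical names, and
it assumes the window has not been trimmed.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified view of an mdlog entry.
enum class LogStatusSketch { Write, Remove, Complete };

struct LogEntrySketch {
  std::string key;
  LogStatusSketch status;
};

// Return the keys whose prepare (Write/Remove) has no matching Complete
// later in the given log window, in sorted key order.
std::vector<std::string> find_dangling_prepares(
    const std::vector<LogEntrySketch>& log) {
  std::map<std::string, int> pending;  // key -> count of unmatched prepares
  for (const auto& e : log) {
    if (e.status == LogStatusSketch::Complete) {
      auto it = pending.find(e.key);
      // A complete with no visible prepare is ignored here.
      if (it != pending.end() && --it->second == 0) {
        pending.erase(it);
      }
    } else {
      ++pending[e.key];
    }
  }
  std::vector<std::string> dangling;
  for (const auto& kv : pending) {
    dangling.push_back(kv.first);
  }
  return dangling;
}
```

A key reported here is one where the metadata may have changed without
a complete ever being logged, which is the gap that still needs a
mechanism once the prepare is gone.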

Yehuda