Re: rgw multisite: mdlog transactions for metadata sync

Casey Bodley <cbodley@xxxxxxxxxx> · Mon, 15 Apr 2019 15:41:07 -0400

On 4/15/19 1:25 PM, Yehuda Sadeh-Weinraub wrote:
On Mon, Apr 15, 2019 at 10:09 AM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
Hi Yehuda,

I'm working on a design for the cleanup of deleted buckets in multisite.
To do this, I'd like to trigger some actions on secondary zones when
metadata sync sees a bucket instance get deleted. The first obstacle
here is that metadata sync can't differentiate between writes and
deletes due to how the mdlog transactions are structured.

RGWMetadataManager::pre_modify() writes an mdlog entry with the status
of MDLOG_STATUS_WRITE/REMOVE, and post_modify() completes the
transaction with a MDLOG_STATUS_COMPLETE entry. So only the 'prepare'
step knows what kind of op it was, and sync can't reliably associate a
COMPLETE with its prepare because mdlog trimming may have deleted the
prepare.
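
To make the trimming problem concrete, here is a minimal sketch of the two-entry transaction. Everything in it (LogEntry, trim, find_prepare, the key format) is a hypothetical, simplified model for illustration, not the actual RGW types or mdlog layout:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model of mdlog entry status values.
enum class Status { Write, Remove, Complete };

struct LogEntry {
  uint64_t id;      // position in the log
  std::string key;  // metadata key, e.g. a bucket instance
  Status status;
};

// Drop every entry with id < upto, as log trimming would.
void trim(std::vector<LogEntry>& log, uint64_t upto) {
  log.erase(std::remove_if(log.begin(), log.end(),
                           [upto](const LogEntry& e) { return e.id < upto; }),
            log.end());
}

// Search backwards for the prepare entry that matches a COMPLETE entry.
// Returns false if the prepare is no longer in the log.
bool find_prepare(const std::vector<LogEntry>& log, const LogEntry& complete,
                  Status* op_out) {
  assert(complete.status == Status::Complete);
  for (auto it = log.rbegin(); it != log.rend(); ++it) {
    if (it->id < complete.id && it->key == complete.key &&
        it->status != Status::Complete) {
      *op_out = it->status;
      return true;
    }
  }
  return false;
}
```

Once trim() has removed the prepare, find_prepare() fails for the surviving COMPLETE entry, so the op type is lost.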

In RGWMetaSyncSingleEntryCR, metadata sync filters out any entries that
aren't MDLOG_STATUS_COMPLETE, and tries to infer the deletes based on
whether RGWReadRemoteMetadataCR returns ENOENT. This part should be
explicit if it's going to trigger further object deletion, so I'd like
to add a separate 'op' field to the mdlog for this.

Yeah, makes sense.
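
As a rough sketch of what an explicit op field could buy us on the sync side (the names Op, SyncAction and decide are made up for illustration and are not existing RGW code; the fallback branch models today's inference from ENOENT):

```cpp
#include <string>

// Hypothetical values for the proposed 'op' field. Unknown stands for
// entries written before the field existed.
enum class Op { Unknown, Write, Remove };
enum class Status { Write, Remove, Complete };

struct LogEntry {
  std::string key;
  Status status;
  Op op;
};

enum class SyncAction { Skip, FetchAndStore, RemoveLocal };

// Decide what metadata sync should do with one log entry.
// remote_exists models whether the remote read succeeded (false on ENOENT).
SyncAction decide(const LogEntry& e, bool remote_exists) {
  if (e.status != Status::Complete) {
    return SyncAction::Skip;  // sync only acts on completed transactions
  }
  switch (e.op) {
    case Op::Remove:
      return SyncAction::RemoveLocal;  // explicit delete, safe to trigger cleanup
    case Op::Write:
      return SyncAction::FetchAndStore;
    case Op::Unknown:
      break;
  }
  // Entry without an op: fall back to inferring the delete from ENOENT.
  return remote_exists ? SyncAction::FetchAndStore : SyncAction::RemoveLocal;
}
```

With the op recorded on the COMPLETE entry, the delete is explicit and no longer depends on the prepare surviving trimming.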

I'm also wondering if this separate 'prepare' entry is worth writing,
given that we ignore it during sync - I'd like to remove it if we can,
the same way I proposed for the bucket index log in
https://github.com/ceph/ceph/pull/26755. Do you see a reason to keep
either of those?

The prepare was created for cases where we fail to write the final
complete log entry (e.g., due to crash) but after the metadata entry
was already written. In the original design we'd identify cases where
we didn't get a complete after the prepare and recover. However, this
was never implemented, so losing the prepare as it is doesn't make
things worse than they are. The question is how do we solve the case
that I described. In the data case it's not an issue, as the bucket
index has the dir_suggest mechanism that deals with it and implicitly
solves the problem. In the metadata case there is no such mechanism.
One way to go there might be to maintain a journal of metadata
changes. In any case, I do think the prepare can and needs to go.

Okay, thanks. Maybe we could write these prepare entries to a separate
index in omap, and have the complete step delete them. That way we could
avoid replicating the prepares to other zones, and we could replay just
the incomplete ops on startup instead of having to list the entire mdlog.
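
Roughly what I have in mind, as an in-memory sketch only. MetadataLog and its members are hypothetical, and a std::map stands in for the omap index; real code would need the omap write and the log append to be ordered or atomic:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

enum class Op { Write, Remove };

struct PendingOp {
  std::string key;  // metadata key, e.g. a bucket instance
  Op op;
};

// Hypothetical model: prepares live in a local pending index,
// and only completed ops reach the replicated log.
class MetadataLog {
 public:
  // Record the intent locally. Nothing is replicated yet.
  uint64_t prepare(const std::string& key, Op op) {
    uint64_t id = ++next_id_;
    pending_[id] = PendingOp{key, op};
    return id;
  }

  // Append the finished op to the replicated log and drop its prepare.
  void complete(uint64_t id) {
    auto it = pending_.find(id);
    if (it == pending_.end()) return;  // unknown or already completed
    log_.push_back(it->second);
    pending_.erase(it);
  }

  // On startup: the ops that were prepared but never completed.
  std::vector<PendingOp> incomplete() const {
    std::vector<PendingOp> out;
    for (const auto& kv : pending_) out.push_back(kv.second);
    return out;
  }

  const std::vector<PendingOp>& replicated() const { return log_; }

 private:
  uint64_t next_id_ = 0;
  std::map<uint64_t, PendingOp> pending_;  // stands in for the omap index
  std::vector<PendingOp> log_;             // stands in for the mdlog
};
```

Recovery then only has to walk incomplete(), which stays small as long as most ops complete normally.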

In the meantime, I'll raise a PR to stop logging the prepares, and add a
field for the op.