On Mon, Apr 15, 2019 at 12:41 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>
> On 4/15/19 1:25 PM, Yehuda Sadeh-Weinraub wrote:
> > On Mon, Apr 15, 2019 at 10:09 AM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
> >> Hi Yehuda,
> >>
> >> I'm working on a design for the cleanup of deleted buckets in multisite.
> >> To do this, I'd like to trigger some actions on secondary zones when
> >> metadata sync sees a bucket instance get deleted. The first obstacle
> >> here is that metadata sync can't differentiate between writes and
> >> deletes due to how the mdlog transactions are structured.
> >>
> >> RGWMetadataManager::pre_modify() writes an mdlog entry with the status
> >> of MDLOG_STATUS_WRITE/REMOVE, and post_modify() completes the
> >> transaction with a MDLOG_STATUS_COMPLETE entry. So only the 'prepare'
> >> step knows what kind of op it was, and sync can't reliably associate a
> >> COMPLETE with its prepare because mdlog trimming may have deleted the
> >> prepare.
> >>
> >> In RGWMetaSyncSingleEntryCR, metadata sync filters out any entries that
> >> aren't MDLOG_STATUS_COMPLETE, and tries to infer the deletes based on
> >> whether RGWReadRemoteMetadataCR returns ENOENT. This part should be
> >> explicit if it's going to trigger further object deletion, so I'd like
> >> to add a separate 'op' field to the mdlog for this.
> > Yeah, makes sense.
> >
> >> I'm also wondering if this separate 'prepare' entry is worth writing,
> >> given that we ignore it during sync - I'd like to remove it if we can,
> >> the same way I proposed for the bucket index log in
> >> https://github.com/ceph/ceph/pull/26755. Do you see a reason to keep
> >> either of those?
> >>
> > The prepare was created for cases where we fail to write the final
> > complete log entry (e.g., due to a crash) after the metadata entry was
> > already written. In the original design we'd identify cases where we
> > didn't get a complete after the prepare and recover. However, this was
> > never implemented, so losing the prepare as it is doesn't make things
> > worse than they are. The question is how we solve the case I described.
> > In the data case it's not an issue, as the bucket index has the
> > dir_suggest mechanism that deals with it and implicitly solves the
> > problem. In the metadata case there is no such mechanism. One way to go
> > there might be to maintain a journal of metadata changes. In any case I
> > do think the prepare can and needs to go.
>
> Okay, thanks. Maybe we could write these prepare entries to a separate
> index in omap, and have the complete step delete them. That way we could
> avoid replicating the prepares to other zones, and we could replay just
> the incomplete ops on startup instead of having to list the entire mdlog.

Another option is to do a two-phase commit on the metadata entries, and
recover similarly to the dir_suggest mechanism. Have an attribute (local
to the zone) on the metadata object that signifies whether it's
'complete'. So a write sequence to a metadata entry on a zone that needs
to log metadata entries will be:

- write entry (+attr.state=prepare)
- update mdlog
- update entry (attr.state=complete)

When reading the metadata object we'll need to read the status attribute,
and if the state is not 'complete' then we'll rewrite the entry (with an
appropriate guard). This should scale better than maintaining an omap
index for prepares. The cost is an extra attribute and another write on
each metadata entry update.
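To make the ordering concrete, here is a rough sketch of that sequence in
terms of raw librados operations. This is illustrative only, not the
actual rgw code path: the xattr name, the oid handling and the
append_mdlog_entry hook are placeholders, and retry handling is omitted.

    // Illustrative only: raw librados calls showing the ordering of the
    // proposed two-phase write. Names below are placeholders, not the
    // actual rgw code paths.
    #include <rados/librados.hpp>
    #include <functional>
    #include <string>

    static const char *STATE_XATTR = "user.rgw.meta_state";  // hypothetical

    int write_metadata_entry(librados::IoCtx& ioctx,
                             const std::string& oid,
                             librados::bufferlist entry,
                             const std::function<int()>& append_mdlog_entry)
    {
      // 1) write the entry together with attr.state=prepare in a single op
      librados::bufferlist prepare;
      prepare.append(std::string("prepare"));
      librados::ObjectWriteOperation wr;
      wr.write_full(entry);
      wr.setxattr(STATE_XATTR, prepare);
      int r = ioctx.operate(oid, &wr);
      if (r < 0)
        return r;

      // 2) update the mdlog (synchronous, as today)
      r = append_mdlog_entry();
      if (r < 0)
        return r;

      // 3) flip the attribute to 'complete' (could be issued async)
      librados::bufferlist complete;
      complete.append(std::string("complete"));
      librados::ObjectWriteOperation fin;
      fin.setxattr(STATE_XATTR, complete);
      return ioctx.operate(oid, &fin);
    }

    // Read path: if the state attr isn't 'complete', the writer may have
    // died between steps 2 and 3, so rewrite/replay the entry and mark it
    // complete under a guard so we don't race with an in-flight writer.
    int read_repair_if_needed(librados::IoCtx& ioctx, const std::string& oid)
    {
      librados::bufferlist state;
      int r = ioctx.getxattr(oid, STATE_XATTR, state);
      if (r < 0)
        return r;
      if (state.to_str() == "complete")
        return 0;

      librados::bufferlist prepare, complete;
      prepare.append(std::string("prepare"));
      complete.append(std::string("complete"));
      librados::ObjectWriteOperation repair;
      // only complete the entry if it is still in 'prepare' state
      repair.cmpxattr(STATE_XATTR, LIBRADOS_CMPXATTR_OP_EQ, prepare);
      // ... re-log / rewrite the entry here as needed ...
      repair.setxattr(STATE_XATTR, complete);
      return ioctx.operate(oid, &repair);
    }

The cmpxattr call just stands in for whatever guard the repair ends up
using; the point is only the prepare/log/complete ordering.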
We already update the mdlog synchronously, so updating the attribute
asynchronously shouldn't add much latency. The flip side is that there
will be no tracking of which metadata entries are in 'prepare' state; I'm
not sure whether that's a requirement.

Yehuda

>
> In the meantime, I'll raise a pr to stop logging the prepares, and add a
> field for the op.
>
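For illustration, an explicit op in the log payload could look roughly
like the sketch below. The struct and enum names are hypothetical, not
the current mdlog encoding; the point is just that sync could branch on
the logged op instead of probing the remote entry and special-casing
ENOENT.

    #include <cstdint>
    #include <string>

    // Hypothetical payload shape, not the existing on-disk encoding.
    enum class MDLogOp : uint8_t {
      Write  = 1,
      Remove = 2,
    };

    struct mdlog_entry_payload {
      MDLogOp op;            // explicit write vs. remove
      std::string section;   // e.g. "bucket.instance"
      std::string key;       // metadata key
      // version/status fields as today would follow here
    };

    // On the sync side, RGWMetaSyncSingleEntryCR could then branch:
    //
    //   if (payload.op == MDLogOp::Remove) {
    //     // apply the delete locally; for bucket.instance entries this
    //     // is also the hook to kick off cleanup on secondary zones
    //   } else {
    //     // fetch and store the remote entry, as today
    //   }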