Re: rgw multisite: mdlog transactions for metadata sync

Casey Bodley <cbodley@xxxxxxxxxx> · Tue, 16 Apr 2019 15:41:27 -0400

On 4/16/19 1:20 PM, Yehuda Sadeh-Weinraub wrote:
On Mon, Apr 15, 2019 at 12:41 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:

On 4/15/19 1:25 PM, Yehuda Sadeh-Weinraub wrote:
On Mon, Apr 15, 2019 at 10:09 AM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
Hi Yehuda,

I'm working on a design for the cleanup of deleted buckets in multisite.
To do this, I'd like to trigger some actions on secondary zones when
metadata sync sees a bucket instance get deleted. The first obstacle
here is that metadata sync can't differentiate between writes and
deletes due to how the mdlog transactions are structured.

RGWMetadataManager::pre_modify() writes an mdlog entry with the status
of MDLOG_STATUS_WRITE/REMOVE, and post_modify() completes the
transaction with an MDLOG_STATUS_COMPLETE entry. So only the 'prepare'
step knows what kind of op it was, and sync can't reliably associate a
COMPLETE with its prepare because mdlog trimming may have deleted the
prepare.

In RGWMetaSyncSingleEntryCR, metadata sync filters out any entries that
aren't MDLOG_STATUS_COMPLETE, and tries to infer the deletes based on
whether RGWReadRemoteMetadataCR returns ENOENT. This part should be
explicit if it's going to trigger further object deletion, so I'd like
to add a separate 'op' field to the mdlog for this.
Yeah, makes sense.
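
To make the proposal concrete, here is a minimal sketch of how an explicit
'op' field could let sync classify a COMPLETE entry without its prepare.
The Status values mirror the MDLOG_STATUS_* names above; Op, MdlogEntry
and classify() are hypothetical names for illustration only, not the
actual rgw types.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    WRITE = "write"        # prepare entry for a write
    REMOVE = "remove"      # prepare entry for a delete
    COMPLETE = "complete"  # closes the transaction


class Op(Enum):
    WRITE = "write"
    REMOVE = "remove"


@dataclass
class MdlogEntry:
    key: str
    status: Status
    op: Optional[Op] = None  # hypothetical new field; None on legacy entries


def classify(entry: MdlogEntry, remote_has_key: bool) -> Optional[Op]:
    """Return the op that sync should apply, or None if it skips the entry."""
    if entry.status is not Status.COMPLETE:
        return None  # sync already filters out the prepare entries
    if entry.op is not None:
        return entry.op  # explicit: no need to find the trimmed prepare
    # Legacy fallback: infer a delete from the remote read failing (ENOENT).
    return Op.WRITE if remote_has_key else Op.REMOVE
```

With the field present, a delete stays a delete even if the prepare was
trimmed or the key has since been recreated on the remote.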

I'm also wondering if this separate 'prepare' entry is worth writing,
given that we ignore it during sync - I'd like to remove it if we can,
the same way I proposed for the bucket index log in
https://github.com/ceph/ceph/pull/26755. Do you see a reason to keep
either of those?

The prepare was created for cases where we fail to write the final
complete log entry (e.g., due to crash) but after the metadata entry
was already written. In the original design we'd identify cases where
we didn't get a complete after the prepare and recover. However, this
was never implemented, so losing the prepare as it is doesn't make
things worse than they are. The question is how do we solve the case
that I described. In the data case it's not an issue, as the bucket
index has the dir_suggest mechanism that deals with it and implicitly
solves the problem. In the metadata case there is no such mechanism.
One way to handle it might be to maintain a journal of
metadata changes. In any case I do think the prepare can and needs to
go.
Okay, thanks. Maybe we could write these prepare entries to a separate
index in omap, and have the complete step delete them. That way we could
avoid replicating the prepares to other zones, and we could replay just
the incomplete ops on startup instead of having to list the entire mdlog.
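
A rough sketch of that idea, using plain dicts and lists as stand-ins for
the metadata objects, the mdlog and the omap index. All names here are
hypothetical, and it assumes the ops are idempotent so that replaying one
on startup is safe.

```python
class MetadataStore:
    def __init__(self):
        self.objects = {}   # metadata entries, key -> value
        self.mdlog = []     # replicated log: only completed ops land here
        self.pending = {}   # zone-local index of prepares (stand-in for omap)
        self._next_id = 0

    def prepare(self, key, op, value=None):
        """Record the intent locally; nothing is replicated yet."""
        txn_id = self._next_id
        self._next_id += 1
        self.pending[txn_id] = (key, op, value)
        return txn_id

    def apply(self, txn_id):
        """Perform the metadata change itself (idempotent by assumption)."""
        key, op, value = self.pending[txn_id]
        if op == "remove":
            self.objects.pop(key, None)
        else:
            self.objects[key] = value

    def complete(self, txn_id):
        """Log the op for sync and drop its prepare from the index."""
        key, op, _ = self.pending[txn_id]
        self.mdlog.append((key, op))
        del self.pending[txn_id]

    def recover(self):
        """On startup, replay only the ops that never completed."""
        for txn_id in sorted(self.pending):
            self.apply(txn_id)
            self.complete(txn_id)
```

Recovery only has to walk the pending index, so its cost depends on the
number of interrupted ops rather than on the length of the mdlog.
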
Another option is to do a two phase commit on the metadata entries,
and recover similarly to the dir_suggest mechanism. Have an attribute
(local to the zone) on the metadata object that signifies whether it's
'complete'. So a write sequence to a metadata entry on a zone that
needs to log metadata entries will be:
  - write entry (+attr.state=prepare)
  - update mdlog
  - update entry (attr.state=complete)

When reading the metadata object we'll need to read the status
attribute, and if the state is not 'complete' then we'll rewrite the
entry (with appropriate guard).
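
The sequence above could be sketched roughly as follows. This is an
in-memory model with hypothetical names (Zone, crash_after, _repair), not
rgw code; the guard is modeled as "only finish the same value we read",
and it assumes sync tolerates a duplicate mdlog entry after a repair.

```python
class Zone:
    def __init__(self):
        self.entries = {}  # key -> {"value": ..., "state": "prepare" | "complete"}
        self.mdlog = []

    def write(self, key, value, crash_after=None):
        """Three-step write; crash_after simulates a failure for testing."""
        self.entries[key] = {"value": value, "state": "prepare"}
        if crash_after == "entry":
            return
        self.mdlog.append(key)
        if crash_after == "mdlog":
            return
        self.entries[key]["state"] = "complete"

    def read(self, key):
        """Read the entry, finishing an interrupted write if one is found."""
        entry = self.entries[key]
        if entry["state"] != "complete":
            self._repair(key, expected=entry["value"])
        return self.entries[key]["value"]

    def _repair(self, key, expected):
        entry = self.entries.get(key)
        # Guard: skip if someone else rewrote or completed it in the meantime.
        if entry is None or entry["value"] != expected:
            return
        if entry["state"] == "complete":
            return
        self.mdlog.append(key)  # may duplicate a log entry; assumed harmless
        entry["state"] = "complete"
```

Note that nothing is repaired until someone reads the entry, which is the
tracking gap mentioned below.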

This should scale better than maintaining omap for prepares. The cost
is an extra attribute and another write on each metadata entry update. We
currently update the mdlog synchronously, so asynchronously updating
the attribute shouldn't add much latency. The flip side is that there
will be no tracking of what metadata entries are in 'prepare' state.
I'm not sure if that's a requirement.

In the context of this bucket cleanup project, the main requirement
would be to notice deletes that didn't commit - so using attributes on
the metadata object itself wouldn't help with that case. The dir_suggest
mechanism has a similar issue, where it only tries to recover after
a read. That's why I think we'd need some independent way to list the
uncommitted changes. I imagine we'll also need to support multiple
pending operations on an object, where we might prepare both and then
complete/abort each depending on what cls_version returns.
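
As a rough illustration of what I mean by multiple pending operations,
here is a small in-memory sketch. The integer version counter is only a
stand-in for the cls_version check, not its actual API, and every name in
it is hypothetical.

```python
class VersionedObject:
    def __init__(self):
        self.version = 0
        self.value = None
        self.pending = {}  # txn_id -> (op, value, version seen at prepare)
        self._next_id = 0

    def prepare(self, op, value=None):
        """Record a pending op along with the version it was based on."""
        txn_id = self._next_id
        self._next_id += 1
        self.pending[txn_id] = (op, value, self.version)
        return txn_id

    def finish(self, txn_id):
        """Complete the op if its base version is still current, else abort."""
        op, value, base = self.pending.pop(txn_id)
        if base != self.version:
            return "aborted"  # lost the race to another writer
        self.value = None if op == "remove" else value
        self.version += 1
        return "completed"

    def uncommitted(self):
        """List pending ops without reading or touching the object itself."""
        return [(txn_id, op) for txn_id, (op, _, _) in sorted(self.pending.items())]
```

The point of uncommitted() is that a delete which never committed can be
found by listing, instead of waiting for a read to trip over it.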

Since metadata isn't mutated that frequently, I'm less concerned about 
overhead from the transactions. My suggestion also added an extra omap 
delete to the commit step, though it could happen in the same osd op as 
the mdlog commit (and probably be async).