Re: rgw multisite: mdlog transactions for metadata sync

On 4/16/19 1:20 PM, Yehuda Sadeh-Weinraub wrote:
On Mon, Apr 15, 2019 at 12:41 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:

On 4/15/19 1:25 PM, Yehuda Sadeh-Weinraub wrote:
On Mon, Apr 15, 2019 at 10:09 AM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
Hi Yehuda,

I'm working on a design for the cleanup of deleted buckets in multisite.
To do this, I'd like to trigger some actions on secondary zones when
metadata sync sees a bucket instance get deleted. The first obstacle
here is that metadata sync can't differentiate between writes and
deletes due to how the mdlog transactions are structured.

RGWMetadataManager::pre_modify() writes an mdlog entry with the status
of MDLOG_STATUS_WRITE/REMOVE, and post_modify() completes the
transaction with a MDLOG_STATUS_COMPLETE entry. So only the 'prepare'
step knows what kind of op it was, and sync can't reliably associate a
COMPLETE with its prepare because mdlog trimming may have deleted the
prepare.
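
To illustrate the trimming problem, here is a minimal Python sketch. It is
not rgw code: `MdlogEntry`, `find_prepare` and the list-based log are
hypothetical stand-ins, and only the status names come from the description
above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Status names mirror the ones mentioned above; everything else is hypothetical.
WRITE, REMOVE, COMPLETE = "write", "remove", "complete"


@dataclass
class MdlogEntry:
    key: str     # metadata key, e.g. "bucket.instance:foo"
    status: str  # WRITE/REMOVE for the prepare, COMPLETE for the commit


def find_prepare(log: List[MdlogEntry], complete_idx: int) -> Optional[MdlogEntry]:
    """Walk backwards from a COMPLETE entry looking for its prepare."""
    key = log[complete_idx].key
    for entry in reversed(log[:complete_idx]):
        if entry.key == key and entry.status in (WRITE, REMOVE):
            return entry
    return None  # the prepare is gone, so the op type is unknown


log = [
    MdlogEntry("bucket.instance:foo", REMOVE),    # as written by pre_modify()
    MdlogEntry("bucket.instance:foo", COMPLETE),  # as written by post_modify()
]
before_trim = find_prepare(log, 1)      # finds the REMOVE prepare

trimmed = log[1:]                       # trimming dropped the oldest entry
after_trim = find_prepare(trimmed, 0)   # None: write or delete? can't tell
```

Once the prepare has been trimmed, the remaining COMPLETE entry carries no
information about whether the op was a write or a remove.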

In RGWMetaSyncSingleEntryCR, metadata sync filters out any entries that
aren't MDLOG_STATUS_COMPLETE, and tries to infer the deletes based on
whether RGWReadRemoteMetadataCR returns ENOENT. This part should be
explicit if it's going to trigger further object deletion, so I'd like
to add a separate 'op' field to the mdlog for this.
Yeah, makes sense.
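
A rough sketch of how an explicit 'op' field could let sync decide without
inferring from ENOENT. This is an illustration under assumptions, not the
actual mdlog format: the `Op` enum, the `MdlogEntry` layout and the action
names returned by `sync_action` are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Op(Enum):
    WRITE = "write"
    REMOVE = "remove"


@dataclass
class MdlogEntry:
    key: str
    status: str  # "complete", or a prepare status
    op: Op       # the proposed field: recorded on every entry


def sync_action(entry: MdlogEntry) -> str:
    """Pick a sync action from the entry alone (hypothetical action names)."""
    if entry.status != "complete":
        return "skip"          # sync still ignores non-COMPLETE entries
    if entry.op is Op.REMOVE:
        return "remove_local"  # explicit delete: safe to trigger cleanup
    return "fetch_remote"      # write: read the remote metadata and apply it


deleted = MdlogEntry("bucket.instance:foo", "complete", Op.REMOVE)
written = MdlogEntry("bucket.instance:bar", "complete", Op.WRITE)
prepare = MdlogEntry("bucket.instance:baz", "write", Op.WRITE)
```

With the op carried on the COMPLETE entry itself, a trimmed prepare no longer
matters for telling deletes from writes.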

I'm also wondering if this separate 'prepare' entry is worth writing,
given that we ignore it during sync - I'd like to remove it if we can,
the same way I proposed for the bucket index log in
https://github.com/ceph/ceph/pull/26755. Do you see a reason to keep
either of those?

The prepare was created for cases where we fail to write the final
complete log entry (e.g., due to crash) but after the metadata entry
was already written. In the original design we'd identify cases where
we didn't get a complete after the prepare and recover. However, this
was never implemented, so losing the prepare as it is doesn't make
things worse than they are. The question is how do we solve the case
that I described. In the data case it's not an issue, as the bucket
index has the dir_suggest mechanism that deals with it and implicitly
solves the problem. In the metadata case there is no such mechanism.
One way to go might be to maintain a journal of metadata changes. In
any case I do think the prepare can and needs to go.
Okay, thanks. Maybe we could write these prepare entries to a separate
index in omap, and have the complete step delete them. That way we could
avoid replicating the prepares to other zones, and we could replay just
the incomplete ops on startup instead of having to list the entire mdlog.
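
The separate-index idea can be sketched as follows. This is a toy in-memory
model, not rgw or librados code: the `MetadataStore` class, the dict standing
in for an omap index, and the transaction ids are all hypothetical.

```python
class MetadataStore:
    """Toy model: prepares live in a local index, only commits hit the mdlog."""

    def __init__(self):
        self.mdlog = []    # replicated to other zones
        self.pending = {}  # stand-in for a zone-local omap index: txn -> (key, op)
        self._next_txn = 0

    def prepare(self, key, op):
        txn = self._next_txn
        self._next_txn += 1
        self.pending[txn] = (key, op)  # never written to the mdlog
        return txn

    def complete(self, txn):
        # Remove the prepare and log the commit together; in the real system
        # these would ideally go out in the same osd op.
        key, op = self.pending.pop(txn)
        self.mdlog.append((key, op, "complete"))

    def incomplete_ops(self):
        """What a restart would replay: only the uncommitted prepares."""
        return sorted(self.pending.items())


store = MetadataStore()
t1 = store.prepare("bucket.instance:a", "write")
t2 = store.prepare("bucket.instance:b", "remove")
store.complete(t1)
# Simulated crash here: t2 was prepared but never completed.
to_replay = store.incomplete_ops()
```

On startup only `to_replay` needs to be examined, rather than listing the
entire mdlog, and other zones never see the prepare entries.
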
Another option is to do a two phase commit on the metadata entries,
and recover similarly to the dir_suggest mechanism. Have an attribute
(local to the zone) on the metadata object that signifies whether it's
'complete'. So a write sequence to a metadata entry on a zone that
needs to log metadata entries will be:
  - write entry (+attr.state=prepare)
  - update mdlog
  - update entry (attr.state=complete)

When reading the metadata object we'll need to read the status
attribute, and if the state is not 'complete' then we'll rewrite the
entry (with appropriate guard).
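
The write sequence and read-time recovery above could look roughly like this.
It is a simplified sketch under assumptions: the `Zone` class, the
`crash_after` parameter (used only to simulate a failure) and the dict-based
storage are hypothetical, and the "appropriate guard" against concurrent
writers is left out.

```python
class Zone:
    """Toy model of the attribute-based two-phase commit."""

    def __init__(self):
        self.objects = {}  # key -> {"data": ..., "state": "prepare" | "complete"}
        self.mdlog = []

    def write(self, key, data, crash_after=None):
        # 1. write entry (+attr.state=prepare)
        self.objects[key] = {"data": data, "state": "prepare"}
        if crash_after == "entry":
            return
        # 2. update mdlog
        self.mdlog.append(key)
        if crash_after == "mdlog":
            return
        # 3. update entry (attr.state=complete)
        self.objects[key]["state"] = "complete"

    def read(self, key):
        obj = self.objects[key]
        if obj["state"] != "complete":
            # Recovery: rewrite the entry so the mdlog update is not lost.
            # A real implementation would need a guard (e.g. a version
            # check) here; this sketch omits it.
            self.write(key, obj["data"])
        return self.objects[key]["data"]


zone = Zone()
zone.write("user:alice", {"v": 1}, crash_after="entry")  # crash before mdlog
log_before_read = list(zone.mdlog)
data = zone.read("user:alice")                           # triggers recovery
```

Note that recovery only happens when something reads the object; an entry
that is never read again stays in the 'prepare' state.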

This should scale better than maintaining omap for prepares. The cost
is an extra attribute and another write on each metadata entry update. We
currently update the mdlog synchronously, so asynchronously updating
the attribute shouldn't add much latency. The flip side is that there
will be no tracking of which metadata entries are in the 'prepare' state.
I'm not sure if that's a requirement.

In the context of this bucket cleanup project, the main requirement would be to notice deletes that didn't commit - so using attributes on the metadata object itself wouldn't help with that case. The dir_suggest mechanism has a similar issue, where it only tries to recover after a read. That's why I think we'd need some independent way to list the uncommitted changes. I imagine we'll also need to support multiple pending operations on an object, where we might prepare both and then complete/abort each depending on what cls_version returns.
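
A loose sketch of what multiple pending operations on one object might look like. This is not the cls_version API: the `VersionedObject` class and its integer version counter are hypothetical stand-ins for whatever check cls_version would perform, and the listable `pending` dict stands in for the independent index of uncommitted changes.

```python
class VersionedObject:
    """Toy model: several prepares may be outstanding; a version check
    at finish time decides whether each one completes or aborts."""

    def __init__(self):
        self.version = 0
        self.value = None
        self.pending = {}  # txn -> (version seen at prepare, new value)
        self._next_txn = 0

    def prepare(self, new_value):
        txn = self._next_txn
        self._next_txn += 1
        self.pending[txn] = (self.version, new_value)
        return txn

    def finish(self, txn):
        expected, new_value = self.pending.pop(txn)
        if expected != self.version:
            return "abort"     # another op committed first
        self.value = new_value
        self.version += 1
        return "complete"


obj = VersionedObject()
a = obj.prepare("v1")
b = obj.prepare("v2")          # both ops are pending at once
uncommitted = sorted(obj.pending)
result_a = obj.finish(a)
result_b = obj.finish(b)
```

Because the pending set is kept apart from the object, uncommitted operations (including deletes) can be listed without waiting for a read of the object.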

Since metadata isn't mutated that frequently, I'm less concerned about overhead from the transactions. My suggestion also added an extra omap delete to the commit step, though it could happen in the same osd op as the mdlog commit (and probably be async).



