Re: Why does Erasure-pool not support omap?

On Mon, Oct 30, 2017 at 7:26 PM Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> On Mon, 30 Oct 2017, Gregory Farnum wrote:
> > On Thu, Oct 26, 2017 at 9:21 AM Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
> > >
> > > On 10/26/2017 07:26 AM, Sage Weil wrote:
> > > > On Thu, 26 Oct 2017, Gregory Farnum wrote:
> > > >> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
> > > >>> On 10/25/2017 05:16 AM, Sage Weil wrote:
> > > >>>>
> > > >>>> Hi Xingguo,
> > > >>>>
> > > >>>> On Wed, 25 Oct 2017, xie.xingguo@xxxxxxxxxx wrote:
> > > >>>>>
> > > >>>>>         I wonder why erasure-pools cannot support omap currently.
> > > >>>>>
> > > >>>>>         The simplest way for erasure-pools to support omap that I can figure
> > > >>>>> out would be duplicating the omap on every shard.
> > > >>>>>
> > > >>>>>         Is it because it consumes too much space when k + m gets bigger?
> > > >>>>
> > > >>>>
> > > >>>> Right.  There isn't a nontrivial way to actually erasure code it, and
> > > >>>> duplicating on every shard is inefficient.
> > > >>>>
> > > >>>> One reasonableish approach would be to replicate the omap data on m+1
> > > >>>> shards.  But it's a bit of work to implement and nobody has done it.
> > > >>>>
> > > >>>> I can't remember if there were concerns with this approach or it was just
> > > >>>> a matter of time/resources... Josh? Greg?
> > > >>>
> > > >>>
> > > >>> It restricts us to erasure codes like Reed-Solomon where a subset of shards
> > > >>> is always updated. I think this is a reasonable trade-off, though; it's just
> > > >>> a matter of implementing it. We haven't written up the required peering
> > > >>> changes, but they did not seem too difficult to implement.
> > > >>>
> > > >>> Some notes on the approach are here - just think of 'replicating omap'
> > > >>> as a partial write to m+1 shards:
> > > >>>
> > > >>> http://pad.ceph.com/p/ec-partial-writes
> > > >>
> > > >> Yeah. To expand a bit on why this only works for Reed-Solomon,
> > > >> consider the minimum and appropriate number of copies — and the actual
> > > >> shard placement — for local recovery codes. :/ We were unable to
> > > >> generalize for that (or indeed for SHEC, IIRC) when whiteboarding.
> > > >>
> > > >> I'm also still nervous that this might do weird things to our recovery
> > > >> and availability patterns in more complex failure cases, but I don't
> > > >> have any concrete issues.
> > > >
> > > > It seems like the minimum-viable variation of this is that we don't change
> > > > any of the peering or logging behavior at all, but just send the omap
> > > > writes to all shards (like any other write), and only the anointed shards
> > > > persist them.
> > > >
> > > > That leaves lots of room for improvement, but it makes the feature work
> > > > without many changes, and means we can drop the specialness around rbd
> > > > images in EC pools.
> > >
> > > Won't that still require recovery and read path changes?
> >
> >
> > I also don't understand at all how this would work. Can you expand, Sage?
>
> On write, the ECTransaction collects the omap operation.  We either send
> it to all shards or elide just the omap key/value data for shard_id > k.
> For shard_id <= k, we write the omap data to the local object.  We still
> send the write op to all shards with attrs and pg log entries.
>
> We take care to always select the first acting shard as the primary, which
> will ensure a shard_id <= k if we go active, such that cls operations and
> omap reads can be handled locally.
>
> Hmm, I think the problem is with rollback, though.  IIRC the code is
> structured around rollback and not rollforward, and omap writes are blind.
>
> So, not trivial, but it doesn't require any of the stuff we were talking
> about before where we'd only send writes to a subset of shards and have
> incomplete pg logs on each shard.


Okay, so by "don't change any of the peering or logging behavior at
all", you meant we didn't have to do any of the stuff that starts
accounting for differing pg versions on each shard's object. But of
course we still need to make a number of changes to the peering and
recovery code so we select the right shards.
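
To make the shard-selection point a little more concrete, this is the kind
of constraint I mean for peering; again just a toy sketch with invented
names (ShardInfo, choose_primary), not a patch against the actual peering
code:

// Toy sketch; ShardInfo and choose_primary are invented names.
#include <optional>
#include <vector>

struct ShardInfo {
  unsigned shard_id;
  int osd;  // -1 if this shard currently has no live OSD
};

// Pick a primary from the acting set such that it is one of the shards
// that carries omap (shard_id <= k in Sage's sketch), so cls operations
// and omap reads can be served locally by the primary.
std::optional<unsigned> choose_primary(const std::vector<ShardInfo>& acting,
                                       unsigned k) {
  for (const auto& s : acting) {
    if (s.osd >= 0 && s.shard_id <= k)
      return s.shard_id;  // first available omap-bearing shard wins
  }
  // No omap-bearing shard is available: we cannot serve omap locally and
  // would need to recover an omap copy before going active.
  return std::nullopt;
}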

We *could* do that, but it seems like another of the "sorta there"
features we end up regretting. I guess the apparent hurry to get it in
before other EC pool enhancements are ready is to avoid the rbd header
pool? How much effort does that actually save? (Even the minimal
peering+recovery changes here will take a fair bit of doing and a lot
of QA qualification.)
-Greg


