On Thu, Oct 26, 2017 at 9:21 AM Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>
> On 10/26/2017 07:26 AM, Sage Weil wrote:
> > On Thu, 26 Oct 2017, Gregory Farnum wrote:
> >> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
> >>> On 10/25/2017 05:16 AM, Sage Weil wrote:
> >>>>
> >>>> Hi Xingguo,
> >>>>
> >>>> On Wed, 25 Oct 2017, xie.xingguo@xxxxxxxxxx wrote:
> >>>>>
> >>>>> I wonder why erasure-coded pools cannot support omap currently.
> >>>>>
> >>>>> The simplest way for erasure-coded pools to support omap that I can
> >>>>> figure out would be duplicating the omap on every shard.
> >>>>>
> >>>>> Is it because it consumes too much space when k + m gets bigger?
> >>>>
> >>>> Right. There isn't a nontrivial way to actually erasure code it, and
> >>>> duplicating it on every shard is inefficient.
> >>>>
> >>>> One reasonable-ish approach would be to replicate the omap data on m+1
> >>>> shards. But it's a bit of work to implement and nobody has done it.
> >>>>
> >>>> I can't remember whether there were concerns with this approach or it
> >>>> was just a matter of time/resources... Josh? Greg?
> >>>
> >>> It restricts us to erasure codes like Reed-Solomon where a subset of
> >>> shards is always updated. I think this is a reasonable trade-off,
> >>> though; it's just a matter of implementing it. We haven't written up
> >>> the required peering changes, but they did not seem too difficult to
> >>> implement.
> >>>
> >>> Some notes on the approach are here -- just think of 'replicating omap'
> >>> as a partial write to m+1 shards:
> >>>
> >>> http://pad.ceph.com/p/ec-partial-writes
> >>
> >> Yeah. To expand a bit on why this only works for Reed-Solomon,
> >> consider the minimum and appropriate number of copies -- and the
> >> actual shard placement -- for local recovery codes. :/ We were unable
> >> to generalize it for that (or indeed for SHEC, IIRC) when
> >> whiteboarding.
> >>
> >> I'm also still nervous that this might do weird things to our recovery
> >> and availability patterns in more complex failure cases, but I don't
> >> have any concrete issues.
> >
> > It seems like the minimum-viable variation of this is that we don't
> > change any of the peering or logging behavior at all, but just send the
> > omap writes to all shards (like any other write), and only the anointed
> > shards persist them.
> >
> > That leaves lots of room for improvement, but it makes the feature work
> > without many changes, and means we can drop the specialness around rbd
> > images in EC pools.
>
> Won't that still require recovery and read path changes?

I also don't understand at all how this would work. Can you expand, Sage?
-Greg
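
A minimal sketch of the "replicate omap on m+1 shards" idea discussed above, assuming a Reed-Solomon k+m layout where keeping the omap on any fixed set of m+1 shards leaves at least one surviving copy after any m shard failures. The shard-selection rule (simply taking shard ids 0..m) and the helper name are hypothetical illustrations, not Ceph's actual API:

  #include <iostream>
  #include <set>

  // Hypothetical helper (not Ceph's API): for a k+m erasure-coded PG,
  // return the shard ids that would persist the omap.  With Reed-Solomon,
  // any m+1 distinct shards suffice -- losing any m shards still leaves
  // one surviving omap copy -- so this sketch just takes shards 0..m.
  std::set<unsigned> omap_shards(unsigned m) {
    std::set<unsigned> shards;
    for (unsigned i = 0; i <= m; ++i)
      shards.insert(i);
    return shards;
  }

  int main() {
    // Example: k=4, m=2 -> omap kept on shards {0, 1, 2}, so the PG can
    // lose any 2 OSDs and still recover the omap from a survivor.
    for (unsigned s : omap_shards(2))
      std::cout << "omap shard " << s << "\n";
    return 0;
  }

In terms of the minimum-viable variant described above, omap writes would still fan out to all k+m shards like any other write, but only the shards in a set like this would persist the omap portion; recovery and reads would then have to source the omap from one of those shards.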