On Thu, Oct 26, 2017 at 9:21 AM Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>
> On 10/26/2017 07:26 AM, Sage Weil wrote:
> > On Thu, 26 Oct 2017, Gregory Farnum wrote:
> >> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
> >>> On 10/25/2017 05:16 AM, Sage Weil wrote:
> >>>>
> >>>> Hi Xingguo,
> >>>>
> >>>> On Wed, 25 Oct 2017, xie.xingguo@xxxxxxxxxx wrote:
> >>>>>
> >>>>> I wonder why erasure-coded pools cannot support omap currently.
> >>>>>
> >>>>> The simplest way for erasure-coded pools to support omap that I can
> >>>>> figure out would be duplicating the omap on every shard.
> >>>>>
> >>>>> Is it because it consumes too much space when k + m gets bigger?
> >>>>
> >>>> Right. There isn't a nontrivial way to actually erasure code it, and
> >>>> duplicating it on every shard is inefficient.
> >>>>
> >>>> One reasonable-ish approach would be to replicate the omap data on m+1
> >>>> shards. But it's a bit of work to implement and nobody has done it.
> >>>>
> >>>> I can't remember whether there were concerns with this approach or it
> >>>> was just a matter of time/resources... Josh? Greg?
> >>>
> >>> It restricts us to erasure codes like Reed-Solomon where a subset of
> >>> shards is always updated. I think this is a reasonable trade-off,
> >>> though; it's just a matter of implementing it. We haven't written up
> >>> the required peering changes, but they did not seem too difficult to
> >>> implement.
> >>>
> >>> Some notes on the approach are here -- just think of 'replicating omap'
> >>> as a partial write to m+1 shards:
> >>>
> >>> http://pad.ceph.com/p/ec-partial-writes
> >>
> >> Yeah. To expand a bit on why this only works for Reed-Solomon,
> >> consider the minimum and appropriate number of copies -- and the
> >> actual shard placement -- for local recovery codes. :/ We were unable
> >> to generalize it for that (or indeed for SHEC, IIRC) when
> >> whiteboarding.
> >>
> >> I'm also still nervous that this might do weird things to our recovery
> >> and availability patterns in more complex failure cases, but I don't
> >> have any concrete issues.
> >
> > It seems like the minimum-viable variation of this is that we don't
> > change any of the peering or logging behavior at all, but just send the
> > omap writes to all shards (like any other write), and only the anointed
> > shards persist them.
> >
> > That leaves lots of room for improvement, but it makes the feature work
> > without many changes, and means we can drop the specialness around rbd
> > images in EC pools.
>
> Won't that still require recovery and read path changes?

I also don't understand at all how this would work. Can you expand, Sage?
-Greg
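
A minimal sketch of the "replicate omap on m+1 shards" idea discussed above, assuming a Reed-Solomon k+m layout where keeping the omap on any fixed set of m+1 shards leaves at least one surviving copy after any m shard failures. The shard-selection rule (simply taking shard ids 0..m) and the helper name are hypothetical illustrations, not Ceph's actual API:

  #include <iostream>
  #include <set>

  // Hypothetical helper (not Ceph's API): for a k+m erasure-coded PG,
  // return the shard ids that would persist the omap.  With Reed-Solomon,
  // any m+1 distinct shards suffice -- losing any m shards still leaves
  // one surviving omap copy -- so this sketch just takes shards 0..m.
  std::set<unsigned> omap_shards(unsigned m) {
    std::set<unsigned> shards;
    for (unsigned i = 0; i <= m; ++i)
      shards.insert(i);
    return shards;
  }

  int main() {
    // Example: k=4, m=2 -> omap kept on shards {0, 1, 2}, so the PG can
    // lose any 2 OSDs and still recover the omap from a survivor.
    for (unsigned s : omap_shards(2))
      std::cout << "omap shard " << s << "\n";
    return 0;
  }

In terms of the minimum-viable variant described above, omap writes would still fan out to all k+m shards like any other write, but only the shards in a set like this would persist the omap portion; recovery and reads would then have to source the omap from one of those shards.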