thanks for the explanation, Jason

Matt

On Thu, Oct 26, 2017 at 11:49 AM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
> On Thu, Oct 26, 2017 at 11:35 AM, Matt Benjamin <mbenjami@xxxxxxxxxx> wrote:
>> I had the understanding that RGW's use of class methods, which is also
>> extensive, would be compatible with this approach. Is there reason to
>> doubt that?
>
> I don't see any "cls_cxx_read" calls in RGW's class methods. Like I
> said, assuming the omap object class calls remain synchronous on an
> EC-backed pool, omap won't be an issue for RBD, but reads will be an
> issue (v1 directory, RBD image id objects, and the object map).
>
>> Matt
>>
>> On Thu, Oct 26, 2017 at 11:08 AM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>>> On Thu, Oct 26, 2017 at 10:26 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>> On Thu, 26 Oct 2017, Gregory Farnum wrote:
>>>>> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>>>>> > On 10/25/2017 05:16 AM, Sage Weil wrote:
>>>>> >> Hi Xingguo,
>>>>> >>
>>>>> >> On Wed, 25 Oct 2017, xie.xingguo@xxxxxxxxxx wrote:
>>>>> >>> I wonder why erasure-coded pools cannot support omap currently.
>>>>> >>>
>>>>> >>> The simplest way I can figure out for erasure-coded pools to support
>>>>> >>> omap would be duplicating the omap on every shard.
>>>>> >>>
>>>>> >>> Is it because it consumes too much space when k + m gets bigger?
>>>>> >>
>>>>> >> Right. There isn't a non-trivial way to actually erasure code it, and
>>>>> >> duplicating it on every shard is inefficient.
>>>>> >>
>>>>> >> One reasonable-ish approach would be to replicate the omap data on m+1
>>>>> >> shards. But it's a bit of work to implement and nobody has done it.
>>>>> >>
>>>>> >> I can't remember if there were concerns with this approach or it was just
>>>>> >> a matter of time/resources... Josh? Greg?
>>>>> >
>>>>> > It restricts us to erasure codes like Reed-Solomon where a subset of shards
>>>>> > is always updated. I think this is a reasonable trade-off, though; it's just
>>>>> > a matter of implementing it. We haven't written up the required peering
>>>>> > changes, but they did not seem too difficult to implement.
>>>>> >
>>>>> > Some notes on the approach are here - just think of 'replicating omap'
>>>>> > as a partial write to m+1 shards:
>>>>> >
>>>>> > http://pad.ceph.com/p/ec-partial-writes
>>>>>
>>>>> Yeah. To expand a bit on why this only works for Reed-Solomon,
>>>>> consider the minimum and appropriate number of copies, and the actual
>>>>> shard placement, for local recovery codes. :/ We were unable to
>>>>> generalize for that (or indeed for SHEC, IIRC) when whiteboarding.
>>>>>
>>>>> I'm also still nervous that this might do weird things to our recovery
>>>>> and availability patterns in more complex failure cases, but I don't
>>>>> have any concrete issues.
>>>>
>>>> It seems like the minimum-viable variation of this is that we don't change
>>>> any of the peering or logging behavior at all, but just send the omap
>>>> writes to all shards (like any other write), and only the anointed shards
>>>> persist them.
>>>>
>>>> That leaves lots of room for improvement, but it makes the feature work
>>>> without many changes, and means we can drop the specialness around rbd
>>>> images in EC pools.
>>>
>>> Potentially negative, since RBD relies heavily on class methods. Even
>>> assuming the cls_cxx_map_XYZ operations will never require async work,
>>> there is still the issue of methods that perform straight read/write
>>> calls.
>>>
>>>> Then we can make CephFS and RGW issue warnings about (or even refuse)
>>>> using EC pools for their metadata or index pools, since that's strictly
>>>> less efficient than replicated, to avoid user mistakes.
>>>>
>>>> ?
>>>>
>>>> sage
>>>
>>> --
>>> Jason
>>
>> --
>> Matt Benjamin
>> Red Hat, Inc.
>
> --
> Jason

--

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel. 734-821-5101
fax. 734-769-8938
cel. 734-216-5309
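For readers skimming the thread, here is a toy sketch of the "replicate omap
on m+1 shards" idea Josh and Sage discuss above. The particular choice of
which shards are anointed is hypothetical (the thread does not pin it down);
the only property that matters is that with m+1 copies, losing any m shards
still leaves at least one copy of the omap.

  // Toy sketch, not Ceph code: for a k+m Reed-Solomon profile, pick the
  // shard ids that would persist the replicated omap. Taking the first
  // m+1 ids is an arbitrary, illustrative choice.
  #include <cassert>
  #include <set>

  std::set<int> anointed_omap_shards(int k, int m)
  {
    (void)k;  // k is not needed to pick the first m+1 shard ids
    std::set<int> shards;
    for (int i = 0; i <= m; ++i)
      shards.insert(i);          // m+1 of the k+m shards carry the omap
    return shards;
  }

  int main()
  {
    auto shards = anointed_omap_shards(4, 2);  // e.g. a 4+2 profile
    assert(shards.size() == 3);                // losing any 2 shards still
    return 0;                                  // leaves >= 1 omap copy
  }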
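And a sketch of the object-class distinction Jason draws: an omap lookup
inside a class method (fine if the omap is replicated on m+1 shards and the
call stays synchronous) versus a straight data read (the case that is awkward
to serve synchronously from an EC pool). This is only loosely modeled on
Ceph's C++ objclass API; the class and method names are made up for
illustration and exact signatures vary by release.

  // Illustration only: a hypothetical object class contrasting an omap
  // lookup with a plain data read. Loosely based on objclass.h; not a
  // drop-in implementation.
  #include "include/types.h"
  #include "objclass/objclass.h"

  CLS_VER(1, 0)
  CLS_NAME(ec_omap_example)

  static cls_handle_t h_class;
  static cls_method_handle_t h_get_id;

  static int get_id(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
  {
    // omap lookup ("image_id" is an illustrative key): unaffected on an
    // EC pool if the omap itself is replicated on m+1 shards and the call
    // remains synchronous.
    bufferlist id_bl;
    int r = cls_cxx_map_get_val(hctx, "image_id", &id_bl);
    if (r < 0)
      return r;

    // straight object-data read: the pattern Jason flags as a problem
    // (v1 directory, RBD image id objects, object map), since serving it
    // synchronously on an EC pool implies gathering shards first.
    bufferlist data_bl;
    r = cls_cxx_read(hctx, 0, 4096, &data_bl);
    if (r < 0)
      return r;

    out->append(id_bl);
    return 0;
  }

  void __cls_init()
  {
    cls_register("ec_omap_example", &h_class);
    cls_register_cxx_method(h_class, "get_id", CLS_METHOD_RD,
                            get_id, &h_get_id);
  }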