On Thu, Oct 26, 2017 at 10:26 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Thu, 26 Oct 2017, Gregory Farnum wrote:
>> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>> > On 10/25/2017 05:16 AM, Sage Weil wrote:
>> >>
>> >> Hi Xingguo,
>> >>
>> >> On Wed, 25 Oct 2017, xie.xingguo@xxxxxxxxxx wrote:
>> >>>
>> >>> I wonder why erasure-coded pools cannot support omap currently.
>> >>>
>> >>> The simplest way for erasure-coded pools to support omap that I can
>> >>> figure out would be duplicating the omap on every shard.
>> >>>
>> >>> Is it because that consumes too much space when k + m gets bigger?
>> >>
>> >> Right. There isn't a nontrivial way to actually erasure-code it, and
>> >> duplicating it on every shard is inefficient.
>> >>
>> >> One reasonable-ish approach would be to replicate the omap data on m+1
>> >> shards. But it's a bit of work to implement and nobody has done it.
>> >>
>> >> I can't remember if there were concerns with this approach or it was
>> >> just a matter of time/resources... Josh? Greg?
>> >
>> > It restricts us to erasure codes like Reed-Solomon, where the same
>> > subset of shards is always updated. I think this is a reasonable
>> > trade-off, though; it's just a matter of implementing it. We haven't
>> > written up the required peering changes, but they did not seem too
>> > difficult to implement.
>> >
>> > Some notes on the approach are here - just think of 'replicating omap'
>> > as a partial write to m+1 shards:
>> >
>> > http://pad.ceph.com/p/ec-partial-writes
>>
>> Yeah. To expand a bit on why this only works for Reed-Solomon, consider
>> the minimum and appropriate number of copies - and the actual shard
>> placement - for local recovery codes. :/ We were unable to generalize
>> for that (or indeed for SHEC, IIRC) when whiteboarding.
>>
>> I'm also still nervous that this might do weird things to our recovery
>> and availability patterns in more complex failure cases, but I don't
>> have any concrete issues.
>
> It seems like the minimum-viable variation of this is that we don't
> change any of the peering or logging behavior at all, but just send the
> omap writes to all shards (like any other write), while only the
> anointed shards persist them.
>
> That leaves lots of room for improvement, but it makes the feature work
> without many changes, and means we can drop the specialness around rbd
> images in EC pools.

Potentially negative, since RBD relies heavily on class methods. Even
assuming the cls_cxx_map_XYZ operations will never require async work,
there is still the issue of methods that perform straight read/write
calls.

> Then we can make CephFS and RGW issue warnings (or even refuse) to use
> EC pools for their metadata or index pools, since that is strictly less
> efficient than a replicated pool, to avoid user mistakes.
>
> ?
>
> sage

--
Jason
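
[Editor's note] To make the "replicate omap on m+1 shards" idea above concrete, here is a minimal Python sketch, not Ceph code: the shard-selection rule (persist on the first m+1 shards) and the helper names omap_shards / survives_all_m_failures are illustrative assumptions. The point it demonstrates is the one from the thread: with a Reed-Solomon profile of k data and m coding shards, keeping the omap on any fixed set of m+1 shards leaves at least one copy after any m shard failures, matching the pool's data durability, while duplicating on all k+m shards (the approach Xingguo asked about) costs strictly more space.

    from itertools import combinations

    def omap_shards(k, m):
        """Hypothetical rule: persist omap on the first m+1 shards.

        Any fixed choice of m+1 shards works for Reed-Solomon, since the
        same shards participate in every write; LRC/SHEC-style codes break
        that assumption, which is why the thread limits the idea to R-S.
        """
        return set(range(m + 1))

    def survives_all_m_failures(k, m):
        """Check that every possible loss of m shards leaves an omap copy."""
        all_shards = set(range(k + m))
        anointed = omap_shards(k, m)
        # A non-empty set difference means at least one omap copy survived.
        return all(anointed - set(lost)
                   for lost in combinations(all_shards, m))

    if __name__ == "__main__":
        for k, m in [(4, 2), (8, 3), (6, 4)]:
            print(f"k={k} m={m}: omap on shards {sorted(omap_shards(k, m))}, "
                  f"survives any {m} failures: {survives_all_m_failures(k, m)}")

Running this prints True for each profile, i.e. m+1 omap copies are sufficient; the open work discussed above is the peering/logging changes needed so that only those shards actually persist the omap writes.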