Re: Why does Erasure-pool not support omap?

On 10/26/2017 07:26 AM, Sage Weil wrote:
On Thu, 26 Oct 2017, Gregory Farnum wrote:
On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
On 10/25/2017 05:16 AM, Sage Weil wrote:

Hi Xingguo,

On Wed, 25 Oct 2017, xie.xingguo@xxxxxxxxxx wrote:

        I wonder why erasure-coded pools cannot support omap currently.

        The simplest way for erasure-coded pools to support omap that I can
figure out would be duplicating the omap data on every shard.

        Is it because it consumes too much space when k + m gets bigger?


Right.  There isn't a nontrivial way to actually erasure code it, and
duplicating on every shard is inefficient.

One reasonableish approach would be to replicate the omap data on m+1
shards.  But it's a bit of work to implement and nobody has done it.
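
To make the trade-off concrete, here's a tiny sketch (illustrative only,
not Ceph code; the 4+2 layout and the choice of shards are just examples)
showing that m+1 omap copies survive the same m failures the EC data does:

from itertools import combinations

# Illustrative only: with a k+m erasure-coded pool the data survives any
# m shard failures.  If omap is replicated on m+1 shards, any m failures
# still leave at least one omap copy, so omap durability matches the EC
# data while costing m+1 copies instead of the k+m that full duplication
# would need.
k, m = 4, 2                          # e.g. a 4+2 pool
shards = list(range(k + m))          # shard ids 0..5
omap_shards = set(shards[:m + 1])    # keep omap on m+1 = 3 shards

for failed in combinations(shards, m):        # every possible m-shard failure
    survivors = set(shards) - set(failed)
    assert len(survivors) >= k                # EC data still recoverable
    assert omap_shards & survivors            # at least one omap copy left

print("4+2: data and omap both survive any", m, "shard failures")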

I can't remember if there were concerns with this approach or it was just
a matter of time/resources... Josh? Greg?


It restricts us to erasure codes like Reed-Solomon, where a fixed subset
of shards is always updated. I think this is a reasonable trade-off,
though; it's just a matter of implementing it. We haven't written up the
required peering changes, but they did not seem too difficult to
implement.

Some notes on the approach are here - just think of 'replicating omap'
as a partial write to m+1 shards:

http://pad.ceph.com/p/ec-partial-writes
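
As a purely illustrative sketch (the helper and chunk layout are invented,
not something from the pad): the reason Reed-Solomon gives us a usable
subset is that every parity chunk depends on every data chunk, so any
overwrite has to touch all m parity shards in addition to the data shards
covering the changed range.

# Simplified illustration, not Ceph code.
def touched_shards(offset, length, k, m, chunk_size):
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    data = {c % k for c in range(first, last + 1)}   # data shards under the range
    parity = set(range(k, k + m))                    # parity shards k .. k+m-1
    return data | parity

# 4+2 pool, 4 KiB chunks: even a 1-byte overwrite touches 1 data + 2 parity shards
print(sorted(touched_shards(offset=5000, length=1, k=4, m=2, chunk_size=4096)))
# -> [1, 4, 5]

Which m+1 shards would actually carry the omap copies is a separate
choice; the sketch just shows that Reed-Solomon always has a fixed set of
parity shards participating in every write.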

Yeah. To expand a bit on why this only works for Reed-Solomon,
consider the minimum and appropriate number of copies — and the actual
shard placement — for local recovery codes. :/ We were unable to
generalize for that (or indeed for SHEC, IIRC) when whiteboarding.

I'm also still nervous that this might do weird things to our recovery
and availability patterns in more complex failure cases, but I don't
have any concrete issues.

It seems like the minimum-viable variation of this is that we don't change
any of the peering or logging behavior at all: we send the omap writes to
all shards (like any other write), but only the anointed shards persist
them.

That leaves lots of room for improvement, but it makes the feature work
without many changes, and means we can drop the specialness around rbd
images in EC pools.
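
A rough sketch of that minimum-viable behavior (shard selection and
method names here are mine, purely illustrative, not actual Ceph
interfaces):

# Purely illustrative: every shard sees the op as usual; only the
# anointed first m+1 shards persist the omap payload.
class Shard:
    def __init__(self, shard_id, persists_omap):
        self.shard_id = shard_id
        self.persists_omap = persists_omap
        self.omap = {}

    def handle_write(self, data_update, omap_update):
        self.apply_data(data_update)         # every shard applies its EC chunk
        if self.persists_omap:
            self.omap.update(omap_update)    # anointed shards keep the omap

    def apply_data(self, data_update):
        pass                                 # EC chunk write elided

k, m = 4, 2
shards = [Shard(i, persists_omap=(i <= m)) for i in range(k + m)]
for s in shards:
    s.handle_write(data_update=b"...", omap_update={"key": "value"})

print([s.shard_id for s in shards if s.omap])  # -> [0, 1, 2]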

Won't that still require recovery and read path changes?

Then we can make CephFS and RGW warn about (or even refuse) using EC
pools for their metadata or index pools, since it's strictly less
efficient than replication, to avoid user mistakes.

If this is only for rbd, we might as well store k+m copies since there's
so little omap data.

I agree cephfs and rgw should continue to refuse to use EC for metadata,
since their omap use gets far too large and is in the data path.

Josh


