Re: Why does Erasure-pool not support omap?

On Thu, Oct 26, 2017 at 11:35 AM, Matt Benjamin <mbenjami@xxxxxxxxxx> wrote:
> I had the understanding that RGW's use of class methods, which is also
> extensive, would be compatible with this approach.  Is there reason to
> doubt that?

I don't see any "cls_cxx_read" calls in RGW's class methods. Like I
said, assuming the omap class method calls remain synchronous on an
EC-backed pool, omap won't be an issue for RBD, but plain reads will be
(the v1 directory, the RBD image id objects, and the object map).
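
To make the distinction concrete, here's a minimal, made-up class method
sketch against the objclass C++ API (the method name and key are
hypothetical; only the two calls matter):

    #include "objclass/objclass.h"

    // Hypothetical method, for illustration only.
    static int example_method(cls_method_context_t hctx,
                              bufferlist *in, bufferlist *out)
    {
      bufferlist val;
      // omap lookup: can stay local/synchronous if the omap data is
      // mirrored on this shard
      int r = cls_cxx_map_get_val(hctx, "some_key", &val);
      if (r < 0 && r != -ENOENT)
        return r;

      bufferlist data;
      // plain data read: cannot be served locally on an EC pool, since
      // the object's data is striped/encoded across the other shards
      r = cls_cxx_read(hctx, 0, 4096, &data);
      if (r < 0)
        return r;

      out->claim_append(data);
      return 0;
    }

The first kind of call is what could stay synchronous with the omap
mirrored on a subset of shards; the second kind is what bites the v1
directory, the image id objects, and the object map.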

> Matt
>
> On Thu, Oct 26, 2017 at 11:08 AM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>> On Thu, Oct 26, 2017 at 10:26 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>> On Thu, 26 Oct 2017, Gregory Farnum wrote:
>>>> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>>>> > On 10/25/2017 05:16 AM, Sage Weil wrote:
>>>> >>
>>>> >> Hi Xingguo,
>>>> >>
>>>> >> On Wed, 25 Oct 2017, xie.xingguo@xxxxxxxxxx wrote:
>>>> >>>
>>>> >>>        I wonder why erasure pools cannot support omap currently.
>>>> >>>
>>>> >>>        The simplest way I can figure out for erasure pools to support
>>>> >>> omap would be duplicating the omap on every shard.
>>>> >>>
>>>> >>>        Is it because that consumes too much space as k + m gets bigger?
>>>> >>
>>>> >>
>>>> >> Right.  There isn't a nontrivial way to actually erasure code it, and
>>>> >> duplicating on every shard is inefficient.
>>>> >>
>>>> >> One reasonableish approach would be to replicate the omap data on m+1
>>>> >> shards.  But it's a bit of work to implement and nobody has done it.
>>>> >>
>>>> >> I can't remember if there were concerns with this approach or if it was
>>>> >> just a matter of time/resources... Josh? Greg?
>>>> >
>>>> >
>>>> > It restricts us to erasure codes like Reed-Solomon, where a fixed
>>>> > subset of shards is always updated. I think that's a reasonable
>>>> > trade-off, though; it's just a matter of implementing it. We haven't
>>>> > written up the required peering changes, but they did not seem too
>>>> > difficult to implement.
>>>> >
>>>> > Some notes on the approach are here - just think of 'replicating omap'
>>>> > as a partial write to m+1 shards:
>>>> >
>>>> > http://pad.ceph.com/p/ec-partial-writes
>>>>
>>>> Yeah. To expand a bit on why this only works for Reed-Solomon,
>>>> consider the minimum and appropriate number of copies — and the actual
>>>> shard placement — for local recovery codes. :/ We were unable to
>>>> generalize for that (or indeed for SHEC, IIRC) when whiteboarding.
>>>>
>>>> I'm also still nervous that this might do weird things to our recovery
>>>> and availability patterns in more complex failure cases, but I don't
>>>> have any concrete issues.
>>>
>>> It seems like the minimum-viable variation of this is that we don't change
>>> any of the peering or logging behavior at all, but just send the omap
>>> writes to all shards (like any other write), and only the anointed shards
>>> persist them.
>>>
>>> That leaves lots of room for improvement, but it makes the feature work
>>> without many changes, and means we can drop the specialness around rbd
>>> images in EC pools.
>>
>> Potentially a negative, since RBD relies heavily on class methods.
>> Even assuming the cls_cxx_map_XYZ operations will never require async
>> work, there is still the issue of methods that perform straight
>> read/write calls.
>>
>>> Then, to avoid user mistakes, we can make CephFS and RGW warn about (or
>>> even refuse) using EC pools for their metadata or index pools, since that
>>> is strictly less efficient than a replicated pool.
>>>
>>> ?
>>>
>>> sage
>>
>>
>>
>> --
>> Jason
>
>
>
> --
>
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel.  734-821-5101
> fax.  734-769-8938
> cel.  734-216-5309
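
(To make the m+1 idea above concrete for myself: a toy sketch, not actual
OSD code, and the choice of which shards are "anointed" is my assumption,
e.g. simply the first m+1 shards.)

    // Toy sketch: omap updates are sent to every shard, but only the
    // first m+1 shards persist them, so losing any m shards still
    // leaves at least one surviving copy of the omap.
    struct ec_profile {
      unsigned k;   // data shards
      unsigned m;   // coding shards
    };

    static inline bool shard_persists_omap(const ec_profile &p, unsigned shard)
    {
      return shard <= p.m;   // shards 0..m, i.e. m+1 omap copies
    }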



-- 
Jason