Re: Why does Erasure-pool not support omap?

Thanks for the explanation, Jason.

Matt

On Thu, Oct 26, 2017 at 11:49 AM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
> On Thu, Oct 26, 2017 at 11:35 AM, Matt Benjamin <mbenjami@xxxxxxxxxx> wrote:
>> I had the understanding that RGW's use of class methods, which is also
>> extensive, would be compatible with this approach.  Is there reason to
>> doubt that?
>
> I don't see any "cls_cxx_read" calls in RGW's class methods. Like I
> said, assuming the object class omap calls remain synchronous on an
> EC-backed pool, omap won't be an issue for RBD, but reads will be an
> issue (v1 directory, RBD image id objects, and the object map).
>
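(Inline, to make sure I follow: below is roughly what the two access
paths look like from a plain librados client. Just a sketch on my part;
the pool name is made up, error handling is elided, and I'm going from
memory on the "rbd_directory" / "rbd_id.<name>" object names and the
"rbd"/"get_id" class method, so treat those as illustrative.)

    // Sketch: the two kinds of metadata access being discussed.
    // Build against librados, e.g.: g++ sketch.cc -lrados
    #include <rados/librados.hpp>
    #include <iostream>
    #include <map>
    #include <string>

    int main() {
      librados::Rados cluster;
      cluster.init2("client.admin", "ceph", 0);  // assumes a reachable cluster
      cluster.conf_read_file(nullptr);
      cluster.connect();

      librados::IoCtx ioctx;
      cluster.ioctx_create("rbd", ioctx);        // hypothetical EC-backed pool

      // 1) A plain omap read; the v2 image directory is kept in omap,
      //    so listing it looks like this.
      std::map<std::string, librados::bufferlist> vals;
      ioctx.omap_get_vals("rbd_directory", "", 1024, &vals);

      // 2) A class-method call; as I understand it, cls_rbd's get_id does a
      //    cls_cxx_read of the rbd_id.<name> object data on the OSD, which is
      //    the read-inside-a-class-method case being discussed.
      librados::bufferlist in, out;
      ioctx.exec("rbd_id.myimage", "rbd", "get_id", in, out);

      std::cout << vals.size() << " directory entries, id blob "
                << out.length() << " bytes\n";
      cluster.shutdown();
      return 0;
    }

The omap_get_vals() path is the omap case; the exec() path is the one
that ends up doing a data read from inside the class method on the OSD.
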
>> Matt
>>
>> On Thu, Oct 26, 2017 at 11:08 AM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>>> On Thu, Oct 26, 2017 at 10:26 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>> On Thu, 26 Oct 2017, Gregory Farnum wrote:
>>>>> On Wed, Oct 25, 2017 at 8:57 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>>>>> > On 10/25/2017 05:16 AM, Sage Weil wrote:
>>>>> >>
>>>>> >> Hi Xingguo,
>>>>> >>
>>>>> >> On Wed, 25 Oct 2017, xie.xingguo@xxxxxxxxxx wrote:
>>>>> >>>
>>>>> >>>        I wonder why erasure-coded pools cannot support omap currently.
>>>>> >>>
>>>>> >>>        The simplest way I can figure out for erasure-coded pools to
>>>>> >>> support omap would be duplicating the omap on every shard.
>>>>> >>>
>>>>> >>>        Is it because it consumes too much space when k + m gets bigger?
>>>>> >>
>>>>> >>
>>>>> >> Right.  There isn't a non-trivial way to actually erasure-code omap
>>>>> >> data, and duplicating it on every shard is inefficient.
>>>>> >>
>>>>> >> One reasonable-ish approach would be to replicate the omap data on m+1
>>>>> >> shards.  But it's a bit of work to implement and nobody has done it.
>>>>> >>
>>>>> >> I can't remember if there were concerns with this approach or if it
>>>>> >> was just a matter of time/resources... Josh? Greg?
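(Just to put numbers on the space trade-off, a throwaway sketch; the
k=8, m=3 profile is an arbitrary example I picked for illustration.)

    // Rough arithmetic: copies of the omap stored per logical omap entry,
    // comparing "duplicate on every shard" with "replicate on m+1 shards".
    #include <cstdio>

    int main() {
      const unsigned k = 8, m = 3;           // example EC profile, k+m = 11 shards
      const unsigned every_shard = k + m;    // 11 copies of the omap
      const unsigned m_plus_one = m + 1;     // 4 copies, still survives m failures
      std::printf("every shard: %ux overhead, m+1 shards: %ux overhead\n",
                  every_shard, m_plus_one);
      return 0;
    }

i.e. with k=8, m=3, per-shard duplication stores 11 copies of every omap
entry, while m+1 replication stores 4 and still tolerates m shard failures.
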
>>>>> >
>>>>> >
>>>>> > It restricts us to erasure codes like Reed-Solomon, where a subset of
>>>>> > shards is always updated. I think this is a reasonable trade-off,
>>>>> > though; it's just a matter of implementing it. We haven't written up
>>>>> > the required peering changes, but they did not seem too difficult to
>>>>> > implement.
>>>>> >
>>>>> > Some notes on the approach are here - just think of 'replicating omap'
>>>>> > as a partial write to m+1 shards:
>>>>> >
>>>>> > http://pad.ceph.com/p/ec-partial-writes
>>>>>
>>>>> Yeah. To expand a bit on why this only works for Reed-Solomon,
>>>>> consider the minimum and appropriate number of copies — and the actual
>>>>> shard placement — for local recovery codes. :/ We were unable to
>>>>> generalize for that (or indeed for SHEC, IIRC) when whiteboarding.
>>>>>
>>>>> I'm also still nervous that this might do weird things to our recovery
>>>>> and availability patterns in more complex failure cases, but I don't
>>>>> have any concrete issues.
>>>>
>>>> It seems like the minimum-viable variation of this is that we don't change
>>>> any of the peering or logging behavior at all, but just send the omap
>>>> writes to all shards (like any other write), with only the anointed shards
>>>> persisting them.
>>>>
>>>> That leaves lots of room for improvement, but it makes the feature work
>>>> without many changes, and means we can drop the specialness around rbd
>>>> images in EC pools.
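(If I'm reading that right, the per-shard gate could be as simple as the
sketch below. This is purely hypothetical on my part; which shards get
"anointed" isn't specified above, so I'm assuming the first m+1 just for
illustration.)

    // Hypothetical helper: with shards numbered 0..k+m-1, persist the omap
    // portion of a transaction only on shards 0..m (m+1 copies), which is
    // enough to keep the omap readable after any m shard failures.
    #include <cstdint>

    inline bool shard_persists_omap(uint32_t shard_index, uint32_t m) {
      return shard_index <= m;   // shards 0..m -> m+1 copies
    }
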
>>>
>>> That's potentially a problem, since RBD relies heavily on class methods.
>>> Even assuming the cls_cxx_map_XYZ operations will never require async
>>> work, there is still the issue of methods that perform straight read/write
>>> calls.
>>>
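(For anyone not living in cls land, a stripped-down sketch of what such a
method looks like, modeled on the in-tree cls modules; the method name and
omap key are made up, registration boilerplate is omitted, and it only
builds inside the Ceph tree against objclass/objclass.h.)

    #include "objclass/objclass.h"
    #include <cerrno>

    static int read_both(cls_method_context_t hctx,
                         ceph::bufferlist *in, ceph::bufferlist *out)
    {
      // Omap access from inside a class method: the kind of call that could
      // plausibly stay synchronous even on an EC pool.
      ceph::bufferlist omap_val;
      int r = cls_cxx_map_get_val(hctx, "some_key", &omap_val);
      if (r < 0 && r != -ENOENT)
        return r;

      // A straight data read from inside the same method: the case that
      // becomes a problem if reads on an EC pool can't be served
      // synchronously in this context.
      uint64_t size;
      r = cls_cxx_stat(hctx, &size, nullptr);
      if (r < 0)
        return r;
      ceph::bufferlist data;
      r = cls_cxx_read(hctx, 0, size, &data);
      if (r < 0)
        return r;

      out->append(data);
      return 0;
    }
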
>>>> Then, to avoid user mistakes, we can make CephFS and RGW warn about (or
>>>> even refuse) using EC pools for their metadata or index pools, since
>>>> that's strictly less efficient than a replicated pool.
>>>>
>>>> ?
>>>>
>>>> sage
>>>
>>>
>>>
>>> --
>>> Jason
>>
>>
>>
>> --
>>
>> Matt Benjamin
>> Red Hat, Inc.
>> 315 West Huron Street, Suite 140A
>> Ann Arbor, Michigan 48103
>>
>> http://www.redhat.com/en/technologies/storage
>>
>> tel.  734-821-5101
>> fax.  734-769-8938
>> cel.  734-216-5309
>
>
>
> --
> Jason



-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309