On Tue, May 8, 2018 at 12:07 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> On Tue, May 8, 2018 at 7:35 PM, Vasu Kulkarni <vakulkar@xxxxxxxxxx> wrote:
>> On Mon, May 7, 2018 at 2:26 PM, Maciej Puzio <mkp37215@xxxxxxxxx> wrote:
>>> I am an admin in a research lab looking for a cluster storage
>>> solution, and a newbie to ceph. I have set up a mini toy cluster on
>>> some VMs to familiarize myself with ceph and to test failure
>>> scenarios. I am using ceph 12.2.4 on Ubuntu 18.04. I created 5 OSDs
>>> (one OSD per VM), an erasure-coded pool for data (k=3, m=2), a
>>> replicated pool for metadata, and CephFS on top of them, using
>>> default settings wherever possible. I mounted the filesystem on
>>> another machine and verified that it worked.
>>>
>>> I then killed two OSD VMs, expecting that the data pool would still
>>> be available, even if in a degraded state, but I found that this was
>>> not the case and that the pool became inaccessible for reading and
>>> writing. I listed PGs (ceph pg ls) and found the majority of PGs in
>>> an incomplete state. I then found that the pool had size=5 and
>>> min_size=4. Where the value 4 came from, I do not know.
>>>
>>> This is what I found in the ceph documentation in relation to
>>> min_size and the resiliency of erasure-coded pools:
>>>
>>> 1. According to
>>> http://docs.ceph.com/docs/luminous/rados/operations/pools/ the values
>>> size and min_size are for replicated pools only.
>>> 2. According to the same document, for erasure-coded pools the number
>>> of OSDs that are allowed to fail without losing data equals the
>>> number of coding chunks (m=2 in my case). Of course data loss is not
>>> the same thing as lack of access, but why do these two things happen
>>> at different redundancy levels by default?
>>> 3. The same document states that no object in the data pool will
>>> receive I/O with fewer than min_size replicas. This refers to
>>> replicas, and taken together with #1, appears not to apply to
>>> erasure-coded pools. But in fact it does, and the default min_size !=
>>> k causes surprising behavior.
>>> 4. According to
>>> http://docs.ceph.com/docs/master/rados/operations/pg-states/ ,
>>> reducing min_size may allow recovery of an erasure-coded pool. This
>>> advice was deemed unhelpful and removed from the documentation
>>> (commit 9549943761d1cdc16d72e2b604bf1f89d12b5e13), but then re-added
>>> (commit ac6123d7a6d27775eec0a152c00e0ff75b36bd60). I guess I am not
>>> the only one confused.
>>
>> You bring up a good inconsistency that needs to be addressed. AFAIK,
>> only the m value is important for EC pools; I am not sure if the
>> *replicated* metadata pool is somehow causing the min_size variance
>> in your experiment. When we create a replicated pool there is an
>> option for min_size, and for an EC pool it is the m value.
>
> See https://github.com/ceph/ceph/pull/8008 for the reason why min_size
> defaults to k+1 on ec pools.

So this looks like it happens by default per EC pool, unless the user
changes the pool's min_size. Probably this should be left unchanged and
we could document it? It is a bit confusing with the coding chunks.

>
> Cheers, Dan
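
For anyone who wants to check this on their own cluster, something along
these lines should work. This is only a sketch: it assumes the EC data
pool is named cephfs_data (substitute your own pool name), and
<profile-name> is whatever the first command reports:

    # which erasure-code profile the data pool uses, and its k/m values
    ceph osd pool get cephfs_data erasure_code_profile
    ceph osd erasure-code-profile get <profile-name>

    # defaults created for the pool: size = k+m = 5, min_size = k+1 = 4
    ceph osd pool get cephfs_data size
    ceph osd pool get cephfs_data min_size

    # list PGs stuck in the incomplete state after the OSD failures
    ceph pg ls incomplete

    # allow I/O with only k=3 shards available (i.e. with two OSDs down)
    ceph osd pool set cephfs_data min_size 3

Note that running with min_size = k means the pool accepts I/O while a
single further failure could make objects unrecoverable; per the PR Dan
linked, that is the reason k+1 is the default.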