I am an admin in a research lab looking for a cluster storage solution, and a newcomer to Ceph. I have set up a small toy cluster on some VMs to familiarize myself with Ceph and to test failure scenarios. I am using Ceph 12.2.4 on Ubuntu 18.04. I created 5 OSDs (one OSD per VM), an erasure-coded pool for data (k=3, m=2), a replicated pool for metadata, and CephFS on top of them, using default settings wherever possible. I mounted the filesystem on another machine and verified that it worked.

I then killed two OSD VMs, expecting the data pool to remain available, even if in a degraded state. I found that this was not the case: the pool became inaccessible for both reading and writing. I listed the PGs (ceph pg ls) and found the majority of them in an incomplete state. I then found that the pool had size=5 and min_size=4. Where the value 4 came from, I do not know.

This is what I found in the Ceph documentation regarding min_size and the resiliency of erasure-coded pools:

1. According to http://docs.ceph.com/docs/luminous/rados/operations/pools/ the values size and min_size apply to replicated pools only.

2. According to the same document, for erasure-coded pools the number of OSDs that are allowed to fail without losing data equals the number of coding chunks (m=2 in my case). Of course data loss is not the same thing as lack of access, but why do these two things happen at different redundancy levels by default?

3. The same document states that no object in the data pool will receive I/O with fewer than min_size replicas. This refers to replicas, and taken together with #1 appears not to apply to erasure-coded pools. In fact it does apply, and the default min_size != k causes surprising behavior.

4. According to http://docs.ceph.com/docs/master/rados/operations/pg-states/ , reducing min_size may allow recovery of an erasure-coded pool. This advice was deemed unhelpful and removed from the documentation (commit 9549943761d1cdc16d72e2b604bf1f89d12b5e13), but then re-added (commit ac6123d7a6d27775eec0a152c00e0ff75b36bd60). I guess I am not the only one who is confused.

I followed the advice in #4 and reduced min_size to 3. Lo and behold, the pool became accessible, and I could read the data previously stored and write new data. This appears to contradict #1, but at least it works.

A look at ceph pg ls revealed another mystery, though. Most PGs were now active+undersized, often with +degraded and/or +remapped, but a few were active+clean or active+clean+remapped. Why? I would expect all PGs to be in the same state (perhaps active+undersized+degraded?).

I apologize if this behavior turns out to be expected and straightforward to experienced Ceph users, or if I missed some documentation that explains it clearly. My goal is to put about 500 TB on Ceph or another cluster storage system, and I find these issues confusing and worrisome. Helpful and competent replies will be much appreciated. Please note that my questions are about erasure-coded pools, not about replicated pools.

Thank you

Maciej Puzio
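
P.S. For reference, here is a sketch of roughly how the pools were created. The profile and pool names (ecprofile, ecpool, cephfs_meta) and the PG counts are placeholders, not necessarily my exact invocations:

    # erasure-code profile: 3 data chunks + 2 coding chunks
    ceph osd erasure-code-profile set ecprofile k=3 m=2 crush-failure-domain=host

    # EC data pool on that profile, plus a replicated metadata pool
    ceph osd pool create ecpool 64 64 erasure ecprofile
    ceph osd pool create cephfs_meta 32 32 replicated

    # overwrite support is needed to use an EC pool as a CephFS data pool
    ceph osd pool set ecpool allow_ec_overwrites true
    ceph fs new cephfs cephfs_meta ecpool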
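
And this is roughly how I inspected the pool and applied the workaround from #4 (again with the placeholder pool name):

    # this is where I saw size=5 and min_size=4 on the EC pool
    ceph osd pool ls detail

    # most PGs showed as incomplete after two OSDs went down
    ceph pg ls

    # reduce min_size to k; after this the pool became accessible again
    ceph osd pool set ecpool min_size 3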