I am an admin in a research lab looking for a cluster storage solution, and a newcomer to Ceph. I have set up a small toy cluster on some VMs to familiarize myself with Ceph and to test failure scenarios. I am using Ceph 12.2.4 on Ubuntu 18.04. I created 5 OSDs (one OSD per VM), an erasure-coded pool for data (k=3, m=2), a replicated pool for metadata, and CephFS on top of them, using default settings wherever possible. I mounted the filesystem on another machine and verified that it worked.

I then killed two OSD VMs, expecting the data pool to remain available, even if in a degraded state. I found that this was not the case: the pool became inaccessible for both reading and writing. I listed the PGs (ceph pg ls) and found the majority of them in an incomplete state. I then found that the pool had size=5 and min_size=4. Where the value 4 came from, I do not know.

This is what I found in the Ceph documentation regarding min_size and the resiliency of erasure-coded pools:

1. According to http://docs.ceph.com/docs/luminous/rados/operations/pools/ the values size and min_size apply to replicated pools only.

2. According to the same document, for erasure-coded pools the number of OSDs that are allowed to fail without losing data equals the number of coding chunks (m=2 in my case). Of course data loss is not the same thing as lack of access, but why do these two things happen at different redundancy levels by default?

3. The same document states that no object in the data pool will receive I/O with fewer than min_size replicas. This refers to replicas, and taken together with #1 appears not to apply to erasure-coded pools. In fact it does apply, and the default min_size != k causes surprising behavior.

4. According to http://docs.ceph.com/docs/master/rados/operations/pg-states/ , reducing min_size may allow recovery of an erasure-coded pool. This advice was deemed unhelpful and removed from the documentation (commit 9549943761d1cdc16d72e2b604bf1f89d12b5e13), but then re-added (commit ac6123d7a6d27775eec0a152c00e0ff75b36bd60). I guess I am not the only one who is confused.

I followed the advice in #4 and reduced min_size to 3. Lo and behold, the pool became accessible, and I could read the data previously stored and write new data. This appears to contradict #1, but at least it works.

A look at ceph pg ls revealed another mystery, though. Most PGs were now active+undersized, often with +degraded and/or +remapped, but a few were active+clean or active+clean+remapped. Why? I would expect all PGs to be in the same state (perhaps active+undersized+degraded?).

I apologize if this behavior turns out to be expected and straightforward to experienced Ceph users, or if I missed some documentation that explains it clearly. My goal is to put about 500 TB on Ceph or another cluster storage system, and I find these issues confusing and worrisome. Helpful and competent replies will be much appreciated. Please note that my questions are about erasure-coded pools, not about replicated pools.

Thank you

Maciej Puzio
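
P.S. For reference, here is a sketch of roughly how the pools were created. The profile and pool names (ecprofile, ecpool, cephfs_meta) and the PG counts are placeholders, not necessarily my exact invocations:

    # erasure-code profile: 3 data chunks + 2 coding chunks
    ceph osd erasure-code-profile set ecprofile k=3 m=2 crush-failure-domain=host

    # EC data pool on that profile, plus a replicated metadata pool
    ceph osd pool create ecpool 64 64 erasure ecprofile
    ceph osd pool create cephfs_meta 32 32 replicated

    # overwrite support is needed to use an EC pool as a CephFS data pool
    ceph osd pool set ecpool allow_ec_overwrites true
    ceph fs new cephfs cephfs_meta ecpool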
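
And this is roughly how I inspected the pool and applied the workaround from #4 (again with the placeholder pool name):

    # this is where I saw size=5 and min_size=4 on the EC pool
    ceph osd pool ls detail

    # most PGs showed as incomplete after two OSDs went down
    ceph pg ls

    # reduce min_size to k; after this the pool became accessible again
    ceph osd pool set ecpool min_size 3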