On Mon, May 7, 2018 at 2:26 PM, Maciej Puzio <mkp37215@xxxxxxxxx> wrote:
> I am an admin in a research lab looking for a cluster storage solution, and a newbie to ceph. I have set up a mini toy cluster on some VMs, to familiarize myself with ceph and to test failure scenarios. I am using ceph 12.2.4 on Ubuntu 18.04. I created 5 OSDs (one OSD per VM), an erasure-coded pool for data (k=3, m=2), a replicated pool for metadata, and CephFS on top of them, using default settings wherever possible. I mounted the filesystem on another machine and verified that it worked.
>
> I then killed two OSD VMs with the expectation that the data pool would still be available, even if in a degraded state, but I found that this was not the case, and that the pool became inaccessible for reading and writing. I listed PGs (ceph pg ls) and found the majority of PGs in an incomplete state. I then found that the pool had size=5 and min_size=4. Where the value 4 came from, I do not know.
>
> This is what I found in the ceph documentation in relation to min_size and the resiliency of erasure-coded pools:
>
> 1. According to http://docs.ceph.com/docs/luminous/rados/operations/pools/ the values size and min_size are for replicated pools only.
>
> 2. According to the same document, for erasure-coded pools the number of OSDs that are allowed to fail without losing data equals the number of coding chunks (m=2 in my case). Of course data loss is not the same thing as lack of access, but why do these two things happen at different redundancy levels by default?
>
> 3. The same document states that no object in the data pool will receive I/O with fewer than min_size replicas. This refers to replicas, and taken together with #1, appears not to apply to erasure-coded pools. But in fact it does, and the default min_size != k causes surprising behavior.
>
> 4. According to http://docs.ceph.com/docs/master/rados/operations/pg-states/ , reducing min_size may allow recovery of an erasure-coded pool. This advice was deemed unhelpful and removed from the documentation (commit 9549943761d1cdc16d72e2b604bf1f89d12b5e13), but then re-added (commit ac6123d7a6d27775eec0a152c00e0ff75b36bd60). I guess I am not the only one confused.

You bring up a good inconsistency that needs to be addressed. AFAIK, only the m value is important for EC pools; I am not sure whether the *replicated* metadata pool is somehow behind the min_size variance you are seeing in your experiment. When we create a replicated pool there is an option for min_size, while for an EC pool it is the m value that is supposed to matter.

> I followed the advice in #4 and reduced min_size to 3. Lo and behold, the pool became accessible, and I could read the data previously stored, and write new data. This appears to contradict #1, but at least it works. A look at ceph pg ls revealed another mystery, though. Most of the PGs were now active+undersized, often with ...+degraded and/or remapped, but a few were active+clean or active+clean+remapped. Why? I would expect all PGs to be in the same state (perhaps active+undersized+degraded?).
>
> I apologize if this behavior turns out to be expected and straightforward to experienced ceph users, or if I missed some documentation that explains it clearly. My goal is to put about 500 TB on ceph or another cluster storage system, and I find these issues confusing and worrisome. Helpful and competent replies will be much appreciated. Please note that my questions are about erasure-coded pools, and not about replicated pools.
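For what it's worth, here is a quick sketch of how you could check where that min_size=4 on the data pool comes from and experiment with lowering it. The pool name cephfs_data and <profile-name> below are placeholders; substitute the actual names from your cluster:

    # show size, min_size and the erasure-code profile attached to each pool
    ceph osd pool ls detail

    # the profile records the k and m values used by the EC pool
    ceph osd erasure-code-profile get <profile-name>

    # check and (with care) lower min_size on the EC data pool
    ceph osd pool get cephfs_data min_size
    ceph osd pool set cephfs_data min_size 3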
>
> Thank you
> Maciej Puzio
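As for why some PGs go back to active+clean while others stay undersized/degraded after losing two OSDs, it may help to look at individual PGs and their acting sets. Again only a sketch, with 2.7 standing in for one of your actual PG ids:

    # list PGs stuck in a non-clean or inactive state
    ceph pg dump_stuck unclean
    ceph pg dump_stuck inactive

    # show the up/acting OSD sets and recovery state of a single PG
    ceph pg 2.7 query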