Re: What is the meaning of size and min_size for erasure-coded pools?

I still don't understand why I get any clean PGs in the erasure-coded
pool when, with two OSDs down, there is no more redundancy, and
therefore all PGs should be undersized (or so I think).
I repeated the experiment by bringing the two killed OSDs back online
and then killing them again, and got results similar to those of the
previous test. But this time I observed the process more closely.

Example showing the state changes for one of the PGs [OSD assignment in brackets]:
When all OSDs were online: [3,2,4,0,1], state: active+clean
Initially after OSDs 3 and 4 were killed: [x,2,x,0,1], state: active+undersized+degraded
After some time (OSDs 3 and 4 still offline): [0,2,0,0,1], state: active+clean
('x' stands for some large number that was listed; I assume it marks an unavailable OSD)
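
For interpretation: the large number is most likely 2147483647 (2^31 - 1), the
placeholder Ceph prints when no OSD is mapped to a slot. Something like the
following Python sketch could be used to print a PG's up and acting sets side
by side, which is where bracketed OSD lists like the ones above come from (it
only assumes the ceph CLI is in the PATH; the PG id is a hypothetical
placeholder):

# Sketch: show the up and acting sets of a single PG.
# Assumes the ceph CLI is available; PGID below is a placeholder.
import json
import subprocess

PGID = "2.1a"            # hypothetical PG id, substitute a real one
CRUSH_NONE = 2147483647  # value used when no OSD is mapped to a slot

# 'ceph pg <pgid> query' prints JSON by default
out = subprocess.check_output(["ceph", "pg", PGID, "query"]).decode()
info = json.loads(out)

def render(osds):
    # Show the placeholder as 'x', like in the lists above
    return [("x" if osd == CRUSH_NONE else osd) for osd in osds]

print("state :", info.get("state"))
print("up    :", render(info.get("up", [])))
print("acting:", render(info.get("acting", [])))

Comparing the two sets for the same PG before and after the OSDs go down
should make it clearer which of the bracketed lists is the up set and which is
the acting set.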

Another PG:
When all OSDs were online: [0,3,2,1,4], state: active+clean
Initially after OSDs 3 and 4 were killed: [0,x,2,1,x], state: active+undersized+degraded
After some time (OSDs 3 and 4 still offline): [0,1,2,1,1], state: active+clean+remapped
Note: this PG became remapped, while the previous one did not.
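
As far as I understand it, "remapped" means that the acting set currently
differs from the up set that CRUSH computed. A small sketch along the same
lines as above to list which PGs of the data pool are in that situation (ceph
CLI assumed; the pool id is a hypothetical placeholder, it is the number
before the dot in the pgid):

# Sketch: list PGs whose acting set differs from their up set,
# which is what the 'remapped' flag indicates.
# Assumes the ceph CLI is available; POOL_ID is a placeholder.
import json
import subprocess

POOL_ID = "2"  # hypothetical id of the EC data pool

out = subprocess.check_output(
    ["ceph", "pg", "dump", "pgs", "--format=json"]).decode()
dump = json.loads(out)
# Depending on the Ceph release the output is either a bare list of PG
# stats or an object wrapping them in "pg_stats"; handle both.
pgs = dump.get("pg_stats", []) if isinstance(dump, dict) else dump

for pg in pgs:
    if pg["pgid"].startswith(POOL_ID + ".") and pg["up"] != pg["acting"]:
        print(pg["pgid"], pg["state"], "up:", pg["up"], "acting:", pg["acting"])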

Does this mean that these PGs now have 5 chunks, of which 3 are stored
on one OSD?
Perhaps I am missing something, but could this arrangement be
redundant? And how can a non-redundant state be considered clean?
By the way, I am using crush-failure-domain=host, and I have one OSD per host.
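
To check directly whether any PG really does keep several chunks on one OSD
(which, with crush-failure-domain=host and one OSD per host, would also mean
one failure domain), a sketch like this could be run against the same dump
(same assumptions as above, POOL_ID being a hypothetical placeholder):

# Sketch: flag PGs whose acting set lists the same OSD more than once,
# i.e. PGs with several chunks co-located on a single OSD / host.
# Assumes the ceph CLI is available; POOL_ID is a placeholder.
import json
import subprocess
from collections import Counter

POOL_ID = "2"  # hypothetical id of the EC data pool

out = subprocess.check_output(
    ["ceph", "pg", "dump", "pgs", "--format=json"]).decode()
dump = json.loads(out)
pgs = dump.get("pg_stats", []) if isinstance(dump, dict) else dump  # layout varies by release

for pg in pgs:
    if not pg["pgid"].startswith(POOL_ID + "."):
        continue
    dups = {osd: n for osd, n in Counter(pg["acting"]).items() if n > 1}
    if dups:
        print(pg["pgid"], pg["state"], "acting:", pg["acting"], "duplicates:", dups)

If that prints anything, then those PGs really do hold more than one chunk per
OSD, which would at least be consistent with them reporting clean while not
being able to survive another host failure.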

On the plus side, I have no complaints about how the replicated
metadata pool operates. Unfortunately, I will not be able to use
replication for the data in my future production cluster.
One more thing: I figured out that "degraded" means "undersized and
contains data".
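
That reading can be spot-checked as well: for every undersized PG, report
whether it also carries the degraded flag and whether it actually contains
objects (the object count is in the per-PG "stat_sum" of the dump). A sketch
under the same assumptions as above:

# Sketch: for each undersized PG, show whether it is also degraded
# and how many objects it holds.
# Assumes the ceph CLI is available.
import json
import subprocess

out = subprocess.check_output(
    ["ceph", "pg", "dump", "pgs", "--format=json"]).decode()
dump = json.loads(out)
pgs = dump.get("pg_stats", []) if isinstance(dump, dict) else dump  # layout varies by release

for pg in pgs:
    states = pg["state"].split("+")
    if "undersized" not in states:
        continue
    print(pg["pgid"],
          "degraded" if "degraded" in states else "not degraded",
          "objects:", pg["stat_sum"]["num_objects"])

If the only undersized PGs without the degraded flag are the ones with zero
objects, that would match the interpretation above.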

Thanks

Maciej Puzio


On Wed, May 9, 2018 at 7:07 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Wed, May 9, 2018 at 4:37 PM, Maciej Puzio <mkp37215@xxxxxxxxx> wrote:
>> My setup consists of two pools on 5 OSDs, and is intended for cephfs:
>> 1. erasure-coded data pool: k=3, m=2, size=5, min_size=3 (originally
>> 4), number of PGs=128
>> 2. replicated metadata pool: size=3, min_size=2, number of PGs=100
>>
>> When all OSDs were online, all PGs from both pools had the status
>> active+clean. After killing two of the five OSDs (and changing min_size
>> to 3), all metadata pool PGs remained active+clean, and of the 128 data
>> pool PGs, 3 remained active+clean, 11 became active+clean+remapped, and
>> the rest became active+undersized, active+undersized+remapped,
>> active+undersized+degraded or active+undersized+degraded+remapped,
>> seemingly at random.
>>
>> After some time, one of the remaining three OSD nodes lost network
>> connectivity (due to a ceph-unrelated bug in virtio_net; this toy setup
>> sure is becoming a bug motherlode!). The node was rebooted, the ceph
>> cluster became accessible again (with 3 out of 5 OSDs online, as
>> before), and the three active+clean data pool PGs now became
>> active+clean+remapped, while the rest of the PGs seem to have kept
>> their previous status.
>
> That collection makes sense, then, if you have a replicated pool as
> well as an EC one. They represent different states for the PG; see
> http://docs.ceph.com/docs/jewel/rados/operations/pg-states/. They are
> not random: which PG ends up in which set of states is determined by
> how the CRUSH placement and the failures interact, and CRUSH is a
> pseudo-random algorithm, so... ;)
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


