Re: What is the meaning of size and min_size for erasure-coded pools?

On Tue, May 8, 2018 at 2:16 PM Maciej Puzio <mkp37215@xxxxxxxxx> wrote:
Thank you everyone for your replies. However, I feel that at least
part of the discussion deviated from the topic of my original post. As
I wrote before, I am dealing with a toy cluster, whose purpose is not
to provide resilient storage, but to evaluate Ceph and its behavior in
the event of a failure, with particular attention paid to worst-case
scenarios. This cluster is purposely minimal and is built on VMs
running on my workstation, with all OSDs storing data on a single SSD.
It is definitely not a production system.

I am not asking for advice on how to build resilient clusters, at
least not at this point. I asked some questions about specific things
that I noticed during my tests and that I was not able to find
explained in the Ceph documentation. Dan van der Ster wrote:
> See https://github.com/ceph/ceph/pull/8008 for the reason why min_size defaults to k+1 on ec pools.
That's a good point, but I am wondering why reads are also blocked
when the number of OSDs falls to k. If the total number of OSDs in a
pool (n) is larger than k+m, should min_size then be k(+1) or
n-m(+1)?
In any case, since min_size can be easily changed, I guess this is not
an implementation issue, but rather a documentation issue.
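
To make the arithmetic behind my question concrete, here is a small
Python sketch (purely illustrative, not Ceph code) of how I understand
the relationship between surviving shards, k, m and min_size for a
single PG:

# Illustrative arithmetic only: how many shard failures an EC pool
# (k data + m coding shards) tolerates, and when a PG serves I/O
# for a given min_size.

def pg_status(k, m, surviving_shards, min_size):
    """Describe a PG that has surviving_shards of its k+m shards left."""
    if surviving_shards < k:
        return "data unrecoverable (fewer than k shards left)"
    serves_io = surviving_shards >= min_size
    spare = surviving_shards - k  # further failures tolerated before data loss
    status = "I/O allowed" if serves_io else "I/O blocked"
    return f"{status}, can lose {spare} more shard(s) before data loss"

if __name__ == "__main__":
    k, m = 3, 2
    for lost in range(m + 1):
        left = k + m - lost
        print(f"lost {lost} OSD(s), {left} shard(s) left:")
        print("  min_size=k   ->", pg_status(k, m, left, k))
        print("  min_size=k+1 ->", pg_status(k, m, left, k + 1))

With min_size=k the pool keeps accepting I/O when there is no
redundancy left at all, which, as I understand it, is the concern
behind the k+1 default discussed in the pull request above.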

Which leaves these questions of mine still unanswered:
After killing m OSDs and setting min_size=k, most of the PGs were
active+undersized, often with ...+degraded and/or remapped, but a few
were active+clean or active+clean+remapped. Why? I would expect all
PGs to be in the same state (perhaps active+undersized+degraded?).
Is this mishmash of PG states normal? If not, would I have avoided it
if I had created the pool with min_size=k=3 from the start? In other
words, does min_size influence the assignment of PGs to OSDs, or is it
only used to force an I/O shutdown in the event of OSD failures?

active+clean does not make a lot of sense if every PG really was 3+2, but perhaps you also had a 3x replicated pool or something similar hanging around from your deployment tool?
The active+clean+remapped state means that a PG was somehow lucky enough to have an existing "stray" copy on one of the OSDs, which it has decided to use to bring itself back up to the right number of copies, even though that copy certainly won't match the proper failure domains.
min_size in relation to the k+m values won't have any direct impact here, although it might indirectly affect things by changing how quickly stray PG copies get deleted.
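
If you want to check whether those active+clean PGs actually belong to
a different pool, something along these lines should do it. It is only
a rough sketch: it shells out to "ceph pg dump pgs" and groups PG
states by pool ID, and the exact JSON layout varies a bit between
releases, so treat the field handling as an assumption to verify:

#!/usr/bin/env python3
# Rough sketch: count PG states per pool, to see whether the
# active+clean PGs belong to the EC pool or to some other pool.
import json
import subprocess
from collections import Counter, defaultdict

def ceph_json(*args):
    out = subprocess.check_output(["ceph", *args, "--format", "json"])
    return json.loads(out)

dump = ceph_json("pg", "dump", "pgs")
# Some releases return a bare list of PG stats, others wrap it in a dict.
pg_stats = dump["pg_stats"] if isinstance(dump, dict) else dump

states_by_pool = defaultdict(Counter)
for pg in pg_stats:
    pool_id = pg["pgid"].split(".")[0]  # pgid looks like "<pool>.<seq>"
    states_by_pool[pool_id][pg["state"]] += 1

for pool_id, states in sorted(states_by_pool.items()):
    print(f"pool {pool_id}:")
    for state, count in states.most_common():
        print(f"  {count:4d}  {state}")

You can then match the pool IDs against "ceph osd pool ls detail" to
see which one is the erasure-coded pool.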
-Greg
 

Thank you very much

Maciej Puzio
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
