Re: why my cluster become unavailable (min_size of pool)

Hi, Sage

I have a question about the min_size setting of a pool.

The default value of min_size is 2, but with this setting, when two OSDs are down (meaning two replicas are lost) at the same time, IO is blocked.
We want to set min_size to 1 in our production environment, as we think it is a normal case for two OSDs (on different hosts, of course) to be down at the same time.

So are there any potential problems with this setting?

We are using version 0.80.10.
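
For reference, this is roughly what we would run (a minimal sketch; the pool name "rbd" is just an example, not necessarily our real pool):

    # check the current replication size and min_size of the pool
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size

    # allow client IO to continue with only one surviving replica
    ceph osd pool set rbd min_size 1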

Thanks!


------------------				 
hzwulibin
2015-11-26

-------------------------------------------------------------
From: "hzwulibin"<hzwulibin@xxxxxxxxx>
Sent: 2015-11-23 09:00
To: Sage Weil, Haomai Wang
Cc: ceph-devel
Subject: Re: why my cluster become unavailable

Hi, Sage

Thanks! I will try it in the next round of testing!

------------------				 
hzwulibin
2015-11-23

-------------------------------------------------------------
From: Sage Weil <sage@xxxxxxxxxxxx>
Sent: 2015-11-22 01:49
To: Haomai Wang
Cc: Libin Wu, ceph-devel
Subject: Re: why my cluster become unavailable

On Sun, 22 Nov 2015, Haomai Wang wrote:
> On Thu, Nov 19, 2015 at 11:26 PM, Libin Wu <hzwulibin@xxxxxxxxx> wrote:
> > Hi, cephers
> >
> > I have a cluster of 6 OSD servers; every server has 8 OSDs.
> >
> > I marked out 4 OSDs on every server, and then my client IO blocked.
> >
> > I rebooted my client and then created a new rbd device, but the new
> > device also can't write IO.
> >
> > Yeah, I understand that some data may be lost, as all three replicas
> > of some objects were lost, but why does the cluster become unavailable?
> >
> > There are 80 incomplete pgs and 4 down+incomplete pgs.
> >
> > Is there any solution that could solve the problem?
> 
> Yes, if you don't have a special crushmap to control the data
> placement policy, the pgs will lack the metadata necessary to boot. You
> need to re-add the outed OSDs or force-remove the pgs which are
> incomplete (hope it's just a test).
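
A rough sketch of the re-add step described above (assuming osd.3 is one of the OSDs that was marked out; the pg commands only list the damage, they do not force-remove anything):

    # list the stuck/incomplete pgs
    ceph health detail
    ceph pg dump_stuck inactive

    # mark a previously outed OSD back in so its data can be used for peering
    ceph osd in osd.3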

Is min_size 2 or 1?  Reducing it to 1 will generally clear some of the 
incomplete pgs.  Just remember to raise it back to 2 after the cluster 
recovers.
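
A minimal sketch of that sequence, assuming the affected pool is named "rbd":

    # temporarily let pgs go active with a single surviving replica
    ceph osd pool set rbd min_size 1

    # watch the cluster recover, then restore the safer default
    ceph -w
    ceph osd pool set rbd min_size 2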

sage

