> IMO with a cluster this size, you should not ever mark out any OSDs --
> rather, you should leave the PGs degraded, replace the disk (keep the
> same OSD ID), then recover those objects to the new disk.
> Or, keep it <40% used (which sounds like a waste).

Dear Dan,

I particularly like your idea of "leave the PGs degraded, and replace the
disk with the same OSD ID". This is something I really want to do. Could
you please share some more details on how to achieve this, or some scripts
that have already been tested?

thanks a lot,

Samuel

huxiaoyu@xxxxxxxxxxxx


From: Dan van der Ster
Date: 2021-02-04 11:57
To: Mario Giammarco
CC: Ceph Users
Subject: Re: Worst thing that can happen if I have size=2

On Thu, Feb 4, 2021 at 11:30 AM Mario Giammarco <mgiammarco@xxxxxxxxx> wrote:
>
> On Wed, Feb 3, 2021 at 9:22 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>
>> Lastly, if you can't afford 3x replicas, then use 2+2 erasure coding if possible.
>>
> I will investigate; I heard that erasure coding is slow.
>
> Anyway, I will write here the reason for this thread:
> At my customers' sites I usually have Proxmox+Ceph with:
>
> - three servers
> - three monitors
> - 6 OSDs (two per server)
> - size=3 and min_size=2
>
> I followed the recommendations to stay safe.
> But one day a disk in one server broke; the OSDs were at 55%.
> What happened then?
> Ceph started refilling the remaining OSD to maintain size=3.
> The OSD reached 90% and Ceph stopped everything.
> The customer's VMs froze, and the customer lost time and some data that
> had not yet been written to disk.
>
> So I got angry.... size=3 and the customer still loses time and data?

You should configure the OSD fullness ratios in such a way that the
failures you expect to handle would still leave sufficient capacity. In
our case, we plan so that we could lose and re-replicate an entire rack
and still have enough space left. (IOW, with 5-6 racks, we start to add
capacity when the clusters reach ~70-75% full.)

In your case the issue is more extreme. Because you have 3 hosts, 2 OSDs
each, and 3 replicas, when one OSD fails and is marked out you are telling
Ceph that *all* of the objects from the failed OSD need to be written to
the one remaining disk on that host. So unless your cluster was under
40-50% used, that OSD is going to become overfull.

(But BTW, Ceph will hit the backfillfull threshold on the loaded OSD
before stopping IO -- this should not have blocked your user unless they
*also* filled the disk with new data at the same time.)

IMO with a cluster this size, you should not ever mark out any OSDs --
rather, you should leave the PGs degraded, replace the disk (keep the
same OSD ID), then recover those objects to the new disk.
Or, keep it <40% used (which sounds like a waste).

-- dan
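For anyone who wants to script that replace-in-place flow, the rough
sequence is sketched below. This is a minimal, untested sketch assuming
BlueStore OSDs deployed with ceph-volume; osd.7 and /dev/sdX are
placeholders, so adapt them to your own cluster before running anything:

    ceph osd set noout                          # stop the failed OSD from being marked out automatically
    ceph osd destroy 7 --yes-i-really-mean-it   # keep osd.7's ID and CRUSH entry, mark it destroyed
    # ... physically swap the failed drive ...
    ceph-volume lvm zap /dev/sdX --destroy      # wipe the replacement device if it holds old data
    ceph-volume lvm create --osd-id 7 --data /dev/sdX   # rebuild the OSD reusing ID 7
    ceph osd unset noout                        # back to normal; recovery fills the degraded PGs onto the new disk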
>
>> Cheers, Dan
>>
>> On Wed, Feb 3, 2021, 8:49 PM Mario Giammarco <mgiammarco@xxxxxxxxx> wrote:
>>>
>>> Thanks Simon, and thanks to the other people who have replied.
>>> Sorry, but let me try to explain myself better.
>>> It is evident to me that if I have two copies of the data, one breaks,
>>> and while Ceph is creating a new copy of the data the disk with the
>>> second copy also breaks, then I lose the data.
>>> It is obvious, and a bit paranoid, because many servers at many
>>> customers run on RAID1, and so you are saying: yeah, you have two
>>> copies of the data but you can break both. Consider that in Ceph
>>> recovery is automatic, while with RAID1 someone must manually go to the
>>> customer and change the disks. So Ceph is already an improvement in
>>> this case, even with size=2.
>>> With size=3 and min_size=2 it is a bigger improvement, I know.
>>>
>>> What I am asking is this: what happens with min_size=1 and split brain,
>>> a network outage or similar things? Does Ceph block writes because it
>>> has no quorum on the monitors? Are there failure scenarios that I have
>>> not considered?
>>> Thanks again!
>>> Mario
>>>
>>> On Wed, Feb 3, 2021 at 5:42 PM Simon Ironside <sironside@xxxxxxxxxxxxx> wrote:
>>>
>>> > On 03/02/2021 09:24, Mario Giammarco wrote:
>>> > > Hello,
>>> > > Imagine this situation:
>>> > > - 3 servers with ceph
>>> > > - a pool with size 2 min 1
>>> > >
>>> > > I know perfectly well that size=3 and min_size=2 is better.
>>> > > I would like to know what is the worst thing that can happen:
>>> >
>>> > Hi Mario,
>>> >
>>> > This thread is worth a read, it's an oldie but a goodie:
>>> >
>>> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html
>>> >
>>> > Especially this post, which helped me understand the importance of
>>> > min_size=2:
>>> >
>>> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014892.html
>>> >
>>> > Cheers,
>>> > Simon
>>> > _______________________________________________
>>> > ceph-users mailing list -- ceph-users@xxxxxxx
>>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
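For reference, the size/min_size settings discussed throughout this thread
can be inspected and changed per pool. A small sketch follows; "rbd" is
only an example pool name, and raising size triggers backfill, so watch
the OSD fullness while it runs:

    ceph osd pool ls detail            # shows size and min_size for every pool
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size
    ceph osd pool set rbd size 3       # creates the third replica via backfill
    ceph osd pool set rbd min_size 2   # writes pause when fewer than 2 replicas are available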