> IMO with a cluster this size, you should not ever mark out any OSDs --
> rather, you should leave the PGs degraded, replace the disk (keep the
> same OSD ID), then recover those objects to the new disk.
> Or, keep it <40% used (which sounds like a waste).

Dear Dan,

I particularly like your idea of "leave the PGs degraded, and replace the
disk with the same OSD ID". This is something I really want to do. Could
you please share some more details on how to achieve this, or some scripts
that have already been tested?

thanks a lot,

Samuel

huxiaoyu@xxxxxxxxxxxx


From: Dan van der Ster
Date: 2021-02-04 11:57
To: Mario Giammarco
CC: Ceph Users
Subject: Re: Worst thing that can happen if I have size=2

On Thu, Feb 4, 2021 at 11:30 AM Mario Giammarco <mgiammarco@xxxxxxxxx> wrote:
>
> On Wed, Feb 3, 2021 at 9:22 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>>
>> Lastly, if you can't afford 3x replicas, then use 2+2 erasure coding if possible.
>>
> I will investigate; I heard that erasure coding is slow.
>
> Anyway, I will write here the reason for this thread:
> At my customers' sites I usually have Proxmox+Ceph with:
>
> - three servers
> - three monitors
> - 6 OSDs (two per server)
> - size=3 and min_size=2
>
> I followed the recommendations to stay safe.
> But one day a disk in one server broke; the OSDs were at 55%.
> What happened then?
> Ceph started refilling the remaining OSD to maintain size=3.
> The OSD reached 90% and Ceph stopped everything.
> The customer's VMs froze, and the customer lost time and some data that
> had not yet been written to disk.
>
> So I got angry.... size=3 and the customer still loses time and data?

You should configure the OSD fullness ratios in such a way that the
failures you expect to handle would still leave sufficient capacity. In
our case, we plan so that we could lose and re-replicate an entire rack
and still have enough space left. (IOW, with 5-6 racks, we start to add
capacity when the clusters reach ~70-75% full.)

In your case the issue is more extreme. Because you have 3 hosts, 2 OSDs
each, and 3 replicas, when one OSD fails and is marked out you are telling
Ceph that *all* of the objects from the failed OSD need to be written to
the one remaining disk on that host. So unless your cluster was under
40-50% used, that OSD is going to become overfull.

(But BTW, Ceph will hit the backfillfull threshold on the loaded OSD
before stopping IO -- this should not have blocked your user unless they
*also* filled the disk with new data at the same time.)

IMO with a cluster this size, you should not ever mark out any OSDs --
rather, you should leave the PGs degraded, replace the disk (keep the
same OSD ID), then recover those objects to the new disk.
Or, keep it <40% used (which sounds like a waste).

-- dan
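For anyone who wants to script that replace-in-place flow, the rough
sequence is sketched below. This is a minimal, untested sketch assuming
BlueStore OSDs deployed with ceph-volume; osd.7 and /dev/sdX are
placeholders, so adapt them to your own cluster before running anything:

    ceph osd set noout                          # stop the failed OSD from being marked out automatically
    ceph osd destroy 7 --yes-i-really-mean-it   # keep osd.7's ID and CRUSH entry, mark it destroyed
    # ... physically swap the failed drive ...
    ceph-volume lvm zap /dev/sdX --destroy      # wipe the replacement device if it holds old data
    ceph-volume lvm create --osd-id 7 --data /dev/sdX   # rebuild the OSD reusing ID 7
    ceph osd unset noout                        # back to normal; recovery fills the degraded PGs onto the new disk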
>
>> Cheers, Dan
>>
>> On Wed, Feb 3, 2021, 8:49 PM Mario Giammarco <mgiammarco@xxxxxxxxx> wrote:
>>>
>>> Thanks Simon, and thanks to the other people who have replied.
>>> Sorry, but let me try to explain myself better.
>>> It is evident to me that if I have two copies of the data, one breaks,
>>> and while Ceph is creating a new copy of the data the disk with the
>>> second copy also breaks, then I lose the data.
>>> It is obvious, and a bit paranoid, because many servers at many
>>> customers run on RAID1, and so you are saying: yeah, you have two
>>> copies of the data but you can break both. Consider that in Ceph
>>> recovery is automatic, while with RAID1 someone must manually go to the
>>> customer and change the disks. So Ceph is already an improvement in
>>> this case, even with size=2.
>>> With size=3 and min_size=2 it is a bigger improvement, I know.
>>>
>>> What I am asking is this: what happens with min_size=1 and split brain,
>>> a network outage or similar things? Does Ceph block writes because it
>>> has no quorum on the monitors? Are there failure scenarios that I have
>>> not considered?
>>> Thanks again!
>>> Mario
>>>
>>> On Wed, Feb 3, 2021 at 5:42 PM Simon Ironside <sironside@xxxxxxxxxxxxx> wrote:
>>>
>>> > On 03/02/2021 09:24, Mario Giammarco wrote:
>>> > > Hello,
>>> > > Imagine this situation:
>>> > > - 3 servers with ceph
>>> > > - a pool with size 2 min 1
>>> > >
>>> > > I know perfectly well that size=3 and min_size=2 is better.
>>> > > I would like to know what is the worst thing that can happen:
>>> >
>>> > Hi Mario,
>>> >
>>> > This thread is worth a read, it's an oldie but a goodie:
>>> >
>>> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html
>>> >
>>> > Especially this post, which helped me understand the importance of
>>> > min_size=2:
>>> >
>>> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014892.html
>>> >
>>> > Cheers,
>>> > Simon
>>> > _______________________________________________
>>> > ceph-users mailing list -- ceph-users@xxxxxxx
>>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
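For reference, the size/min_size settings discussed throughout this thread
can be inspected and changed per pool. A small sketch follows; "rbd" is
only an example pool name, and raising size triggers backfill, so watch
the OSD fullness while it runs:

    ceph osd pool ls detail            # shows size and min_size for every pool
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size
    ceph osd pool set rbd size 3       # creates the third replica via backfill
    ceph osd pool set rbd min_size 2   # writes pause when fewer than 2 replicas are available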