On Thu, Feb 4, 2021 at 10:30 PM huxiaoyu@xxxxxxxxxxxx <huxiaoyu@xxxxxxxxxxxx> wrote:
>
> >IMO with a cluster this size, you should not ever mark out any OSDs --
> >rather, you should leave the PGs degraded, replace the disk (keep the
> >same OSD ID), then recover those objects to the new disk.
> >Or, keep it <40% used (which sounds like a waste).
>
> Dear Dan,
>
> I particularly like your idea of "leave the PGs degraded, and replace
> the disk with the same OSD ID". This is a wonderful thing I really want to do.
>
> Could you please share some more details on how to achieve this, or
> some scripts that have already been tested?

Hi Samuel,

To do this you'd first set a config so that down OSDs aren't automatically
marked out, either by setting `mon_osd_down_out_interval` to a really large
value, or perhaps mon_osd_down_out_subtree_limit = osd would work too.

Then when the OSD fails, its PGs will become degraded. You just zap, replace
the drive, then recreate the OSD. Here are the ceph-volume commands for that:

* ceph-volume lvm zap --osd-id 1234 --destroy
* (then you might need `ceph osd destroy 1234`)
* (then replace the drive)
* ceph-volume lvm create <according to your hw> --osd-id 1234

Cheers, Dan

>
> Thanks a lot,
>
> Samuel
>
> ________________________________
> huxiaoyu@xxxxxxxxxxxx
>
> From: Dan van der Ster
> Date: 2021-02-04 11:57
> To: Mario Giammarco
> CC: Ceph Users
> Subject: Re: Worst thing that can happen if I have size=2
> On Thu, Feb 4, 2021 at 11:30 AM Mario Giammarco <mgiammarco@xxxxxxxxx> wrote:
> >
> > On Wed, 3 Feb 2021 at 21:22, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >>
> >> Lastly, if you can't afford 3x replicas, then use 2+2 erasure coding if possible.
> >>
> >
> > I will investigate; I have heard that erasure coding is slow.
> >
> > Anyway, I will explain here the reason for this thread.
> > At my customers I usually have Proxmox+Ceph with:
> >
> > - three servers
> > - three monitors
> > - 6 OSDs (two per server)
> > - size=3 and min_size=2
> >
> > I followed the recommendations to stay safe.
> > But one day one disk of one server broke; the OSDs were at 55% used.
> > What happened then?
> > Ceph started filling the remaining OSD to maintain size=3.
> > That OSD reached 90% and Ceph stopped everything.
> > Customer VMs froze, and the customer lost time and some data that had
> > not yet been written to disk.
> >
> > So I got angry.... size=3 and the customer still loses time and data?
>
> You should size the OSD fullness config in such a way that the failures
> you expect would still leave sufficient capacity.
> In our case, we plan so that we could lose and re-replicate an entire
> rack and still have enough space left. (IOW, with 5-6 racks, we start
> to add capacity when the clusters reach ~70-75% full.)
>
> In your case, the issue is more extreme:
> Because you have 3 hosts, 2 OSDs each, and 3 replicas, when one OSD
> fails and is marked out you are telling Ceph that *all* of its
> objects will need to be written to the last remaining disk on the
> host with the failure.
> So unless your cluster was under 40-50% used, that OSD is going to
> become overfull. (But BTW, Ceph will hit backfillfull on the loaded
> OSD before stopping IO -- this should not have blocked your user
> unless they *also* filled the disk with new data at the same time.)
>
> IMO with a cluster this size, you should not ever mark out any OSDs --
> rather, you should leave the PGs degraded, replace the disk (keep the
> same OSD ID), then recover those objects to the new disk.
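
To pull that together: a rough, consolidated sketch of the replace-in-place
sequence described above. It is untested as written here; osd.1234 is a
placeholder ID, /dev/sdX is a placeholder for whatever device layout matches
your hardware, and the interval value is just an arbitrarily large example:

  # keep down OSDs from being marked out automatically
  # (mon_osd_down_out_subtree_limit = osd may work as an alternative)
  ceph config set mon mon_osd_down_out_interval 2592000

  # after osd.1234 fails, leave its PGs degraded, then wipe and rebuild it in place
  ceph-volume lvm zap --osd-id 1234 --destroy
  ceph osd destroy 1234 --yes-i-really-mean-it   # you might need this, as noted above
  # ...physically replace the drive; assume it comes back as /dev/sdX...
  ceph-volume lvm create --data /dev/sdX --osd-id 1234

Once the new OSD starts with the same ID, the degraded PGs recover onto it,
and no other OSDs have to take on extra data in the meantime.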
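
And for what it's worth, a back-of-the-envelope check of the "under 40-50%
used" point above, with made-up numbers (4000 GB per OSD is hypothetical;
the 55% comes from the earlier description of the cluster):

  osd_size_gb=4000; used_pct=55
  per_osd_used_gb=$(( osd_size_gb * used_pct / 100 ))   # 2200 GB of data on each OSD
  survivor_needs_gb=$(( per_osd_used_gb * 2 ))          # 4400 GB pushed onto one 4000 GB disk
  echo "survivor would need ${survivor_needs_gb} GB on a ${osd_size_gb} GB disk"

With two OSDs per host and one replica per host, the surviving OSD has to
absorb its partner's entire share, so anything much above ~40-45% used
(leaving headroom below the backfillfull/full thresholds) cannot be absorbed.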
> Or, keep it <40% used (which sounds like a waste).
>
> -- dan
>
> >>
> >> Cheers, Dan
> >>
> >> On Wed, Feb 3, 2021, 8:49 PM Mario Giammarco <mgiammarco@xxxxxxxxx> wrote:
> >>>
> >>> Thanks Simon, and thanks to the other people who have replied.
> >>> Sorry, let me try to explain myself better.
> >>> It is evident to me that if I have two copies of the data, one breaks,
> >>> and while Ceph is creating a new copy of the data the disk with the
> >>> second copy also breaks, then you lose the data.
> >>> It is obvious, and a bit paranoid, because many servers at many customers
> >>> run on RAID1, and so you are saying: yes, you have two copies of the data,
> >>> but you can lose both. Consider that in Ceph recovery is automatic, while
> >>> with RAID1 someone must go to the customer and change disks manually. So
> >>> Ceph is already an improvement in this case, even with size=2. With size=3
> >>> and min_size=2 it is a bigger improvement, I know.
> >>>
> >>> What I am asking is this: what happens with min_size=1 and split brain,
> >>> network down, or similar things: does Ceph block writes because it has no
> >>> quorum on the monitors? Are there some failure scenarios that I have not
> >>> considered?
> >>> Thanks again!
> >>> Mario
> >>>
> >>> On Wed, 3 Feb 2021 at 17:42, Simon Ironside <sironside@xxxxxxxxxxxxx> wrote:
> >>>
> >>> > On 03/02/2021 09:24, Mario Giammarco wrote:
> >>> > > Hello,
> >>> > > Imagine this situation:
> >>> > > - 3 servers with ceph
> >>> > > - a pool with size 2 min 1
> >>> > >
> >>> > > I know perfectly well that size 3 and min 2 is better.
> >>> > > I would like to know what is the worst thing that can happen:
> >>> >
> >>> > Hi Mario,
> >>> >
> >>> > This thread is worth a read; it's an oldie but a goodie:
> >>> >
> >>> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html
> >>> >
> >>> > Especially this post, which helped me understand the importance of
> >>> > min_size=2:
> >>> >
> >>> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014892.html
> >>> >
> >>> > Cheers,
> >>> > Simon
> >>> > _______________________________________________
> >>> > ceph-users mailing list -- ceph-users@xxxxxxx
> >>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >>> >
> >>> _______________________________________________
> >>> ceph-users mailing list -- ceph-users@xxxxxxx
> >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx