On Thu, Feb 4, 2021 at 10:30 PM huxiaoyu@xxxxxxxxxxxx <huxiaoyu@xxxxxxxxxxxx> wrote:
>
> >IMO with a cluster this size, you should not ever mark out any OSDs --
> >rather, you should leave the PGs degraded, replace the disk (keep the
> >same OSD ID), then recover those objects to the new disk.
> >Or, keep it <40% used (which sounds like a waste).
>
> Dear Dan,
>
> I particularly like your idea of "leave the PGs degraded, and replace
> the disk with the same OSD ID". This is a wonderful thing I really want to do.
>
> Could you please share some more details on how to achieve this, or
> some scripts that have already been tested?

Hi Samuel,

To do this you'd first set a config so that down OSDs aren't automatically
marked out, either by setting `mon_osd_down_out_interval` to a really large
value, or perhaps mon_osd_down_out_subtree_limit = osd would work too.

Then when the OSD fails, its PGs will become degraded. You just zap, replace
the drive, then recreate the OSD. Here are the ceph-volume commands for that:

* ceph-volume lvm zap --osd-id 1234 --destroy
* (then you might need `ceph osd destroy 1234`)
* (then replace the drive)
* ceph-volume lvm create <according to your hw> --osd-id 1234

Cheers, Dan

>
> Thanks a lot,
>
> Samuel
>
> ________________________________
> huxiaoyu@xxxxxxxxxxxx
>
> From: Dan van der Ster
> Date: 2021-02-04 11:57
> To: Mario Giammarco
> CC: Ceph Users
> Subject: Re: Worst thing that can happen if I have size=2
> On Thu, Feb 4, 2021 at 11:30 AM Mario Giammarco <mgiammarco@xxxxxxxxx> wrote:
> >
> > On Wed, 3 Feb 2021 at 21:22, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >>
> >> Lastly, if you can't afford 3x replicas, then use 2+2 erasure coding if possible.
> >>
> >
> > I will investigate; I have heard that erasure coding is slow.
> >
> > Anyway, I will explain here the reason for this thread.
> > At my customers I usually have Proxmox+Ceph with:
> >
> > - three servers
> > - three monitors
> > - 6 OSDs (two per server)
> > - size=3 and min_size=2
> >
> > I followed the recommendations to stay safe.
> > But one day one disk of one server broke; the OSDs were at 55% used.
> > What happened then?
> > Ceph started filling the remaining OSD to maintain size=3.
> > That OSD reached 90% and Ceph stopped everything.
> > Customer VMs froze, and the customer lost time and some data that had
> > not yet been written to disk.
> >
> > So I got angry.... size=3 and the customer still loses time and data?
>
> You should size the OSD fullness config in such a way that the failures
> you expect would still leave sufficient capacity.
> In our case, we plan so that we could lose and re-replicate an entire
> rack and still have enough space left. (IOW, with 5-6 racks, we start
> to add capacity when the clusters reach ~70-75% full.)
>
> In your case, the issue is more extreme:
> Because you have 3 hosts, 2 OSDs each, and 3 replicas, when one OSD
> fails and is marked out you are telling Ceph that *all* of its
> objects will need to be written to the last remaining disk on the
> host with the failure.
> So unless your cluster was under 40-50% used, that OSD is going to
> become overfull. (But BTW, Ceph will hit backfillfull on the loaded
> OSD before stopping IO -- this should not have blocked your user
> unless they *also* filled the disk with new data at the same time.)
>
> IMO with a cluster this size, you should not ever mark out any OSDs --
> rather, you should leave the PGs degraded, replace the disk (keep the
> same OSD ID), then recover those objects to the new disk.
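
To pull that together: a rough, consolidated sketch of the replace-in-place
sequence described above. It is untested as written here; osd.1234 is a
placeholder ID, /dev/sdX is a placeholder for whatever device layout matches
your hardware, and the interval value is just an arbitrarily large example:

  # keep down OSDs from being marked out automatically
  # (mon_osd_down_out_subtree_limit = osd may work as an alternative)
  ceph config set mon mon_osd_down_out_interval 2592000

  # after osd.1234 fails, leave its PGs degraded, then wipe and rebuild it in place
  ceph-volume lvm zap --osd-id 1234 --destroy
  ceph osd destroy 1234 --yes-i-really-mean-it   # you might need this, as noted above
  # ...physically replace the drive; assume it comes back as /dev/sdX...
  ceph-volume lvm create --data /dev/sdX --osd-id 1234

Once the new OSD starts with the same ID, the degraded PGs recover onto it,
and no other OSDs have to take on extra data in the meantime.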
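
And for what it's worth, a back-of-the-envelope check of the "under 40-50%
used" point above, with made-up numbers (4000 GB per OSD is hypothetical;
the 55% comes from the earlier description of the cluster):

  osd_size_gb=4000; used_pct=55
  per_osd_used_gb=$(( osd_size_gb * used_pct / 100 ))   # 2200 GB of data on each OSD
  survivor_needs_gb=$(( per_osd_used_gb * 2 ))          # 4400 GB pushed onto one 4000 GB disk
  echo "survivor would need ${survivor_needs_gb} GB on a ${osd_size_gb} GB disk"

With two OSDs per host and one replica per host, the surviving OSD has to
absorb its partner's entire share, so anything much above ~40-45% used
(leaving headroom below the backfillfull/full thresholds) cannot be absorbed.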
> Or, keep it <40% used (which sounds like a waste).
>
> -- dan
>
> >>
> >> Cheers, Dan
> >>
> >> On Wed, Feb 3, 2021, 8:49 PM Mario Giammarco <mgiammarco@xxxxxxxxx> wrote:
> >>>
> >>> Thanks Simon, and thanks to the other people who have replied.
> >>> Sorry, let me try to explain myself better.
> >>> It is evident to me that if I have two copies of the data, one breaks,
> >>> and while Ceph is creating a new copy of the data the disk with the
> >>> second copy also breaks, then you lose the data.
> >>> It is obvious, and a bit paranoid, because many servers at many customers
> >>> run on RAID1, and so you are saying: yes, you have two copies of the data,
> >>> but you can lose both. Consider that in Ceph recovery is automatic, while
> >>> with RAID1 someone must go to the customer and change disks manually. So
> >>> Ceph is already an improvement in this case, even with size=2. With size=3
> >>> and min_size=2 it is a bigger improvement, I know.
> >>>
> >>> What I am asking is this: what happens with min_size=1 and split brain,
> >>> network down, or similar things: does Ceph block writes because it has no
> >>> quorum on the monitors? Are there some failure scenarios that I have not
> >>> considered?
> >>> Thanks again!
> >>> Mario
> >>>
> >>> On Wed, 3 Feb 2021 at 17:42, Simon Ironside <sironside@xxxxxxxxxxxxx> wrote:
> >>>
> >>> > On 03/02/2021 09:24, Mario Giammarco wrote:
> >>> > > Hello,
> >>> > > Imagine this situation:
> >>> > > - 3 servers with ceph
> >>> > > - a pool with size 2 min 1
> >>> > >
> >>> > > I know perfectly well that size 3 and min 2 is better.
> >>> > > I would like to know what is the worst thing that can happen:
> >>> >
> >>> > Hi Mario,
> >>> >
> >>> > This thread is worth a read; it's an oldie but a goodie:
> >>> >
> >>> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html
> >>> >
> >>> > Especially this post, which helped me understand the importance of
> >>> > min_size=2:
> >>> >
> >>> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014892.html
> >>> >
> >>> > Cheers,
> >>> > Simon
> >>> > _______________________________________________
> >>> > ceph-users mailing list -- ceph-users@xxxxxxx
> >>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >>> >
> >>> _______________________________________________
> >>> ceph-users mailing list -- ceph-users@xxxxxxx
> >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx