Re: Help needed to configure erasure coding LRC plugin

Hello,

What is your current setup: one server per data center with 12 OSDs each? What
are your current CRUSH rule and LRC CRUSH rule?
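
For reference, the output of a few commands would make it easier to compare
notes (a minimal sketch, assuming the pool is named "lrcpool" and uses the
"test_lrc_2" rule quoted below; adjust the names to your setup):

    # which CRUSH rule and EC profile the pool actually uses
    ceph osd pool get lrcpool crush_rule
    ceph osd pool get lrcpool erasure_code_profile

    # dump the rule and the profile themselves
    ceph osd crush rule dump test_lrc_2
    ceph osd erasure-code-profile get <profile-name>

    # overall topology (datacenters, hosts, OSDs)
    ceph osd tree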


On Fri, Apr 28, 2023, 12:29 Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>
wrote:

> Hi,
>
> I think I found a possible cause of my PGs going down, but I still don't
> understand why. As explained in a previous mail, I set up a 15-chunk EC
> pool (k=9, m=6) but I have only 12 OSD servers in the cluster. To work
> around the problem I defined the failure domain as 'osd', with the
> reasoning that, as I was using the LRC plugin, I had the guarantee that I
> could lose a site without impact, and thus could afford to lose 1 OSD
> server. Am I wrong?
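
If you want to double-check what that failure domain does in practice, the
commands below show where the chunks of one PG actually land (a sketch; the
PG id 52.14 and OSD id 90 are just the ones quoted further down this thread,
substitute your own):

    # up/acting set of one PG
    ceph pg map 52.14
    # locate one of those OSDs (host, datacenter)
    ceph osd find 90
    ceph osd tree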
>
> Best regards,
>
> Michel
>
> On 24/04/2023 at 13:24, Michel Jouvin wrote:
> > Hi,
> >
> > I'm still interested in getting feedback from those using the LRC
> > plugin about the right way to configure it... Last week I upgraded
> > from Pacific to Quincy (17.2.6) with cephadm, which does the upgrade
> > host by host, checking that an OSD is ok to stop before actually
> > upgrading it. I was surprised to see 1 or 2 PGs down at some points
> > during the upgrade (it didn't happen for every OSD, but it did happen
> > in every site/datacenter). Looking at the details with
> > "ceph health detail", I saw that for these PGs there were 3 OSDs down,
> > but I was expecting the pool to be resilient to 6 OSDs down (5 for
> > R/W access), so I'm wondering if there is something wrong in our pool
> > configuration (k=9, m=6, l=5).
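
In case it helps to reproduce this outside of an upgrade, the "ok to stop"
check you mention can also be run by hand for a given OSD (a sketch; 90 is
just an id taken from the acting set quoted further down):

    ceph osd ok-to-stop 90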
> >
> > Cheers,
> >
> > Michel
> >
> > On 06/04/2023 at 08:51, Michel Jouvin wrote:
> >> Hi,
> >>
> >> Is somebody using the LRC plugin?
> >>
> >> I came to the conclusion that LRC k=9, m=3, l=4 is not the same as
> >> jerasure k=9, m=6 in terms of protection against failures, and that I
> >> should use k=9, m=6, l=5 to get a level of resilience >= jerasure
> >> k=9, m=6. The example in the documentation (k=4, m=2, l=3) suggests
> >> that this LRC configuration gives something better than jerasure k=4,
> >> m=2, as it is resilient to 3 drive failures (but not 4, if I
> >> understood properly). So how many drives can fail in the k=9, m=6,
> >> l=5 configuration, first without losing R/W access and second without
> >> losing data?
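
For what it's worth, the raw chunk counts at least line up with what you
describe: if I read the LRC documentation correctly, the plugin adds one
local parity chunk per group of l chunks, so

    k=9, m=6, l=5 -> 9 + 6 = 15 chunks, plus 15/5 = 3 local parities = 18 chunks, 6 per datacenter
    k=9, m=3, l=4 -> 9 + 3 = 12 chunks, plus 12/4 = 3 local parities = 15 chunks, 5 per datacenter

which matches the max_size=18 mentioned just below and the 15-OSD rule
further down the thread. How many of those chunks can be lost before losing
R/W access or data depends on which layers can still decode, so I won't
claim a number without testing; my understanding is that each local parity
only helps reconstruct a single missing chunk inside its own group.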
> >>
> >> Another thing that I don't quite understand is that a pool created
> >> with this configuration (and failure domain=osd, locality=datacenter)
> >> has min_size=3 (max_size=18, as expected). That seems wrong to me; I'd
> >> have expected something ~10 (depending on the answer to the previous
> >> question)...
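
min_size is a per-pool property, so it can at least be inspected and, if the
value computed from the profile really looks wrong, overridden by hand. A
sketch, assuming the pool is called "testlrc"; the value 10 is only an
illustration, not a recommendation:

    ceph osd pool get testlrc min_size
    ceph osd pool set testlrc min_size 10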
> >>
> >> Thanks in advance if somebody can provide some sort of authoritative
> >> answer to these 2 questions. Best regards,
> >>
> >> Michel
> >>
> >> On 04/04/2023 at 15:53, Michel Jouvin wrote:
> >>> Answering my own question, I found the reason for 2147483647: it's
> >>> documented as a failure to find enough OSDs (missing OSDs). And it is
> >>> expected, as I asked for different hosts for the 15 OSDs but I have
> >>> only 12 hosts!
> >>>
> >>> I'm still interested in an "expert" confirming that the LRC k=9,
> >>> m=3, l=4 configuration is equivalent, in terms of redundancy, to a
> >>> jerasure configuration with k=9, m=6.
> >>>
> >>> Michel
> >>>
> >>> On 04/04/2023 at 15:26, Michel Jouvin wrote:
> >>>> Hi,
> >>>>
> >>>> As discussed in another thread (Crushmap rule for multi-datacenter
> >>>> erasure coding), I'm trying to create an EC pool spanning 3
> >>>> datacenters (the datacenters are present in the crushmap), with the
> >>>> objective of being resilient to 1 DC down, at least keeping
> >>>> read-only access to the pool and, if possible, read-write access,
> >>>> and of having a storage efficiency better than 3-replica (let's say
> >>>> a storage overhead <= 2).
> >>>>
> >>>> In the discussion, somebody mentioned the LRC plugin as a possible
> >>>> alternative to jerasure for implementing this without hand-tweaking
> >>>> the crushmap rule to get the 2-step OSD allocation. I looked at
> >>>> the documentation
> >>>> (https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/)
> >>>> but I have some questions for anyone with experience/expertise with
> >>>> this LRC plugin.
> >>>>
> >>>> I tried to create a rule using 5 OSDs per datacenter (15 in
> >>>> total), with 3 per datacenter (9 in total) being data chunks and
> >>>> the others being coding chunks. For this, based on my understanding
> >>>> of the examples, I used k=9, m=3, l=4. Is that right? Is this
> >>>> configuration equivalent, in terms of redundancy, to a jerasure
> >>>> configuration with k=9, m=6?
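
For comparison, here is roughly how I would declare that layout as an LRC
profile and let Ceph generate the rule and pool from it. This is only a
sketch: the profile name "lrc_9_3_4" is made up, the PG counts are
placeholders, and the parameters are simply the ones from the paragraph
above, not a recommendation:

    ceph osd erasure-code-profile set lrc_9_3_4 \
        plugin=lrc \
        k=9 m=3 l=4 \
        crush-root=default \
        crush-locality=datacenter \
        crush-failure-domain=host

    # creating an erasure pool from the profile normally also creates the
    # matching CRUSH rule
    ceph osd pool create test_lrc_2 128 128 erasure lrc_9_3_4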
> >>>>
> >>>> The resulting rule, which looks correct to me, is:
> >>>>
> >>>> --------
> >>>>
> >>>> {
> >>>>     "rule_id": 6,
> >>>>     "rule_name": "test_lrc_2",
> >>>>     "ruleset": 6,
> >>>>     "type": 3,
> >>>>     "min_size": 3,
> >>>>     "max_size": 15,
> >>>>     "steps": [
> >>>>         {
> >>>>             "op": "set_chooseleaf_tries",
> >>>>             "num": 5
> >>>>         },
> >>>>         {
> >>>>             "op": "set_choose_tries",
> >>>>             "num": 100
> >>>>         },
> >>>>         {
> >>>>             "op": "take",
> >>>>             "item": -4,
> >>>>             "item_name": "default~hdd"
> >>>>         },
> >>>>         {
> >>>>             "op": "choose_indep",
> >>>>             "num": 3,
> >>>>             "type": "datacenter"
> >>>>         },
> >>>>         {
> >>>>             "op": "chooseleaf_indep",
> >>>>             "num": 5,
> >>>>             "type": "host"
> >>>>         },
> >>>>         {
> >>>>             "op": "emit"
> >>>>         }
> >>>>     ]
> >>>> }
> >>>>
> >>>> ------------
> >>>>
> >>>> Unfortunately, it doesn't work as expected: a pool created with
> >>>> this rule ends up with its PGs active+undersized, which is
> >>>> unexpected to me. Looking at the `ceph health detail` output, I see
> >>>> for each PG something like:
> >>>>
> >>>> pg 52.14 is stuck undersized for 27m, current state
> >>>> active+undersized, last acting
> >>>>
> >>>> [90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647]
> >>>>
> >>>> For each PG, there are 3 '2147483647' entries, and I guess that is
> >>>> the reason for the problem. What are these entries? Clearly they are
> >>>> not OSD IDs... It looks like a negative number, -1, which as a
> >>>> crushmap ID would be the crushmap root (named "default" in our
> >>>> configuration). Is there any trivial mistake I may have made?
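
A side note on the value itself: 2147483647 is 2^31 - 1, which is how CRUSH
reports "no OSD found" for a position in the acting set, rather than a real
OSD id or the crushmap root. If it is useful, such mappings can also be
reproduced offline with crushtool; a sketch, assuming rule id 6 from the
dump above and 15 chunks per PG:

    ceph osd getcrushmap -o crushmap.bin
    crushtool -i crushmap.bin --test --rule 6 --num-rep 15 --show-bad-mappings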
> >>>>
> >>>> Thanks in advance for any help, or for sharing any successful
> >>>> configuration.
> >>>>
> >>>> Best regards,
> >>>>
> >>>> Michel
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



