Re: Help needed to configure erasure coding LRC plugin

Curt <lightspd@xxxxxxxxx> · Thu, 18 May 2023 00:33:51 +0400

Hi,

I've been following this thread with interest as it seems like a unique use
case to expand my knowledge. I don't use LRC or anything outside basic
erasure coding.

What do the steps of your current CRUSH rule look like?  I know you have
made changes since your first post. I have some thoughts to share, but I'd
like to see the rule first so I can better visualize the distribution.  The
only way I can currently see it working is with more servers, I'm thinking
6 or 9 per data center at minimum, but that could be down to my limited
knowledge of some of the rule steps.

Thanks
Curt

On Tue, May 16, 2023 at 11:09 AM Michel Jouvin <
michel.jouvin@xxxxxxxxxxxxxxx> wrote:

> Hi Eugen,
>
> Yes, sure, no problem sharing it. I'm attaching it to this email (as it
> might clutter the discussion if inlined).
>
> If somebody on the list has a clue about the LRC plugin, I'm still
> interested in understanding what I'm doing wrong!
>
> Cheers,
>
> Michel
>
> On 04/05/2023 at 15:07, Eugen Block wrote:
> > Hi,
> >
> > I don't think you've shared your osd tree yet, could you do that?
> > Apparently nobody else but us reads this thread or nobody reading this
> > uses the LRC plugin. ;-)
> >
> > Thanks,
> > Eugen
> >
> > Quoting Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:
> >
> >> Hi,
> >>
> >> I had to restart one of my OSD servers today and the problem showed up
> >> again. This time I managed to capture the "ceph health detail" output
> >> showing the problem with the 2 PGs:
> >>
> >> [WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2 pgs down
> >>     pg 56.1 is down, acting [208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
> >>     pg 56.12 is down, acting [NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]
> >>
> >> I still don't understand why, if I am supposed to survive a
> >> datacenter failure, I cannot survive 3 OSDs down on the same host
> >> that hold shards for the PG. In the second case only 2 OSDs are down,
> >> but I'm surprised they don't seem to be in the same "group" of OSDs
> >> (I'd have expected all the OSDs of one datacenter to be in the same
> >> group of 5, if the order given really reflects the allocation done).
> >>
> >> Still interested in an explanation of what I'm doing wrong! Best
> >> regards,
> >>
> >> Michel
> >>
> >> On 03/05/2023 at 10:21, Eugen Block wrote:
> >>> I think I got it wrong with the locality setting, I'm still limited
> >>> by the number of hosts I have available in my test cluster, but as
> >>> far as I got with failure-domain=osd I believe k=6, m=3, l=3 with
> >>> locality=datacenter could fit your requirement, at least with
> >>> regards to the recovery bandwidth usage between DCs, but the
> >>> resiliency would not match your requirement (one DC failure). That
> >>> profile creates 3 groups of 4 chunks (3 data/coding chunks and one
> >>> parity chunk) across three DCs, in total 12 chunks. The min_size=7
> >>> would not allow an entire DC to go down, I'm afraid, you'd have to
> >>> reduce it to 6 to allow reads/writes in a disaster scenario. I'm
> >>> still not sure if I got it right this time, but maybe you're better
> >>> off without the LRC plugin with the limited number of hosts. Instead
> >>> you could use the jerasure plugin with a profile like k=4 m=5
> >>> allowing an entire DC to fail without losing data access (we have
> >>> one customer using that).
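As a rough check of the jerasure alternative above, here is a minimal Python sketch. It assumes the 9 chunks are spread evenly, 3 per datacenter (which needs a suitable CRUSH rule, not shown here), and that min_size is k+1, the pattern followed by the min_size values quoted in this thread. `survives_dc_loss` is a hypothetical helper, not a Ceph API.

```python
def survives_dc_loss(k: int, m: int, datacenters: int = 3) -> dict:
    """Check what is left of a jerasure (k, m) pool after losing one datacenter.

    Assumptions (not taken from Ceph itself): chunks are spread evenly
    across the datacenters, and the pool's min_size is k + 1.
    """
    total = k + m
    if total % datacenters != 0:
        raise ValueError("chunks cannot be spread evenly across datacenters")
    remaining = total - total // datacenters
    return {
        "remaining_chunks": remaining,
        "data_recoverable": remaining >= k,   # any k chunks rebuild the object
        "io_continues": remaining >= k + 1,   # PG still meets min_size
        "storage_overhead": total / k,        # raw space per usable byte
    }


print(survives_dc_loss(4, 5))
```

For k=4, m=5 this reports 6 remaining chunks, which is enough both to rebuild the data and to stay at or above min_size, at a raw-to-usable ratio of 2.25.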
> >>>
> >>> Quoting Eugen Block <eblock@xxxxxx>:
> >>>
> >>>> Hi,
> >>>>
> >>>> disclaimer: I haven't used LRC in a real setup yet, so there might
> >>>> be some misunderstandings on my side. But I tried to play around
> >>>> with one of my test clusters (Nautilus). Because I'm limited in the
> >>>> number of hosts (6 across 3 virtual DCs) I tried two different
> >>>> profiles with lower numbers to get a feeling for how that works.
> >>>>
> >>>> # first attempt
> >>>> ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=4 m=2 l=3 crush-failure-domain=host
> >>>>
> >>>> For every third OSD one parity chunk is added, so 2 more chunks to
> >>>> store ==> 8 chunks in total. Since my failure-domain is host and I
> >>>> only have 6 I get incomplete PGs.
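The chunk arithmetic above can be sketched in Python. This is only an illustration, assuming the simple k/m/l form of the LRC profile, where one local parity chunk is added per group of l chunks. `lrc_total_chunks` is a hypothetical helper, not part of Ceph.

```python
def lrc_total_chunks(k: int, m: int, l: int) -> int:
    """Total chunks stored for an LRC profile given in the simple k/m/l form.

    Assumption: one local parity chunk is added for every group of l
    chunks, and k + m must be a multiple of l (as stated in this thread).
    """
    if (k + m) % l != 0:
        raise ValueError("k + m must be a multiple of l")
    return k + m + (k + m) // l


# Profiles mentioned in this thread
for k, m, l in [(4, 2, 3), (2, 2, 2), (9, 6, 5)]:
    print(f"k={k} m={m} l={l}: {lrc_total_chunks(k, m, l)} chunks")
```

With crush-failure-domain=host every chunk needs its own host, so the 8-chunk profile cannot be placed on 6 hosts, while a 6-chunk profile can.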
> >>>>
> >>>> # second attempt
> >>>> ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=2 m=2 l=2 crush-failure-domain=host
> >>>>
> >>>> This gives me 6 chunks in total to store across 6 hosts which works:
> >>>>
> >>>> ceph:~ # ceph pg ls-by-pool lrcpool
> >>>> PG    OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  STATE         SINCE  VERSION  REPORTED  UP                     ACTING                 SCRUB_STAMP                 DEEP_SCRUB_STAMP
> >>>> 50.0  1        0         0          0        619    0            0           1    active+clean  72s    18410'1  18415:54  [27,13,0,2,25,7]p27    [27,13,0,2,25,7]p27    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
> >>>> 50.1  0        0         0          0        0      0            0           0    active+clean  6m     0'0      18414:26  [27,33,22,6,13,34]p27  [27,33,22,6,13,34]p27  2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
> >>>> 50.2  0        0         0          0        0      0            0           0    active+clean  6m     0'0      18413:25  [1,28,14,4,31,21]p1    [1,28,14,4,31,21]p1    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
> >>>> 50.3  0        0         0          0        0      0            0           0    active+clean  6m     0'0      18413:24  [8,16,26,33,7,25]p8    [8,16,26,33,7,25]p8    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
> >>>>
> >>>> After stopping all OSDs on one host I was still able to read and
> >>>> write into the pool, but after stopping a second host one PG from
> >>>> that pool went "down". That I don't fully understand yet, but I
> >>>> just started to look into it.
> >>>> With your setup (12 hosts) I would recommend to not utilize all of
> >>>> them so you have capacity to recover, let's say one "spare" host
> >>>> per DC, leaving 9 hosts in total. A profile with k=3 m=3 l=2 could
> >>>> make sense here, resulting in 9 total chunks (one more parity
> >>>> chunk for every other OSD), min_size 4. But as I wrote, it
> >>>> probably doesn't have the resiliency for a DC failure, so that
> >>>> needs some further investigation.
> >>>>
> >>>> Regards,
> >>>> Eugen
> >>>>
> >>>> Quoting Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> No... our current setup is 3 datacenters with the same
> >>>>> configuration, i.e. 1 mon/mgr + 4 OSD servers with 16 OSDs each,
> >>>>> hence the total of 12 OSD servers. As with the LRC plugin k+m must
> >>>>> be a multiple of l, I found that k=9/m=6/l=5 with
> >>>>> crush-locality=datacenter was achieving my goal of being resilient
> >>>>> to a datacenter failure. Because I had this, I considered that
> >>>>> lowering the crush failure domain to osd was not a major issue in
> >>>>> my case (as it would not be worse than a datacenter failure if all
> >>>>> the shards are on the same server in a datacenter) and was working
> >>>>> around the lack of hosts for k=9/m=6 (15 OSDs).
> >>>>>
> >>>>> Maybe it helps if I give the erasure code profile used:
> >>>>>
> >>>>> crush-device-class=hdd
> >>>>> crush-failure-domain=osd
> >>>>> crush-locality=datacenter
> >>>>> crush-root=default
> >>>>> k=9
> >>>>> l=5
> >>>>> m=6
> >>>>> plugin=lrc
> >>>>>
> >>>>> The previously mentioned strange number for min_size for the pool
> >>>>> created with this profile has vanished after the Quincy upgrade, as
> >>>>> this parameter is no longer in the CRUSH map rule, and the `ceph
> >>>>> osd pool get` command reports the expected number (10):
> >>>>>
> >>>>> ---------
> >>>>>
> >>>>>> ceph osd pool get fink-z1.rgw.buckets.data min_size
> >>>>> min_size: 10
> >>>>> --------
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Michel
> >>>>>
> >>>>> On 29/04/2023 at 20:36, Curt wrote:
> >>>>>> Hello,
> >>>>>>
> >>>>>> What is your current setup, 1 server per data center with 12 OSDs
> >>>>>> each? What is your current crush rule and LRC crush rule?
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Apr 28, 2023, 12:29 Michel Jouvin
> >>>>>> <michel.jouvin@xxxxxxxxxxxxxxx> wrote:
> >>>>>>
> >>>>>>   Hi,
> >>>>>>
> >>>>>>   I think I found a possible cause of my PGs being down, but I still
> >>>>>>   don't understand why. As explained in a previous mail, I set up a
> >>>>>>   15-chunk/OSD EC pool (k=9, m=6) but I have only 12 OSD servers in
> >>>>>>   the cluster. To work around the problem I defined the failure domain
> >>>>>>   as 'osd', reasoning that since I was using the LRC plugin, I had the
> >>>>>>   guarantee that I could lose a site without impact, and thus could
> >>>>>>   also lose 1 OSD server. Am I wrong?
> >>>>>>
> >>>>>>   Best regards,
> >>>>>>
> >>>>>>   Michel
> >>>>>>
> >>>>>>   On 24/04/2023 at 13:24, Michel Jouvin wrote:
> >>>>>>   > Hi,
> >>>>>>   >
> >>>>>>   > I'm still interested in getting feedback from those using the LRC
> >>>>>>   > plugin about the right way to configure it... Last week I upgraded
> >>>>>>   > from Pacific to Quincy (17.2.6) with cephadm, which does the
> >>>>>>   > upgrade host by host, checking whether an OSD is ok to stop before
> >>>>>>   > actually upgrading it. I was surprised to see 1 or 2 PGs down at
> >>>>>>   > some points in the upgrade (it did not happen for all OSDs, but it
> >>>>>>   > did for every site/datacenter). Looking at the details with "ceph
> >>>>>>   > health detail", I saw that for these PGs there were 3 OSDs down,
> >>>>>>   > but I was expecting the pool to be resilient to 6 OSDs down (5 for
> >>>>>>   > R/W access), so I'm wondering if there is something wrong in our
> >>>>>>   > pool configuration (k=9, m=6, l=5).
> >>>>>>   >
> >>>>>>   > Cheers,
> >>>>>>   >
> >>>>>>   > Michel
> >>>>>>   >
> >>>>>>   > On 06/04/2023 at 08:51, Michel Jouvin wrote:
> >>>>>>   >> Hi,
> >>>>>>   >>
> >>>>>>   >> Is anybody using the LRC plugin?
> >>>>>>   >>
> >>>>>>   >> I came to the conclusion that LRC k=9, m=3, l=4 is not the same as
> >>>>>>   >> jerasure k=9, m=6 in terms of protection against failures, and
> >>>>>>   >> that I should use k=9, m=6, l=5 to get a level of resilience >=
> >>>>>>   >> jerasure k=9, m=6. The example in the documentation (k=4, m=2,
> >>>>>>   >> l=3) suggests that this LRC configuration gives something better
> >>>>>>   >> than jerasure k=4, m=2, as it is resilient to 3 drive failures
> >>>>>>   >> (but not 4, if I understood properly). So how many drives can fail
> >>>>>>   >> in the k=9, m=6, l=5 configuration, first without losing RW access
> >>>>>>   >> and second without losing data?
> >>>>>>   >>
> >>>>>>   >> Another thing that I don't quite understand is that a pool created
> >>>>>>   >> with this configuration (and failure domain=osd,
> >>>>>>   >> locality=datacenter) has a min_size=3 (max_size=18 as expected).
> >>>>>>   >> It seems wrong to me; I'd have expected something ~10 (depending
> >>>>>>   >> on the answer to the previous question)...
> >>>>>>   >>
> >>>>>>   >> Thanks in advance if somebody could provide some sort of
> >>>>>>   >> authoritative answer on these 2 questions. Best regards,
> >>>>>>   >>
> >>>>>>   >> Michel
> >>>>>>   >>
> >>>>>>   >> On 04/04/2023 at 15:53, Michel Jouvin wrote:
> >>>>>>   >>> Answering myself: I found the reason for 2147483647. It is
> >>>>>>   >>> documented as a failure to find enough OSDs (missing OSDs). And
> >>>>>>   >>> it is expected, as I required different hosts for the 15 OSDs but
> >>>>>>   >>> I have only 12 hosts!
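To make that concrete, here is a short Python sketch that counts the placeholder entries in the acting set quoted further down and compares them with the host shortfall. 2147483647 is 0x7fffffff, the largest signed 32-bit integer (so it is not -1). The per-datacenter numbers (4 OSD servers available, 5 hosts requested by the rule) are taken from this thread, and the variable names are only illustrative.

```python
CRUSH_ITEM_NONE = 0x7FFFFFFF  # 2147483647, shown when CRUSH could not fill a slot

# Acting set of pg 52.14 as reported by 'ceph health detail' in this thread
acting = [90, 113, 2147483647, 103, 64, 147, 164, 177, 2147483647,
          133, 58, 28, 8, 32, 2147483647]

missing = [i for i, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE]
print(f"{len(missing)} of {len(acting)} shards unplaced, at positions {missing}")

# The rule asks for 5 distinct hosts in each of 3 datacenters,
# but each datacenter only has 4 OSD servers.
datacenters, hosts_wanted, hosts_available = 3, 5, 4
shortfall = datacenters * max(0, hosts_wanted - hosts_available)
print(f"expected shortfall: {shortfall}")
```

Both numbers come out as 3: one unplaced shard per datacenter, which matches the three placeholder entries in the acting set.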
> >>>>>>   >>>
> >>>>>>   >>> I'm still interested in having an "expert" confirm that the LRC
> >>>>>>   >>> k=9, m=3, l=4 configuration is equivalent, in terms of redundancy,
> >>>>>>   >>> to a jerasure configuration with k=9, m=6.
> >>>>>>   >>>
> >>>>>>   >>> Michel
> >>>>>>   >>>
> >>>>>>   >>> On 04/04/2023 at 15:26, Michel Jouvin wrote:
> >>>>>>   >>>> Hi,
> >>>>>>   >>>>
> >>>>>>   >>>> As discussed in another thread (Crushmap rule for
> >>>>>>   >>>> multi-datacenter erasure coding), I'm trying to create an EC pool
> >>>>>>   >>>> spanning 3 datacenters (datacenters are present in the crushmap),
> >>>>>>   >>>> with the objective of being resilient to 1 DC down, at least
> >>>>>>   >>>> keeping read-only access to the pool and if possible read-write
> >>>>>>   >>>> access, and of having a storage efficiency better than 3 replicas
> >>>>>>   >>>> (let's say a storage overhead <= 2).
> >>>>>>   >>>>
> >>>>>>   >>>> In the discussion, somebody mentioned the LRC plugin as a possible
> >>>>>>   >>>> jerasure alternative to implement this without tweaking the
> >>>>>>   >>>> crushmap rule to implement the 2-step OSD allocation. I looked at
> >>>>>>   >>>> the documentation
> >>>>>>   >>>> (https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/)
> >>>>>>   >>>> but I have some questions, if someone has experience/expertise
> >>>>>>   >>>> with this LRC plugin.
> >>>>>>   >>>>
> >>>>>>   >>>> I tried to create a rule using 5 OSDs per datacenter (15 in
> >>>>>>   >>>> total), with 3 (9 in total) being data chunks and the others
> >>>>>>   >>>> being coding chunks. For this, based on my understanding of the
> >>>>>>   >>>> examples, I used k=9, m=3, l=4. Is it right? Is this configuration
> >>>>>>   >>>> equivalent, in terms of redundancy, to a jerasure configuration
> >>>>>>   >>>> with k=9, m=6?
> >>>>>>   >>>>
> >>>>>>   >>>> The resulting rule, which looks correct to me, is:
> >>>>>>   >>>>
> >>>>>>   >>>> --------
> >>>>>>   >>>>
> >>>>>>   >>>> {
> >>>>>>   >>>>     "rule_id": 6,
> >>>>>>   >>>>     "rule_name": "test_lrc_2",
> >>>>>>   >>>>     "ruleset": 6,
> >>>>>>   >>>>     "type": 3,
> >>>>>>   >>>>     "min_size": 3,
> >>>>>>   >>>>     "max_size": 15,
> >>>>>>   >>>>     "steps": [
> >>>>>>   >>>>         {
> >>>>>>   >>>>             "op": "set_chooseleaf_tries",
> >>>>>>   >>>>             "num": 5
> >>>>>>   >>>>         },
> >>>>>>   >>>>         {
> >>>>>>   >>>>             "op": "set_choose_tries",
> >>>>>>   >>>>             "num": 100
> >>>>>>   >>>>         },
> >>>>>>   >>>>         {
> >>>>>>   >>>>             "op": "take",
> >>>>>>   >>>>             "item": -4,
> >>>>>>   >>>>             "item_name": "default~hdd"
> >>>>>>   >>>>         },
> >>>>>>   >>>>         {
> >>>>>>   >>>>             "op": "choose_indep",
> >>>>>>   >>>>             "num": 3,
> >>>>>>   >>>>             "type": "datacenter"
> >>>>>>   >>>>         },
> >>>>>>   >>>>         {
> >>>>>>   >>>>             "op": "chooseleaf_indep",
> >>>>>>   >>>>             "num": 5,
> >>>>>>   >>>>             "type": "host"
> >>>>>>   >>>>         },
> >>>>>>   >>>>         {
> >>>>>>   >>>>             "op": "emit"
> >>>>>>   >>>>         }
> >>>>>>   >>>>     ]
> >>>>>>   >>>> }
> >>>>>>   >>>>
> >>>>>>   >>>> ------------
> >>>>>>   >>>>
> >>>>>>   >>>> Unfortunately, it doesn't work as expected: a pool created with
> >>>>>>   >>>> this rule ends up with its PGs active+undersized, which is
> >>>>>>   >>>> unexpected for me. Looking at the 'ceph health detail' output, I
> >>>>>>   >>>> see for each PG something like:
> >>>>>>   >>>>
> >>>>>>   >>>> pg 52.14 is stuck undersized for 27m, current state
> >>>>>>   >>>> active+undersized, last acting
> >>>>>>   >>>> [90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647]
> >>>>>>   >>>>
> >>>>>>   >>>> For each PG, there are 3 '2147483647' entries and I guess they
> >>>>>>   >>>> are the reason for the problem. What are these entries about?
> >>>>>>   >>>> Clearly they are not OSD entries... It looks like a negative
> >>>>>>   >>>> number, -1, which in terms of crushmap ID is the crushmap root
> >>>>>>   >>>> (named "default" in our configuration). Any trivial mistake I
> >>>>>>   >>>> might have made?
> >>>>>>   >>>>
> >>>>>>   >>>> Thanks in advance for any help or for sharing any successful
> >>>>>>   >>>> configuration.
> >>>>>>   >>>>
> >>>>>>   >>>> Best regards,
> >>>>>>   >>>>
> >>>>>>   >>>> Michel
> >>>>>>   >>>> _______________________________________________
> >>>>>>   >>>> ceph-users mailing list -- ceph-users@xxxxxxx
> >>>>>>   >>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx