Yep, reading but not using LRC. Please keep it on the ceph-users list for future reference -- thanks!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Thursday, May 4, 2023 3:07 PM
To: ceph-users@xxxxxxx
Subject: Re: Help needed to configure erasure coding LRC plugin

Hi,

I don't think you've shared your osd tree yet, could you do that?
Apparently nobody else but us reads this thread, or nobody reading this
uses the LRC plugin. ;-)

Thanks,
Eugen

Quoting Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:

> Hi,
>
> I had to restart one of my OSD servers today and the problem showed
> up again. This time I managed to capture "ceph health detail" output
> showing the problem with the 2 PGs:
>
> [WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2 pgs down
>     pg 56.1 is down, acting
> [208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
>     pg 56.12 is down, acting
> [NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]
>
> I still don't understand why, if I am supposed to survive a
> datacenter failure, I cannot survive 3 OSDs down on the same host
> hosting shards for the PG. In the second case only 2 OSDs are down,
> but I'm surprised they don't seem to be in the same "group" of OSDs
> (I'd have expected all the OSDs of one datacenter to be in the same
> group of 5, if the order given really reflects the allocation done)...
>
> Still interested in some explanation of what I'm doing wrong! Best regards,
>
> Michel
>
> On 03/05/2023 at 10:21, Eugen Block wrote:
>> I think I got it wrong with the locality setting. I'm still limited
>> by the number of hosts I have available in my test cluster, but as
>> far as I got with failure-domain=osd, I believe k=6, m=3, l=3 with
>> locality=datacenter could fit your requirement, at least with
>> regards to the recovery bandwidth usage between DCs, but the
>> resiliency would not match your requirement (one DC failure). That
>> profile creates 3 groups of 4 chunks (3 data/coding chunks and one
>> parity chunk) across three DCs, 12 chunks in total. The min_size=7
>> would not allow an entire DC to go down, I'm afraid; you'd have to
>> reduce it to 6 to allow reads/writes in a disaster scenario. I'm
>> still not sure if I got it right this time, but maybe you're better
>> off without the LRC plugin given the limited number of hosts.
>> Instead you could use the jerasure plugin with a profile like k=4
>> m=5, allowing an entire DC to fail without losing data access (we
>> have one customer using that).
>>
>> Quoting Eugen Block <eblock@xxxxxx>:
>>
>>> Hi,
>>>
>>> disclaimer: I haven't used LRC in a real setup yet, so there might
>>> be some misunderstandings on my side. But I tried to play around
>>> with one of my test clusters (Nautilus). Because I'm limited in
>>> the number of hosts (6 across 3 virtual DCs) I tried two different
>>> profiles with lower numbers to get a feeling for how that works.
>>>
>>> # first attempt
>>> ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=4 m=2 l=3 crush-failure-domain=host
>>>
>>> For every third OSD one parity chunk is added, so 2 more chunks to
>>> store ==> 8 chunks in total. Since my failure-domain is host and I
>>> only have 6, I get incomplete PGs.
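
Side note, hedged since I'm only reading along and haven't used LRC
myself: with 8 chunks and crush-failure-domain=host, CRUSH needs 8
distinct hosts, which 6 hosts cannot provide. To confirm, dumping the
generated profile and listing the affected PGs should be enough:

ceph:~ # ceph osd erasure-code-profile get LRCprofile
ceph:~ # ceph pg ls incomplete
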
>>>
>>> # second attempt
>>> ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=2 m=2 l=2 crush-failure-domain=host
>>>
>>> This gives me 6 chunks in total to store across 6 hosts, which works:
>>>
>>> ceph:~ # ceph pg ls-by-pool lrcpool
>>> PG   OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE        SINCE VERSION REPORTED UP                    ACTING                SCRUB_STAMP                DEEP_SCRUB_STAMP
>>> 50.0       1        0         0       0   619           0          0   1 active+clean   72s 18410'1 18415:54 [27,13,0,2,25,7]p27   [27,13,0,2,25,7]p27   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
>>> 50.1       0        0         0       0     0           0          0   0 active+clean    6m     0'0 18414:26 [27,33,22,6,13,34]p27 [27,33,22,6,13,34]p27 2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
>>> 50.2       0        0         0       0     0           0          0   0 active+clean    6m     0'0 18413:25 [1,28,14,4,31,21]p1   [1,28,14,4,31,21]p1   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
>>> 50.3       0        0         0       0     0           0          0   0 active+clean    6m     0'0 18413:24 [8,16,26,33,7,25]p8   [8,16,26,33,7,25]p8   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
>>>
>>> After stopping all OSDs on one host I was still able to read and
>>> write into the pool, but after stopping a second host one PG from
>>> that pool went "down". That I don't fully understand yet, but I
>>> just started to look into it.
>>> With your setup (12 hosts) I would recommend not utilizing all of
>>> them, so you have capacity to recover, let's say one "spare" host
>>> per DC, leaving 9 hosts in total. A profile with k=3 m=3 l=2 could
>>> make sense here, resulting in 9 total chunks (one more parity
>>> chunk for every other OSD), min_size 4. But as I wrote, it
>>> probably doesn't have the resiliency for a DC failure, so that
>>> needs some further investigation.
>>>
>>> Regards,
>>> Eugen
>>>
>>> Quoting Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:
>>>
>>>> Hi,
>>>>
>>>> No... our current setup is 3 datacenters with the same
>>>> configuration, i.e. 1 mon/mgr + 4 OSD servers with 16 OSDs each,
>>>> thus a total of 12 OSD servers. As, with the LRC plugin, k+m must
>>>> be a multiple of l, I found that k=9/m=6/l=5 with
>>>> crush-locality=datacenter was achieving my goal of being
>>>> resilient to a datacenter failure. Because of this, I considered
>>>> that lowering the crush failure domain to osd was not a major
>>>> issue in my case (as it would not be worse than a datacenter
>>>> failure if all the shards are on the same server in a datacenter)
>>>> and was a workaround for the lack of hosts for k=9/m=6 (15 OSDs).
>>>>
>>>> Maybe it helps if I give the erasure code profile used:
>>>>
>>>> crush-device-class=hdd
>>>> crush-failure-domain=osd
>>>> crush-locality=datacenter
>>>> crush-root=default
>>>> k=9
>>>> l=5
>>>> m=6
>>>> plugin=lrc
>>>>
>>>> The previously mentioned strange min_size for the pool created
>>>> with this profile has vanished after the Quincy upgrade, as this
>>>> parameter is no longer in the CRUSH map rule, and the `ceph osd
>>>> pool get` command reports the expected number (10):
>>>>
>>>> ---------
>>>>
>>>>> ceph osd pool get fink-z1.rgw.buckets.data min_size
>>>> min_size: 10
>>>> --------
>>>>
>>>> Cheers,
>>>>
>>>> Michel
>>>>
>>>> On 29/04/2023 at 20:36, Curt wrote:
>>>>> Hello,
>>>>>
>>>>> What is your current setup, 1 server per data center with 12 OSDs
>>>>> each? What is your current crush rule and LRC crush rule?
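
A note in passing, in case it helps to gather that information: the
usual commands should show the current layout, rule and profile (the
rule and profile names below are just the ones mentioned elsewhere in
this thread, adjust to your own):

ceph:~ # ceph osd tree
ceph:~ # ceph osd pool ls detail
ceph:~ # ceph osd crush rule dump test_lrc_2
ceph:~ # ceph osd erasure-code-profile get LRCprofile
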
>>>>>
>>>>> On Fri, Apr 28, 2023, 12:29 Michel Jouvin
>>>>> <michel.jouvin@xxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I think I found a possible cause of my PG down but still don't
>>>>> understand why. As explained in a previous mail, I set up a
>>>>> 15-chunk/OSD EC pool (k=9, m=6) but I have only 12 OSD servers in
>>>>> the cluster. To work around the problem I defined the failure
>>>>> domain as 'osd', with the reasoning that, as I was using the LRC
>>>>> plugin, I had the guarantee that I could lose a site without
>>>>> impact, and thus could afford to lose 1 OSD server. Am I wrong?
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Michel
>>>>>
>>>>> On 24/04/2023 at 13:24, Michel Jouvin wrote:
>>>>> > Hi,
>>>>> >
>>>>> > I'm still interested in getting feedback from those using the LRC
>>>>> > plugin about the right way to configure it... Last week I upgraded
>>>>> > from Pacific to Quincy (17.2.6) with cephadm, which does the
>>>>> > upgrade host by host, checking if an OSD is ok to stop before
>>>>> > actually upgrading it. I was surprised to see 1 or 2 PGs down at
>>>>> > some points in the upgrade (it didn't happen for all OSDs, but it
>>>>> > did for every site/datacenter). Looking at the details with "ceph
>>>>> > health detail", I saw that for these PGs there were 3 OSDs down,
>>>>> > but I was expecting the pool to be resilient to 6 OSDs down (5 for
>>>>> > R/W access), so I'm wondering if there is something wrong in our
>>>>> > pool configuration (k=9, m=6, l=5).
>>>>> >
>>>>> > Cheers,
>>>>> >
>>>>> > Michel
>>>>> >
>>>>> > On 06/04/2023 at 08:51, Michel Jouvin wrote:
>>>>> >> Hi,
>>>>> >>
>>>>> >> Is somebody using the LRC plugin?
>>>>> >>
>>>>> >> I came to the conclusion that LRC k=9, m=3, l=4 is not the same
>>>>> >> as jerasure k=9, m=6 in terms of protection against failures, and
>>>>> >> that I should use k=9, m=6, l=5 to get a level of resilience >=
>>>>> >> jerasure k=9, m=6. The example in the documentation (k=4, m=2,
>>>>> >> l=3) suggests that this LRC configuration gives something better
>>>>> >> than jerasure k=4, m=2, as it is resilient to 3 drive failures
>>>>> >> (but not 4, if I understood properly). So how many drives can
>>>>> >> fail in the k=9, m=6, l=5 configuration, first without losing RW
>>>>> >> access and second without losing data?
>>>>> >>
>>>>> >> Another thing that I don't quite understand is that a pool
>>>>> >> created with this configuration (and failure domain=osd,
>>>>> >> locality=datacenter) has min_size=3 (max_size=18 as expected). It
>>>>> >> seems wrong to me; I'd have expected something ~10 (depending on
>>>>> >> the answer to the previous question)...
>>>>> >>
>>>>> >> Thanks in advance if somebody could provide some sort of
>>>>> >> authoritative answer on these 2 questions. Best regards,
>>>>> >>
>>>>> >> Michel
>>>>> >>
>>>>> >> On 04/04/2023 at 15:53, Michel Jouvin wrote:
>>>>> >>> Answering myself, I found the reason for 2147483647: it's
>>>>> >>> documented as a failure to find enough OSDs (missing OSDs). And
>>>>> >>> it is normal, as I selected different hosts for the 15 OSDs but
>>>>> >>> I have only 12 hosts!
>>>>> >>>
>>>>> >>> I'm still interested in an "expert" confirming that the LRC
>>>>> >>> k=9, m=3, l=4 configuration is equivalent, in terms of
>>>>> >>> redundancy, to a jerasure configuration with k=9, m=6.
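
For what it's worth (I'm not the expert asked for, so take this as a
hedged reading of the docs rather than an authoritative answer), the
chunk counts alone suggest the two are not equivalent: with the simple
k/m/l form the plugin stores k + m + (k+m)/l chunks, so k=9/m=3/l=4
gives 9+3+3 = 15 chunks with only m=3 global coding chunks, while
k=9/m=6/l=5 gives 9+6+3 = 18 chunks (matching the max_size=18 mentioned
above) with m=6. The local parity chunks mainly reduce recovery traffic
within a locality group; in the worst case (e.g. several chunks lost in
the same group) the tolerance is still governed by m, so k=9/m=3/l=4
should not be expected to match jerasure k=9/m=6. The layout a profile
actually generates can be checked with (profile name from the test
above, adjust to yours):

ceph:~ # ceph osd erasure-code-profile get LRCprofile
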
>>>>> >>>
>>>>> >>> Michel
>>>>> >>>
>>>>> >>> On 04/04/2023 at 15:26, Michel Jouvin wrote:
>>>>> >>>> Hi,
>>>>> >>>>
>>>>> >>>> As discussed in another thread (Crushmap rule for multi-datacenter
>>>>> >>>> erasure coding), I'm trying to create an EC pool spanning 3
>>>>> >>>> datacenters (datacenters are present in the crushmap), with the
>>>>> >>>> objective of being resilient to 1 DC down, at least keeping
>>>>> >>>> read-only access to the pool and if possible read-write access,
>>>>> >>>> and with a storage efficiency better than 3-replica (let's say a
>>>>> >>>> storage overhead <= 2).
>>>>> >>>>
>>>>> >>>> In the discussion, somebody mentioned the LRC plugin as a possible
>>>>> >>>> jerasure alternative to implement this without tweaking the
>>>>> >>>> crushmap rule to implement the 2-step OSD allocation. I looked at
>>>>> >>>> the documentation
>>>>> >>>> (https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/)
>>>>> >>>> but I have some questions, if someone has experience/expertise
>>>>> >>>> with this LRC plugin.
>>>>> >>>>
>>>>> >>>> I tried to create a rule using 5 OSDs per datacenter (15 in
>>>>> >>>> total), with 3 per datacenter (9 in total) being data chunks and
>>>>> >>>> the others being coding chunks. For this, based on my
>>>>> >>>> understanding of the examples, I used k=9, m=3, l=4. Is that
>>>>> >>>> right? Is this configuration equivalent, in terms of redundancy,
>>>>> >>>> to a jerasure configuration with k=9, m=6?
>>>>> >>>>
>>>>> >>>> The resulting rule, which looks correct to me, is:
>>>>> >>>>
>>>>> >>>> --------
>>>>> >>>>
>>>>> >>>> {
>>>>> >>>>     "rule_id": 6,
>>>>> >>>>     "rule_name": "test_lrc_2",
>>>>> >>>>     "ruleset": 6,
>>>>> >>>>     "type": 3,
>>>>> >>>>     "min_size": 3,
>>>>> >>>>     "max_size": 15,
>>>>> >>>>     "steps": [
>>>>> >>>>         {
>>>>> >>>>             "op": "set_chooseleaf_tries",
>>>>> >>>>             "num": 5
>>>>> >>>>         },
>>>>> >>>>         {
>>>>> >>>>             "op": "set_choose_tries",
>>>>> >>>>             "num": 100
>>>>> >>>>         },
>>>>> >>>>         {
>>>>> >>>>             "op": "take",
>>>>> >>>>             "item": -4,
>>>>> >>>>             "item_name": "default~hdd"
>>>>> >>>>         },
>>>>> >>>>         {
>>>>> >>>>             "op": "choose_indep",
>>>>> >>>>             "num": 3,
>>>>> >>>>             "type": "datacenter"
>>>>> >>>>         },
>>>>> >>>>         {
>>>>> >>>>             "op": "chooseleaf_indep",
>>>>> >>>>             "num": 5,
>>>>> >>>>             "type": "host"
>>>>> >>>>         },
>>>>> >>>>         {
>>>>> >>>>             "op": "emit"
>>>>> >>>>         }
>>>>> >>>>     ]
>>>>> >>>> }
>>>>> >>>>
>>>>> >>>> ------------
>>>>> >>>>
>>>>> >>>> Unfortunately, it doesn't work as expected: a pool created with
>>>>> >>>> this rule ends up with its PGs active+undersized, which is
>>>>> >>>> unexpected to me. Looking at `ceph health detail` output, I see
>>>>> >>>> for each PG something like:
>>>>> >>>>
>>>>> >>>> pg 52.14 is stuck undersized for 27m, current state
>>>>> >>>> active+undersized, last acting
>>>>> >>>> [90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647]
>>>>> >>>>
>>>>> >>>> For each PG, there are 3 '2147483647' entries and I guess that is
>>>>> >>>> the reason for the problem. What are these entries about? Clearly
>>>>> >>>> they are not OSD entries... It looks like a negative number, -1,
>>>>> >>>> which in terms of crushmap IDs is the crushmap root (named
>>>>> >>>> "default" in our configuration). Any trivial mistake I could have
>>>>> >>>> made?
>>>>> >>>>
>>>>> >>>> Thanks in advance for any help, or for sharing any successful
>>>>> >>>> configuration!
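
Regarding the 2147483647 entries: that value is 2^31 - 1, the
placeholder CRUSH reports when it cannot find an OSD for a slot (newer
releases print NONE instead, as in the health output at the top of this
mail), so it is not an OSD id and not the crushmap root. It simply
means the rule could not place all 15 chunks. Untested suggestion, but
the rule can be checked offline with crushtool (rule id 6 taken from
the dump above); any PG reported as a bad mapping means CRUSH could not
fill all slots:

ceph:~ # ceph osd getcrushmap -o crushmap.bin
ceph:~ # crushtool -i crushmap.bin --test --rule 6 --num-rep 15 --show-bad-mappings
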
>>>>> >>>>
>>>>> >>>> Best regards,
>>>>> >>>>
>>>>> >>>> Michel
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx