Yep, reading but not using LRC. Please keep it on the ceph-users list for future reference -- thanks!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Thursday, May 4, 2023 3:07 PM
To: ceph-users@xxxxxxx
Subject: Re: Help needed to configure erasure coding LRC plugin

Hi,

I don't think you've shared your osd tree yet, could you do that?
Apparently nobody else but us reads this thread, or nobody reading this
uses the LRC plugin. ;-)

Thanks,
Eugen

Quoting Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:

> Hi,
>
> I had to restart one of my OSD servers today and the problem showed
> up again. This time I managed to capture "ceph health detail" output
> showing the problem with the 2 PGs:
>
> [WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2 pgs down
>     pg 56.1 is down, acting
> [208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
>     pg 56.12 is down, acting
> [NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]
>
> I still don't understand why, if I am supposed to survive a
> datacenter failure, I cannot survive 3 OSDs down on the same host
> hosting shards for the PG. In the second case only 2 OSDs are down,
> but I'm surprised they don't seem to be in the same "group" of OSDs
> (I'd have expected all the OSDs of one datacenter to be in the same
> group of 5, if the order given really reflects the allocation done)...
>
> Still interested in some explanation of what I'm doing wrong! Best regards,
>
> Michel
>
> On 03/05/2023 at 10:21, Eugen Block wrote:
>> I think I got it wrong with the locality setting. I'm still limited
>> by the number of hosts I have available in my test cluster, but as
>> far as I got with failure-domain=osd, I believe k=6, m=3, l=3 with
>> locality=datacenter could fit your requirement, at least with
>> regards to the recovery bandwidth usage between DCs, but the
>> resiliency would not match your requirement (one DC failure). That
>> profile creates 3 groups of 4 chunks (3 data/coding chunks and one
>> parity chunk) across three DCs, 12 chunks in total. The min_size=7
>> would not allow an entire DC to go down, I'm afraid; you'd have to
>> reduce it to 6 to allow reads/writes in a disaster scenario. I'm
>> still not sure if I got it right this time, but maybe you're better
>> off without the LRC plugin given the limited number of hosts.
>> Instead you could use the jerasure plugin with a profile like k=4
>> m=5, allowing an entire DC to fail without losing data access (we
>> have one customer using that).
>>
>> Quoting Eugen Block <eblock@xxxxxx>:
>>
>>> Hi,
>>>
>>> disclaimer: I haven't used LRC in a real setup yet, so there might
>>> be some misunderstandings on my side. But I tried to play around
>>> with one of my test clusters (Nautilus). Because I'm limited in
>>> the number of hosts (6 across 3 virtual DCs) I tried two different
>>> profiles with lower numbers to get a feeling for how that works.
>>>
>>> # first attempt
>>> ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=4 m=2 l=3 crush-failure-domain=host
>>>
>>> For every third OSD one parity chunk is added, so 2 more chunks to
>>> store ==> 8 chunks in total. Since my failure-domain is host and I
>>> only have 6, I get incomplete PGs.
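
Side note, hedged since I'm only reading along and haven't used LRC
myself: with 8 chunks and crush-failure-domain=host, CRUSH needs 8
distinct hosts, which 6 hosts cannot provide. To confirm, dumping the
generated profile and listing the affected PGs should be enough:

ceph:~ # ceph osd erasure-code-profile get LRCprofile
ceph:~ # ceph pg ls incomplete
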
>>>
>>> # second attempt
>>> ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=2 m=2 l=2 crush-failure-domain=host
>>>
>>> This gives me 6 chunks in total to store across 6 hosts, which works:
>>>
>>> ceph:~ # ceph pg ls-by-pool lrcpool
>>> PG   OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE        SINCE VERSION REPORTED UP                    ACTING                SCRUB_STAMP                DEEP_SCRUB_STAMP
>>> 50.0       1        0         0       0   619           0          0   1 active+clean   72s 18410'1 18415:54 [27,13,0,2,25,7]p27   [27,13,0,2,25,7]p27   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
>>> 50.1       0        0         0       0     0           0          0   0 active+clean    6m     0'0 18414:26 [27,33,22,6,13,34]p27 [27,33,22,6,13,34]p27 2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
>>> 50.2       0        0         0       0     0           0          0   0 active+clean    6m     0'0 18413:25 [1,28,14,4,31,21]p1   [1,28,14,4,31,21]p1   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
>>> 50.3       0        0         0       0     0           0          0   0 active+clean    6m     0'0 18413:24 [8,16,26,33,7,25]p8   [8,16,26,33,7,25]p8   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
>>>
>>> After stopping all OSDs on one host I was still able to read and
>>> write into the pool, but after stopping a second host one PG from
>>> that pool went "down". That I don't fully understand yet, but I
>>> just started to look into it.
>>> With your setup (12 hosts) I would recommend not utilizing all of
>>> them, so you have capacity to recover, let's say one "spare" host
>>> per DC, leaving 9 hosts in total. A profile with k=3 m=3 l=2 could
>>> make sense here, resulting in 9 total chunks (one more parity
>>> chunk for every other OSD), min_size 4. But as I wrote, it
>>> probably doesn't have the resiliency for a DC failure, so that
>>> needs some further investigation.
>>>
>>> Regards,
>>> Eugen
>>>
>>> Quoting Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:
>>>
>>>> Hi,
>>>>
>>>> No... our current setup is 3 datacenters with the same
>>>> configuration, i.e. 1 mon/mgr + 4 OSD servers with 16 OSDs each,
>>>> thus a total of 12 OSD servers. As, with the LRC plugin, k+m must
>>>> be a multiple of l, I found that k=9/m=6/l=5 with
>>>> crush-locality=datacenter was achieving my goal of being
>>>> resilient to a datacenter failure. Because of this, I considered
>>>> that lowering the crush failure domain to osd was not a major
>>>> issue in my case (as it would not be worse than a datacenter
>>>> failure if all the shards are on the same server in a datacenter)
>>>> and was a workaround for the lack of hosts for k=9/m=6 (15 OSDs).
>>>>
>>>> Maybe it helps if I give the erasure code profile used:
>>>>
>>>> crush-device-class=hdd
>>>> crush-failure-domain=osd
>>>> crush-locality=datacenter
>>>> crush-root=default
>>>> k=9
>>>> l=5
>>>> m=6
>>>> plugin=lrc
>>>>
>>>> The previously mentioned strange min_size for the pool created
>>>> with this profile has vanished after the Quincy upgrade, as this
>>>> parameter is no longer in the CRUSH map rule, and the `ceph osd
>>>> pool get` command reports the expected number (10):
>>>>
>>>> ---------
>>>>
>>>>> ceph osd pool get fink-z1.rgw.buckets.data min_size
>>>> min_size: 10
>>>> --------
>>>>
>>>> Cheers,
>>>>
>>>> Michel
>>>>
>>>> On 29/04/2023 at 20:36, Curt wrote:
>>>>> Hello,
>>>>>
>>>>> What is your current setup, 1 server per data center with 12 OSDs
>>>>> each? What is your current crush rule and LRC crush rule?
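
A note in passing, in case it helps to gather that information: the
usual commands should show the current layout, rule and profile (the
rule and profile names below are just the ones mentioned elsewhere in
this thread, adjust to your own):

ceph:~ # ceph osd tree
ceph:~ # ceph osd pool ls detail
ceph:~ # ceph osd crush rule dump test_lrc_2
ceph:~ # ceph osd erasure-code-profile get LRCprofile
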
>>>>>
>>>>> On Fri, Apr 28, 2023, 12:29 Michel Jouvin
>>>>> <michel.jouvin@xxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I think I found a possible cause of my PG down but still don't
>>>>> understand why. As explained in a previous mail, I set up a
>>>>> 15-chunk/OSD EC pool (k=9, m=6) but I have only 12 OSD servers in
>>>>> the cluster. To work around the problem I defined the failure
>>>>> domain as 'osd', with the reasoning that, as I was using the LRC
>>>>> plugin, I had the guarantee that I could lose a site without
>>>>> impact, and thus could afford to lose 1 OSD server. Am I wrong?
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Michel
>>>>>
>>>>> On 24/04/2023 at 13:24, Michel Jouvin wrote:
>>>>> > Hi,
>>>>> >
>>>>> > I'm still interested in getting feedback from those using the LRC
>>>>> > plugin about the right way to configure it... Last week I upgraded
>>>>> > from Pacific to Quincy (17.2.6) with cephadm, which does the
>>>>> > upgrade host by host, checking if an OSD is ok to stop before
>>>>> > actually upgrading it. I was surprised to see 1 or 2 PGs down at
>>>>> > some points in the upgrade (it didn't happen for all OSDs, but it
>>>>> > did for every site/datacenter). Looking at the details with "ceph
>>>>> > health detail", I saw that for these PGs there were 3 OSDs down,
>>>>> > but I was expecting the pool to be resilient to 6 OSDs down (5 for
>>>>> > R/W access), so I'm wondering if there is something wrong in our
>>>>> > pool configuration (k=9, m=6, l=5).
>>>>> >
>>>>> > Cheers,
>>>>> >
>>>>> > Michel
>>>>> >
>>>>> > On 06/04/2023 at 08:51, Michel Jouvin wrote:
>>>>> >> Hi,
>>>>> >>
>>>>> >> Is somebody using the LRC plugin?
>>>>> >>
>>>>> >> I came to the conclusion that LRC k=9, m=3, l=4 is not the same
>>>>> >> as jerasure k=9, m=6 in terms of protection against failures, and
>>>>> >> that I should use k=9, m=6, l=5 to get a level of resilience >=
>>>>> >> jerasure k=9, m=6. The example in the documentation (k=4, m=2,
>>>>> >> l=3) suggests that this LRC configuration gives something better
>>>>> >> than jerasure k=4, m=2, as it is resilient to 3 drive failures
>>>>> >> (but not 4, if I understood properly). So how many drives can
>>>>> >> fail in the k=9, m=6, l=5 configuration, first without losing RW
>>>>> >> access and second without losing data?
>>>>> >>
>>>>> >> Another thing that I don't quite understand is that a pool
>>>>> >> created with this configuration (and failure domain=osd,
>>>>> >> locality=datacenter) has min_size=3 (max_size=18 as expected). It
>>>>> >> seems wrong to me; I'd have expected something ~10 (depending on
>>>>> >> the answer to the previous question)...
>>>>> >>
>>>>> >> Thanks in advance if somebody could provide some sort of
>>>>> >> authoritative answer on these 2 questions. Best regards,
>>>>> >>
>>>>> >> Michel
>>>>> >>
>>>>> >> On 04/04/2023 at 15:53, Michel Jouvin wrote:
>>>>> >>> Answering myself, I found the reason for 2147483647: it's
>>>>> >>> documented as a failure to find enough OSDs (missing OSDs). And
>>>>> >>> it is normal, as I selected different hosts for the 15 OSDs but
>>>>> >>> I have only 12 hosts!
>>>>> >>>
>>>>> >>> I'm still interested in an "expert" confirming that the LRC
>>>>> >>> k=9, m=3, l=4 configuration is equivalent, in terms of
>>>>> >>> redundancy, to a jerasure configuration with k=9, m=6.
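
For what it's worth (I'm not the expert asked for, so take this as a
hedged reading of the docs rather than an authoritative answer), the
chunk counts alone suggest the two are not equivalent: with the simple
k/m/l form the plugin stores k + m + (k+m)/l chunks, so k=9/m=3/l=4
gives 9+3+3 = 15 chunks with only m=3 global coding chunks, while
k=9/m=6/l=5 gives 9+6+3 = 18 chunks (matching the max_size=18 mentioned
above) with m=6. The local parity chunks mainly reduce recovery traffic
within a locality group; in the worst case (e.g. several chunks lost in
the same group) the tolerance is still governed by m, so k=9/m=3/l=4
should not be expected to match jerasure k=9/m=6. The layout a profile
actually generates can be checked with (profile name from the test
above, adjust to yours):

ceph:~ # ceph osd erasure-code-profile get LRCprofile
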
>>>>> >>>
>>>>> >>> Michel
>>>>> >>>
>>>>> >>> On 04/04/2023 at 15:26, Michel Jouvin wrote:
>>>>> >>>> Hi,
>>>>> >>>>
>>>>> >>>> As discussed in another thread (Crushmap rule for multi-datacenter
>>>>> >>>> erasure coding), I'm trying to create an EC pool spanning 3
>>>>> >>>> datacenters (datacenters are present in the crushmap), with the
>>>>> >>>> objective of being resilient to 1 DC down, at least keeping
>>>>> >>>> read-only access to the pool and if possible read-write access,
>>>>> >>>> and with a storage efficiency better than 3-replica (let's say a
>>>>> >>>> storage overhead <= 2).
>>>>> >>>>
>>>>> >>>> In the discussion, somebody mentioned the LRC plugin as a possible
>>>>> >>>> jerasure alternative to implement this without tweaking the
>>>>> >>>> crushmap rule to implement the 2-step OSD allocation. I looked at
>>>>> >>>> the documentation
>>>>> >>>> (https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/)
>>>>> >>>> but I have some questions, if someone has experience/expertise
>>>>> >>>> with this LRC plugin.
>>>>> >>>>
>>>>> >>>> I tried to create a rule using 5 OSDs per datacenter (15 in
>>>>> >>>> total), with 3 per datacenter (9 in total) being data chunks and
>>>>> >>>> the others being coding chunks. For this, based on my
>>>>> >>>> understanding of the examples, I used k=9, m=3, l=4. Is that
>>>>> >>>> right? Is this configuration equivalent, in terms of redundancy,
>>>>> >>>> to a jerasure configuration with k=9, m=6?
>>>>> >>>>
>>>>> >>>> The resulting rule, which looks correct to me, is:
>>>>> >>>>
>>>>> >>>> --------
>>>>> >>>>
>>>>> >>>> {
>>>>> >>>>     "rule_id": 6,
>>>>> >>>>     "rule_name": "test_lrc_2",
>>>>> >>>>     "ruleset": 6,
>>>>> >>>>     "type": 3,
>>>>> >>>>     "min_size": 3,
>>>>> >>>>     "max_size": 15,
>>>>> >>>>     "steps": [
>>>>> >>>>         {
>>>>> >>>>             "op": "set_chooseleaf_tries",
>>>>> >>>>             "num": 5
>>>>> >>>>         },
>>>>> >>>>         {
>>>>> >>>>             "op": "set_choose_tries",
>>>>> >>>>             "num": 100
>>>>> >>>>         },
>>>>> >>>>         {
>>>>> >>>>             "op": "take",
>>>>> >>>>             "item": -4,
>>>>> >>>>             "item_name": "default~hdd"
>>>>> >>>>         },
>>>>> >>>>         {
>>>>> >>>>             "op": "choose_indep",
>>>>> >>>>             "num": 3,
>>>>> >>>>             "type": "datacenter"
>>>>> >>>>         },
>>>>> >>>>         {
>>>>> >>>>             "op": "chooseleaf_indep",
>>>>> >>>>             "num": 5,
>>>>> >>>>             "type": "host"
>>>>> >>>>         },
>>>>> >>>>         {
>>>>> >>>>             "op": "emit"
>>>>> >>>>         }
>>>>> >>>>     ]
>>>>> >>>> }
>>>>> >>>>
>>>>> >>>> ------------
>>>>> >>>>
>>>>> >>>> Unfortunately, it doesn't work as expected: a pool created with
>>>>> >>>> this rule ends up with its PGs active+undersized, which is
>>>>> >>>> unexpected to me. Looking at `ceph health detail` output, I see
>>>>> >>>> for each PG something like:
>>>>> >>>>
>>>>> >>>> pg 52.14 is stuck undersized for 27m, current state
>>>>> >>>> active+undersized, last acting
>>>>> >>>> [90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647]
>>>>> >>>>
>>>>> >>>> For each PG, there are 3 '2147483647' entries and I guess that is
>>>>> >>>> the reason for the problem. What are these entries about? Clearly
>>>>> >>>> they are not OSD entries... It looks like a negative number, -1,
>>>>> >>>> which in terms of crushmap IDs is the crushmap root (named
>>>>> >>>> "default" in our configuration). Any trivial mistake I could have
>>>>> >>>> made?
>>>>> >>>>
>>>>> >>>> Thanks in advance for any help, or for sharing any successful
>>>>> >>>> configuration!
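
Regarding the 2147483647 entries: that value is 2^31 - 1, the
placeholder CRUSH reports when it cannot find an OSD for a slot (newer
releases print NONE instead, as in the health output at the top of this
mail), so it is not an OSD id and not the crushmap root. It simply
means the rule could not place all 15 chunks. Untested suggestion, but
the rule can be checked offline with crushtool (rule id 6 taken from
the dump above); any PG reported as a bad mapping means CRUSH could not
fill all slots:

ceph:~ # ceph osd getcrushmap -o crushmap.bin
ceph:~ # crushtool -i crushmap.bin --test --rule 6 --num-rep 15 --show-bad-mappings
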
>>>>> >>>>
>>>>> >>>> Best regards,
>>>>> >>>>
>>>>> >>>> Michel
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx