Hi,

I've been following this thread with interest, as it seems like a unique use case to expand my knowledge. I don't use LRC or anything outside basic erasure coding. What are the steps of your current CRUSH rule? I know you made changes since your first post and I had some thoughts I wanted to share, but I wanted to see your rule first so I could try to visualize the distribution better. The only way I can currently visualize it working is with more servers, I'm thinking 6 or 9 per data center minimum, but that could be my lack of knowledge of some of the step rules.

Thanks,
Curt

On Tue, May 16, 2023 at 11:09 AM Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx> wrote:

> Hi Eugen,
>
> Yes, sure, no problem to share it. I attach it to this email (as it may clutter the discussion if inline).
>
> If somebody on the list has some clue on the LRC plugin, I'm still interested in understanding what I'm doing wrong!
>
> Cheers,
>
> Michel
>
> On 04/05/2023 at 15:07, Eugen Block wrote:
> > Hi,
> >
> > I don't think you've shared your osd tree yet, could you do that? Apparently nobody else but us reads this thread, or nobody reading this uses the LRC plugin. ;-)
> >
> > Thanks,
> > Eugen
> >
> > Quoting Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:
> >
> >> Hi,
> >>
> >> I had to restart one of my OSD servers today and the problem showed up again. This time I managed to capture "ceph health detail" output showing the problem with the 2 PGs:
> >>
> >> [WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2 pgs down
> >>     pg 56.1 is down, acting [208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
> >>     pg 56.12 is down, acting [NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]
> >>
> >> I still don't understand why, if I am supposed to survive a datacenter failure, I cannot survive 3 OSDs down on the same host hosting shards for the PG. In the second case it is only 2 OSDs down, but I'm surprised they don't seem to be in the same "group" of OSDs (I'd have expected all the OSDs of one datacenter to be in the same group of 5, if the order given really reflects the allocation done)...
> >>
> >> Still interested in some explanation of what I'm doing wrong! Best regards,
> >>
> >> Michel
> >>
> >> On 03/05/2023 at 10:21, Eugen Block wrote:
> >>> I think I got it wrong with the locality setting. I'm still limited by the number of hosts I have available in my test cluster, but as far as I got with failure-domain=osd, I believe k=6, m=3, l=3 with locality=datacenter could fit your requirement, at least with regards to the recovery bandwidth usage between DCs, but the resiliency would not match your requirement (one DC failure). That profile creates 3 groups of 4 chunks (3 data/coding chunks and one parity chunk) across three DCs, 12 chunks in total. The min_size=7 would not allow an entire DC to go down, I'm afraid; you'd have to reduce it to 6 to allow reads/writes in a disaster scenario. I'm still not sure if I got it right this time, but maybe you're better off without the LRC plugin given the limited number of hosts. Instead you could use the jerasure plugin with a profile like k=4 m=5, allowing an entire DC to fail without losing data access (we have one customer using that).
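[For readers who want to try the jerasure k=4 m=5 approach mentioned just above, here is a minimal sketch of what it could look like. It assumes three datacenter buckets already exist in the CRUSH map; the profile, rule and pool names and the PG counts are made up for illustration and are not the configuration used by Eugen's customer:

--------
# Jerasure profile: 4 data + 5 coding chunks = 9 chunks, 3 per datacenter,
# so a whole DC (3 chunks) can be lost without losing data.
ceph osd erasure-code-profile set ec-4-5 plugin=jerasure k=4 m=5 \
    crush-device-class=hdd crush-failure-domain=host

# The 2-step placement (3 datacenters x 3 hosts each) still needs a
# hand-written CRUSH rule, added via getcrushmap/crushtool/setcrushmap,
# along the lines of:
#   rule ec_4_5_3dc {
#       id 7
#       type erasure
#       step set_chooseleaf_tries 5
#       step set_choose_tries 100
#       step take default class hdd
#       step choose indep 3 type datacenter
#       step chooseleaf indep 3 type host
#       step emit
#   }

ceph osd pool create ec45pool 128 128 erasure ec-4-5 ec_4_5_3dc
ceph osd pool get ec45pool min_size   # defaults to k+1 = 5
--------

With 9 chunks and min_size 5, losing one DC leaves 6 chunks, so the pool stays readable and writable; the trade-off compared to LRC is that every recovery has to read chunks from the remote DCs, which is exactly the bandwidth concern raised above.]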
> >>>
> >>> Quoting Eugen Block <eblock@xxxxxx>:
> >>>
> >>>> Hi,
> >>>>
> >>>> disclaimer: I haven't used LRC in a real setup yet, so there might be some misunderstandings on my side. But I tried to play around with one of my test clusters (Nautilus). Because I'm limited in the number of hosts (6 across 3 virtual DCs) I tried two different profiles with lower numbers to get a feeling for how that works.
> >>>>
> >>>> # first attempt
> >>>> ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=4 m=2 l=3 crush-failure-domain=host
> >>>>
> >>>> For every third OSD one parity chunk is added, so 2 more chunks to store ==> 8 chunks in total. Since my failure domain is host and I only have 6 hosts, I get incomplete PGs.
> >>>>
> >>>> # second attempt
> >>>> ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=2 m=2 l=2 crush-failure-domain=host
> >>>>
> >>>> This gives me 6 chunks in total to store across 6 hosts, which works:
> >>>>
> >>>> ceph:~ # ceph pg ls-by-pool lrcpool
> >>>> PG    OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  STATE         SINCE  VERSION  REPORTED  UP                     ACTING                 SCRUB_STAMP                 DEEP_SCRUB_STAMP
> >>>> 50.0  1        0         0          0        619    0            0           1    active+clean  72s    18410'1  18415:54  [27,13,0,2,25,7]p27    [27,13,0,2,25,7]p27    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
> >>>> 50.1  0        0         0          0        0      0            0           0    active+clean  6m     0'0      18414:26  [27,33,22,6,13,34]p27  [27,33,22,6,13,34]p27  2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
> >>>> 50.2  0        0         0          0        0      0            0           0    active+clean  6m     0'0      18413:25  [1,28,14,4,31,21]p1    [1,28,14,4,31,21]p1    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
> >>>> 50.3  0        0         0          0        0      0            0           0    active+clean  6m     0'0      18413:24  [8,16,26,33,7,25]p8    [8,16,26,33,7,25]p8    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
> >>>>
> >>>> After stopping all OSDs on one host I was still able to read and write into the pool, but after stopping a second host one PG from that pool went "down". That I don't fully understand yet, but I just started to look into it.
> >>>>
> >>>> With your setup (12 hosts) I would recommend not utilizing all of them, so you have capacity to recover; let's say one "spare" host per DC, leaving 9 hosts in total. A profile with k=3 m=3 l=2 could make sense here, resulting in 9 total chunks (one more parity chunk for every other OSD), min_size 4. But as I wrote, it probably doesn't have the resiliency for a DC failure, so that needs some further investigation.
> >>>>
> >>>> Regards,
> >>>> Eugen
> >>>>
> >>>> Quoting Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> No... our current setup is 3 datacenters with the same configuration, i.e. 1 mon/mgr + 4 OSD servers with 16 OSDs each, thus a total of 12 OSD servers. As, with the LRC plugin, k+m must be a multiple of l, I found that k=9/m=6/l=5 with crush-locality=datacenter was achieving my goal of being resilient to a datacenter failure. Because of this, I considered that lowering the crush failure domain to osd was not a major issue in my case (as it would not be worse than a datacenter failure if all the shards are on the same server in a datacenter) and was working around the lack of hosts for k=9/m=6 (15 OSDs).
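[As a side note for anyone reproducing the experiment above: the LRC plugin stores k + m + (k+m)/l chunks per object, which matches the numbers quoted in this thread (k=4/m=2/l=3 gives 4+2+2 = 8 chunks, k=9/m=6/l=5 gives 9+6+3 = 18). A minimal sketch of the k=3/m=3/l=2 suggestion for 9 hosts follows; the profile and pool names are made up, and adding crush-locality=datacenter assumes datacenter buckets exist in the CRUSH map:

--------
# 3 data + 3 coding chunks, plus one local parity chunk per group of l=2
# chunks, i.e. (3+3)/2 = 3 extra chunks ==> 9 chunks, one per host on 9 hosts.
ceph osd erasure-code-profile set lrc-3-3-2 plugin=lrc k=3 m=3 l=2 \
    crush-failure-domain=host crush-locality=datacenter
ceph osd pool create lrctest 32 32 erasure lrc-3-3-2

# Each PG should then show 9 shards in its UP/ACTING sets:
ceph pg ls-by-pool lrctest
ceph osd pool get lrctest min_size    # defaults to k+1 = 4
--------

The point of l here is that a single lost chunk can be rebuilt from the other chunks of its locality group, which, with crush-locality=datacenter, means recovery traffic stays inside one DC.]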
> >>>>>
> >>>>> Maybe it helps if I give the erasure code profile used:
> >>>>>
> >>>>> crush-device-class=hdd
> >>>>> crush-failure-domain=osd
> >>>>> crush-locality=datacenter
> >>>>> crush-root=default
> >>>>> k=9
> >>>>> l=5
> >>>>> m=6
> >>>>> plugin=lrc
> >>>>>
> >>>>> The previously mentioned strange number for min_size for the pool created with this profile has vanished after the Quincy upgrade, as this parameter is no longer in the CRUSH map rule, and the `ceph osd pool get` command reports the expected number (10):
> >>>>>
> >>>>> ---------
> >>>>> > ceph osd pool get fink-z1.rgw.buckets.data min_size
> >>>>> min_size: 10
> >>>>> --------
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Michel
> >>>>>
> >>>>> On 29/04/2023 at 20:36, Curt wrote:
> >>>>>> Hello,
> >>>>>>
> >>>>>> What is your current setup, 1 server per data center with 12 OSDs each? What is your current crush rule and LRC crush rule?
> >>>>>>
> >>>>>> On Fri, Apr 28, 2023, 12:29 Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I think I found a possible cause of my PG down but still don't understand why. As explained in a previous mail, I set up a 15-chunk/OSD EC pool (k=9, m=6) but I have only 12 OSD servers in the cluster. To work around the problem I defined the failure domain as 'osd', with the reasoning that, as I was using the LRC plugin, I had the guarantee that I could lose a site without impact, thus the possibility to lose 1 OSD server. Am I wrong?
> >>>>>>
> >>>>>> Best regards,
> >>>>>>
> >>>>>> Michel
> >>>>>>
> >>>>>> On 24/04/2023 at 13:24, Michel Jouvin wrote:
> >>>>>> > Hi,
> >>>>>> >
> >>>>>> > I'm still interested in getting feedback from those using the LRC plugin about the right way to configure it... Last week I upgraded from Pacific to Quincy (17.2.6) with cephadm, which is doing the upgrade host by host, checking if an OSD is ok to stop before actually upgrading it. I had the surprise to see 1 or 2 PGs down at some points in the upgrade (it happened not for all OSDs, but for every site/datacenter). Looking at the details with "ceph health detail", I saw that for these PGs there were 3 OSDs down, but I was expecting the pool to be resilient to 6 OSDs down (5 for R/W access), so I'm wondering if there is something wrong in our pool configuration (k=9, m=6, l=5).
> >>>>>> >
> >>>>>> > Cheers,
> >>>>>> >
> >>>>>> > Michel
> >>>>>> >
> >>>>>> > On 06/04/2023 at 08:51, Michel Jouvin wrote:
> >>>>>> >> Hi,
> >>>>>> >>
> >>>>>> >> Is somebody using the LRC plugin?
> >>>>>> >>
> >>>>>> >> I came to the conclusion that LRC k=9, m=3, l=4 is not the same as jerasure k=9, m=6 in terms of protection against failures, and that I should use k=9, m=6, l=5 to get a level of resilience >= jerasure k=9, m=6. The example in the documentation (k=4, m=2, l=3) suggests that this LRC configuration gives something better than jerasure k=4, m=2, as it is resilient to 3 drive failures (but not 4 if I understood properly).
> >>>>>> >> So how many drives can fail in the k=9, m=6, l=5 configuration, first without losing RW access and second without losing data?
> >>>>>> >>
> >>>>>> >> Another thing that I don't quite understand is that a pool created with this configuration (and failure domain=osd, locality=datacenter) has a min_size=3 (max_size=18 as expected). It seems wrong to me; I'd have expected something ~10 (depending on the answer to the previous question)...
> >>>>>> >>
> >>>>>> >> Thanks in advance if somebody could provide some sort of authoritative answer on these 2 questions. Best regards,
> >>>>>> >>
> >>>>>> >> Michel
> >>>>>> >>
> >>>>>> >> On 04/04/2023 at 15:53, Michel Jouvin wrote:
> >>>>>> >>> Answering myself, I found the reason for 2147483647: it's documented as a failure to find enough OSDs (missing OSDs). And it is normal, as I selected different hosts for the 15 OSDs but I have only 12 hosts!
> >>>>>> >>>
> >>>>>> >>> I'm still interested in an "expert" confirming that the LRC k=9, m=3, l=4 configuration is equivalent, in terms of redundancy, to a jerasure configuration with k=9, m=6.
> >>>>>> >>>
> >>>>>> >>> Michel
> >>>>>> >>>
> >>>>>> >>> On 04/04/2023 at 15:26, Michel Jouvin wrote:
> >>>>>> >>>> Hi,
> >>>>>> >>>>
> >>>>>> >>>> As discussed in another thread (Crushmap rule for multi-datacenter erasure coding), I'm trying to create an EC pool spanning 3 datacenters (datacenters are present in the crushmap), with the objective to be resilient to 1 DC down, at least keeping read-only access to the pool and if possible the read-write access, and to have a storage efficiency better than 3 replicas (let's say a storage overhead <= 2).
> >>>>>> >>>>
> >>>>>> >>>> In the discussion, somebody mentioned the LRC plugin as a possible jerasure alternative to implement this without tweaking the crushmap rule to implement the 2-step OSD allocation. I looked at the documentation (https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/) but I have some questions, if someone has experience/expertise with this LRC plugin.
> >>>>>> >>>>
> >>>>>> >>>> I tried to create a rule for using 5 OSDs per datacenter (15 in total), with 3 (9 in total) being data chunks and the others being coding chunks. For this, based on my understanding of the examples, I used k=9, m=3, l=4. Is that right? Is this configuration equivalent, in terms of redundancy, to a jerasure configuration with k=9, m=6?
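[Putting the pieces from the quoted messages together, a sketch of the k=9/m=6/l=5 setup being discussed would look like this. The profile and pool names are made up (the real pool in this thread is fink-z1.rgw.buckets.data), as is the PG count:

--------
# 9 data + 6 coding chunks, plus (9+6)/5 = 3 local parity chunks ==> 18 chunks,
# 6 per datacenter with crush-locality=datacenter.
ceph osd erasure-code-profile set lrc-9-6-5 plugin=lrc k=9 m=6 l=5 \
    crush-root=default crush-device-class=hdd \
    crush-failure-domain=osd crush-locality=datacenter

ceph osd pool create lrcpool 256 256 erasure lrc-9-6-5

# Inspect the rule generated for the pool (named after the pool by default).
# Since Quincy the rule itself no longer carries min_size/max_size; the
# pool's min_size (k+1 = 10 by default) is what gates R/W availability.
ceph osd crush rule dump lrcpool
ceph osd pool get lrcpool min_size
--------

Note that with crush-failure-domain=osd the generated rule places 6 chunks per datacenter but does not force them onto 6 different hosts, which would be consistent with several shards of one PG landing on a single OSD server, as seen earlier in the thread.]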
> >>>>>> >>>>
> >>>>>> >>>> The resulting rule, which looks correct to me, is:
> >>>>>> >>>>
> >>>>>> >>>> --------
> >>>>>> >>>> {
> >>>>>> >>>>     "rule_id": 6,
> >>>>>> >>>>     "rule_name": "test_lrc_2",
> >>>>>> >>>>     "ruleset": 6,
> >>>>>> >>>>     "type": 3,
> >>>>>> >>>>     "min_size": 3,
> >>>>>> >>>>     "max_size": 15,
> >>>>>> >>>>     "steps": [
> >>>>>> >>>>         {
> >>>>>> >>>>             "op": "set_chooseleaf_tries",
> >>>>>> >>>>             "num": 5
> >>>>>> >>>>         },
> >>>>>> >>>>         {
> >>>>>> >>>>             "op": "set_choose_tries",
> >>>>>> >>>>             "num": 100
> >>>>>> >>>>         },
> >>>>>> >>>>         {
> >>>>>> >>>>             "op": "take",
> >>>>>> >>>>             "item": -4,
> >>>>>> >>>>             "item_name": "default~hdd"
> >>>>>> >>>>         },
> >>>>>> >>>>         {
> >>>>>> >>>>             "op": "choose_indep",
> >>>>>> >>>>             "num": 3,
> >>>>>> >>>>             "type": "datacenter"
> >>>>>> >>>>         },
> >>>>>> >>>>         {
> >>>>>> >>>>             "op": "chooseleaf_indep",
> >>>>>> >>>>             "num": 5,
> >>>>>> >>>>             "type": "host"
> >>>>>> >>>>         },
> >>>>>> >>>>         {
> >>>>>> >>>>             "op": "emit"
> >>>>>> >>>>         }
> >>>>>> >>>>     ]
> >>>>>> >>>> }
> >>>>>> >>>>
> >>>>>> >>>> ------------
> >>>>>> >>>>
> >>>>>> >>>> Unfortunately, it doesn't work as expected: a pool created with this rule ends up with its PGs active+undersized, which is unexpected for me. Looking at the `ceph health detail` output, I see for each PG something like:
> >>>>>> >>>>
> >>>>>> >>>> pg 52.14 is stuck undersized for 27m, current state active+undersized, last acting [90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647]
> >>>>>> >>>>
> >>>>>> >>>> For each PG, there are 3 '2147483647' entries and I guess this is the reason for the problem. What are these entries about? Clearly they are not OSD IDs... It looks like a negative number, -1, which in terms of crushmap IDs is the crushmap root (named "default" in our configuration). Is there any trivial mistake I could have made?
> >>>>>> >>>>
> >>>>>> >>>> Thanks in advance for any help or for sharing any successful configuration.
> >>>>>> >>>>
> >>>>>> >>>> Best regards,
> >>>>>> >>>>
> >>>>>> >>>> Michel

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
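[A closing note on the 2147483647 entries discussed in the quoted messages: that value is the largest 32-bit signed integer, which CRUSH uses as a placeholder meaning "no OSD found for this shard" (shown as NONE in some outputs), and it usually means the rule asks for more failure domains than the cluster can provide, as Michel concluded. A rule can be checked for this offline with crushtool before creating a pool; a sketch, assuming rule id 6 from the dump above and made-up file names:

--------
# Export the CRUSH map and test how many OSDs rule 6 actually maps
# when asked for 15 shards:
ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 6 --num-rep 15 --show-mappings
crushtool -i crushmap.bin --test --rule 6 --num-rep 15 --show-bad-mappings

# Any mapping reported as bad (fewer OSDs than requested) would show up
# on a real pool as NONE / 2147483647 entries in the acting set.
--------]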