Hello,

Correct me if I'm wrong, but you have 18 servers between the 3 data centers? 9+5+6 is 20 chunks, so you still need 2 more servers to support that setup.

On Wed, 9 Oct 2024, 17:38 Michel Jouvin, <michel.jouvin@xxxxxxxxxxxxxxx> wrote:

> Hi,
>
> I am resurrecting this old thread that I started 18 months ago, after some new tests. I stopped my initial tests because the cluster I was using did not have enough OSDs to use 'host' as the failure domain. I was therefore using 'osd' as the failure domain, and I understood that this was unusual and probably not expected to work...
>
> Recently, in another cluster with 3 datacenters and 6 servers per datacenter (with 18 to 24 OSDs per server), I gave the LRC plugin another try. And the same thing happened again after one of the datacenters went down: all PGs from the EC pool using the LRC plugin went down. I don't really understand the reason, but I was wondering whether this plugin, which is still documented, is really supported and supposed to work in Reef. If not, I would like to avoid spending too much time troubleshooting it... If somebody is successfully using it, I'm interested to hear about it!
>
> My erasure code profile definition is:
>
> crush-device-class=hdd
> crush-failure-domain=host
> crush-locality=datacenter
> crush-root=default
> k=9
> l=5
> m=6
> plugin=lrc
>
> Best regards,
>
> Michel
>
> On 04/05/2023 at 12:51, Michel Jouvin wrote:
> > Hi,
> >
> > I had to restart one of my OSD servers today and the problem showed up again. This time I managed to capture the "ceph health detail" output showing the problem with the 2 PGs:
> >
> > [WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2 pgs down
> >     pg 56.1 is down, acting [208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
> >     pg 56.12 is down, acting [NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]
> >
> > I still don't understand why, if I am supposed to survive a datacenter failure, I cannot survive 3 OSDs down on the same host hosting shards for the PG. In the second case only 2 OSDs are down, but I'm surprised they don't seem to be in the same "group" of OSDs (I'd have expected all the OSDs of one datacenter to be in the same group of 5, if the order given really reflects the allocation done)...
> >
> > Still interested in some explanation of what I'm doing wrong! Best regards,
> >
> > Michel
> >
> > On 03/05/2023 at 10:21, Eugen Block wrote:
> >> I think I got it wrong with the locality setting. I'm still limited by the number of hosts I have available in my test cluster, but as far as I got with failure-domain=osd, I believe k=6, m=3, l=3 with locality=datacenter could fit your requirement, at least with regard to the recovery bandwidth usage between DCs; the resiliency, however, would not match your requirement (one DC failure). That profile creates 3 groups of 4 chunks (3 data/coding chunks and one parity chunk) across three DCs, 12 chunks in total. The min_size=7 would not allow an entire DC to go down, I'm afraid; you'd have to reduce it to 6 to allow reads/writes in a disaster scenario. I'm still not sure I got it right this time, but maybe you're better off without the LRC plugin given the limited number of hosts. Instead you could use the jerasure plugin with a profile like k=4 m=5, allowing an entire DC to fail without losing data access (we have one customer using that).
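[Editor's sketch] For reference, a minimal sketch of how the two alternatives suggested above could be declared. The profile and pool names are invented for the example; the remaining parameters follow the profiles quoted elsewhere in this thread.

# LRC variant (k=6, m=3, l=3): 3 local groups of 4 chunks, one group per datacenter.
ceph osd erasure-code-profile set lrc-k6m3l3 plugin=lrc k=6 m=3 l=3 \
    crush-locality=datacenter crush-failure-domain=host crush-device-class=hdd

# jerasure alternative (k=4, m=5): 9 chunks in total; losing one of 3 DCs removes
# 3 chunks and still leaves 6 >= k=4 available.
ceph osd erasure-code-profile set jer-k4m5 plugin=jerasure k=4 m=5 \
    crush-failure-domain=host crush-device-class=hdd

# Hypothetical test pool using one of the profiles.
ceph osd pool create lrc-test 32 32 erasure lrc-k6m3l3

Note that for the jerasure variant the even spread of 3 chunks per datacenter is not automatic: it needs a CRUSH rule with a choose step at the datacenter level, similar to the rule quoted at the end of this thread.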
> >>
> >> Quoting Eugen Block <eblock@xxxxxx>:
> >>
> >>> Hi,
> >>>
> >>> Disclaimer: I haven't used LRC in a real setup yet, so there might be some misunderstandings on my side. But I tried to play around with one of my test clusters (Nautilus). Because I'm limited in the number of hosts (6 across 3 virtual DCs), I tried two different profiles with lower numbers to get a feeling for how that works.
> >>>
> >>> # first attempt
> >>> ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=4 m=2 l=3 crush-failure-domain=host
> >>>
> >>> For every third OSD one parity chunk is added, so 2 more chunks to store ==> 8 chunks in total. Since my failure domain is host and I only have 6 hosts, I get incomplete PGs.
> >>>
> >>> # second attempt
> >>> ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=2 m=2 l=2 crush-failure-domain=host
> >>>
> >>> This gives me 6 chunks in total to store across 6 hosts, which works:
> >>>
> >>> ceph:~ # ceph pg ls-by-pool lrcpool
> >>> PG   OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE        SINCE VERSION REPORTED UP                    ACTING                SCRUB_STAMP                DEEP_SCRUB_STAMP
> >>> 50.0 1       0        0         0       619   0           0          1   active+clean 72s   18410'1 18415:54 [27,13,0,2,25,7]p27   [27,13,0,2,25,7]p27   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
> >>> 50.1 0       0        0         0       0     0           0          0   active+clean 6m    0'0     18414:26 [27,33,22,6,13,34]p27 [27,33,22,6,13,34]p27 2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
> >>> 50.2 0       0        0         0       0     0           0          0   active+clean 6m    0'0     18413:25 [1,28,14,4,31,21]p1   [1,28,14,4,31,21]p1   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
> >>> 50.3 0       0        0         0       0     0           0          0   active+clean 6m    0'0     18413:24 [8,16,26,33,7,25]p8   [8,16,26,33,7,25]p8   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
> >>>
> >>> After stopping all OSDs on one host I was still able to read and write into the pool, but after stopping a second host one PG from that pool went "down". That I don't fully understand yet, but I have just started to look into it.
> >>>
> >>> With your setup (12 hosts) I would recommend not utilizing all of them, so you have capacity to recover; let's say one "spare" host per DC, leaving 9 hosts in total. A profile with k=3 m=3 l=2 could make sense here, resulting in 9 chunks in total (one more parity chunk for every other OSD), min_size 4. But as I wrote, it probably doesn't have the resiliency for a DC failure, so that needs some further investigation.
> >>>
> >>> Regards,
> >>> Eugen
> >>>
> >>> Quoting Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:
> >>>
> >>>> Hi,
> >>>>
> >>>> No... our current setup is 3 datacenters with the same configuration, i.e. 1 mon/mgr + 4 OSD servers with 16 OSDs each, thus a total of 12 OSD servers. As with the LRC plugin k+m must be a multiple of l, I found that k=9/m=6/l=5 with crush-locality=datacenter was achieving my goal of being resilient to a datacenter failure. Because of this, I considered that lowering the CRUSH failure domain to osd was not a major issue in my case (it would not be worse than a datacenter failure if all the shards end up on the same server in a datacenter) and was a workaround for the lack of hosts for k=9/m=6 (15 OSDs).
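[Editor's sketch] As a quick sanity check of the chunk arithmetic discussed above: the LRC plugin stores one extra local parity chunk per group of l chunks, so the total is k + m + (k+m)/l, and k+m must be a multiple of l. The helper below is purely illustrative, not a Ceph command.

# Total LRC chunks = k + m + (k+m)/l.
lrc_chunks() { local k=$1 m=$2 l=$3; echo $(( k + m + (k + m) / l )); }

lrc_chunks 4 2 3   # -> 8,  Eugen's first test profile
lrc_chunks 2 2 2   # -> 6,  Eugen's second test profile
lrc_chunks 9 6 5   # -> 18, the k=9/m=6/l=5 profile, i.e. 18 buckets with failure-domain=host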
> >>>>
> >>>> Maybe it helps if I give the erasure code profile used:
> >>>>
> >>>> crush-device-class=hdd
> >>>> crush-failure-domain=osd
> >>>> crush-locality=datacenter
> >>>> crush-root=default
> >>>> k=9
> >>>> l=5
> >>>> m=6
> >>>> plugin=lrc
> >>>>
> >>>> The previously mentioned strange min_size value for the pool created with this profile has vanished after the Quincy upgrade, as this parameter is no longer in the CRUSH map rule, and the `ceph osd pool get` command reports the expected number (10):
> >>>>
> >>>> ---------
> >>>> > ceph osd pool get fink-z1.rgw.buckets.data min_size
> >>>> min_size: 10
> >>>> --------
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Michel
> >>>>
> >>>> On 29/04/2023 at 20:36, Curt wrote:
> >>>>> Hello,
> >>>>>
> >>>>> What is your current setup, 1 server per datacenter with 12 OSDs each? What are your current CRUSH rule and LRC CRUSH rule?
> >>>>>
> >>>>> On Fri, Apr 28, 2023, 12:29 Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I think I found a possible cause of my 'PG down' issue, but I still don't understand why. As explained in a previous mail, I set up a 15-chunk EC pool (k=9, m=6) but I have only 12 OSD servers in the cluster. To work around the problem I defined the failure domain as 'osd', with the reasoning that, as I was using the LRC plugin, I had the guarantee that I could lose a site without impact, and thus the possibility to lose 1 OSD server. Am I wrong?
> >>>>>
> >>>>> Best regards,
> >>>>>
> >>>>> Michel
> >>>>>
> >>>>> On 24/04/2023 at 13:24, Michel Jouvin wrote:
> >>>>> > Hi,
> >>>>> >
> >>>>> > I'm still interested in getting feedback from those using the LRC plugin about the right way to configure it... Last week I upgraded from Pacific to Quincy (17.2.6) with cephadm, which does the upgrade host by host, checking if an OSD is OK to stop before actually upgrading it. I was surprised to see 1 or 2 PGs down at some points in the upgrade (it did not happen for all OSDs, but it did happen for every site/datacenter). Looking at the details with "ceph health detail", I saw that for these PGs there were 3 OSDs down, but I was expecting the pool to be resilient to 6 OSDs down (5 for R/W access), so I'm wondering if there is something wrong in our pool configuration (k=9, m=6, l=5).
> >>>>> >
> >>>>> > Cheers,
> >>>>> >
> >>>>> > Michel
> >>>>> >
> >>>>> > On 06/04/2023 at 08:51, Michel Jouvin wrote:
> >>>>> >> Hi,
> >>>>> >>
> >>>>> >> Is somebody using the LRC plugin?
> >>>>> >>
> >>>>> >> I came to the conclusion that LRC k=9, m=3, l=4 is not the same as jerasure k=9, m=6 in terms of protection against failures, and that I should use k=9, m=6, l=5 to get a level of resilience >= jerasure k=9, m=6. The example in the documentation (k=4, m=2, l=3) suggests that this LRC configuration gives something better than jerasure k=4, m=2, as it is resilient to 3 drive failures (but not 4, if I understood properly). So how many drives can fail in the k=9, m=6, l=5 configuration, first without losing RW access and second without losing data?
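[Editor's sketch] One way to probe those resilience questions empirically on a test pool, along the lines of Eugen's experiment quoted above. The pool and object names here are made up.

rados -p lrcpool put testobj /etc/hosts     # write an object before the failure
# ...stop all OSDs on one host (or in one datacenter), then:
ceph health detail                          # any PGs down or undersized?
ceph pg ls-by-pool lrcpool
rados -p lrcpool get testobj -              # is the pool still readable?
rados -p lrcpool put testobj2 /etc/hosts    # still writable (depends on min_size)?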
> >>>>> >>
> >>>>> >> Another thing that I don't quite understand is that a pool created with this configuration (and failure domain=osd, locality=datacenter) has min_size=3 (max_size=18, as expected). It seems wrong to me; I'd have expected something around 10 (depending on the answer to the previous question)...
> >>>>> >>
> >>>>> >> Thanks in advance if somebody could provide some sort of authoritative answer on these 2 questions. Best regards,
> >>>>> >>
> >>>>> >> Michel
> >>>>> >>
> >>>>> >> On 04/04/2023 at 15:53, Michel Jouvin wrote:
> >>>>> >>> Answering myself: I found the reason for 2147483647. It is documented as a failure to find enough OSDs (missing OSDs). And it is normal, as I selected different hosts for the 15 OSDs but I have only 12 hosts!
> >>>>> >>>
> >>>>> >>> I'm still interested in an "expert" confirming that the LRC k=9, m=3, l=4 configuration is equivalent, in terms of redundancy, to a jerasure configuration with k=9, m=6.
> >>>>> >>>
> >>>>> >>> Michel
> >>>>> >>>
> >>>>> >>> On 04/04/2023 at 15:26, Michel Jouvin wrote:
> >>>>> >>>> Hi,
> >>>>> >>>>
> >>>>> >>>> As discussed in another thread (Crushmap rule for multi-datacenter erasure coding), I'm trying to create an EC pool spanning 3 datacenters (the datacenters are present in the crushmap), with the objective of being resilient to 1 DC down, at least keeping read-only access to the pool and if possible read-write access, and having a storage efficiency better than 3 replicas (let's say a storage overhead <= 2).
> >>>>> >>>>
> >>>>> >>>> In the discussion, somebody mentioned the LRC plugin as a possible alternative to jerasure to implement this without tweaking the crushmap rule to implement the 2-step OSD allocation. I looked at the documentation (https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/) but I have some questions, if someone has experience/expertise with this LRC plugin.
> >>>>> >>>>
> >>>>> >>>> I tried to create a rule using 5 OSDs per datacenter (15 in total), with 3 per datacenter (9 in total) being data chunks and the others being coding chunks. For this, based on my understanding of the examples, I used k=9, m=3, l=4. Is that right? Is this configuration equivalent, in terms of redundancy, to a jerasure configuration with k=9, m=6?
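[Editor's sketch] For context, a sketch of how such a profile can be declared so that the LRC plugin generates the datacenter-aware rule shown next by itself; the profile and pool names are invented for the example.

ceph osd erasure-code-profile set lrc-k9m3l4 plugin=lrc k=9 m=3 l=4 \
    crush-locality=datacenter crush-failure-domain=host crush-device-class=hdd

# (9+3)/4 = 3 local groups of 5 chunks, i.e. 15 chunks spread over the 3 datacenters.
ceph osd pool create lrc-test 32 32 erasure lrc-k9m3l4

# Inspect the rule the plugin generated for the pool (typically named after the pool).
ceph osd crush rule dump lrc-test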
> >>>>> >>>>
> >>>>> >>>> The resulting rule, which looks correct to me, is:
> >>>>> >>>>
> >>>>> >>>> --------
> >>>>> >>>>
> >>>>> >>>> {
> >>>>> >>>>     "rule_id": 6,
> >>>>> >>>>     "rule_name": "test_lrc_2",
> >>>>> >>>>     "ruleset": 6,
> >>>>> >>>>     "type": 3,
> >>>>> >>>>     "min_size": 3,
> >>>>> >>>>     "max_size": 15,
> >>>>> >>>>     "steps": [
> >>>>> >>>>         {
> >>>>> >>>>             "op": "set_chooseleaf_tries",
> >>>>> >>>>             "num": 5
> >>>>> >>>>         },
> >>>>> >>>>         {
> >>>>> >>>>             "op": "set_choose_tries",
> >>>>> >>>>             "num": 100
> >>>>> >>>>         },
> >>>>> >>>>         {
> >>>>> >>>>             "op": "take",
> >>>>> >>>>             "item": -4,
> >>>>> >>>>             "item_name": "default~hdd"
> >>>>> >>>>         },
> >>>>> >>>>         {
> >>>>> >>>>             "op": "choose_indep",
> >>>>> >>>>             "num": 3,
> >>>>> >>>>             "type": "datacenter"
> >>>>> >>>>         },
> >>>>> >>>>         {
> >>>>> >>>>             "op": "chooseleaf_indep",
> >>>>> >>>>             "num": 5,
> >>>>> >>>>             "type": "host"
> >>>>> >>>>         },
> >>>>> >>>>         {
> >>>>> >>>>             "op": "emit"
> >>>>> >>>>         }
> >>>>> >>>>     ]
> >>>>> >>>> }
> >>>>> >>>>
> >>>>> >>>> ------------
> >>>>> >>>>
> >>>>> >>>> Unfortunately, it doesn't work as expected: a pool created with this rule ends up with its PGs active+undersized, which is unexpected to me. Looking at the `ceph health detail` output, I see something like this for each PG:
> >>>>> >>>>
> >>>>> >>>> pg 52.14 is stuck undersized for 27m, current state active+undersized, last acting [90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647]
> >>>>> >>>>
> >>>>> >>>> For each PG there are 3 '2147483647' entries, and I guess this is the reason for the problem. What are these entries? Clearly they are not OSD entries... It looks like a negative number, -1, which in terms of crushmap IDs is the crushmap root (named "default" in our configuration). Is there any trivial mistake I could have made?
> >>>>> >>>>
> >>>>> >>>> Thanks in advance for any help, or for sharing any successful configuration.
> >>>>> >>>>
> >>>>> >>>> Best regards,
> >>>>> >>>>
> >>>>> >>>> Michel
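[Editor's sketch] As noted later in the thread, 2147483647 (2^31 - 1) is the placeholder CRUSH uses when it cannot find enough OSDs for a PG; it is the same missing-OSD marker that shows up as NONE in the newer health output at the top of this thread. A few commands that help confirm where the mapping falls short, using the PG id and rule name from the messages above:

ceph pg map 52.14                     # up/acting sets for the stuck PG
ceph osd crush rule dump test_lrc_2   # the rule the pool is using
ceph osd crush tree                   # how many hosts actually exist per datacenter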