Re: Help needed to configure erasure coding LRC plugin

Hi,

No... our current setup is 3 datacenters with the same configuration, i.e. 1 mon/mgr + 4 OSD servers with 16 OSDs each, for a total of 12 OSD servers. As k+m must be a multiple of l with the LRC plugin, I found that k=9/m=6/l=5 with crush-locality=datacenter achieved my goal of being resilient to a datacenter failure. Because of this, I considered that lowering the crush failure domain to osd was not a major issue in my case (as it would not be worse than a datacenter failure if all the shards are on the same server in a datacenter), and it worked around the lack of hosts for k=9/m=6 (15 OSDs).

Maybe it helps if I give the erasure code profile used:

crush-device-class=hdd
crush-failure-domain=osd
crush-locality=datacenter
crush-root=default
k=9
l=5
m=6
plugin=lrc
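
For reference, the chunk arithmetic behind this profile can be sketched in a few lines of Python. This is only an illustration, assuming (as the LRC plugin documentation describes) that one local parity chunk is added per group of l chunks. The helper name `lrc_layout` and its `localities` argument are made up for this example and are not part of Ceph.

```python
def lrc_layout(k, m, l, localities=3):
    """Return basic chunk counts for an LRC profile (illustrative helper, not a Ceph API)."""
    if (k + m) % l != 0:
        raise ValueError("k+m must be a multiple of l")
    local_parity = (k + m) // l      # assumption: one local parity chunk per group of l chunks
    total = k + m + local_parity     # chunks stored for each object
    return {
        "local_parity": local_parity,
        "total_chunks": total,
        "chunks_per_locality": total // localities,
        "overhead": total / k,       # raw bytes stored per byte of user data
    }


# Profile from this mail, then the earlier k=9/m=3/l=4 attempt.
print(lrc_layout(9, 6, 5))
print(lrc_layout(9, 3, 4))
```

Under that assumption, k=9/m=6/l=5 gives 18 chunks in total (the max_size=18 mentioned earlier in the thread), i.e. 6 per datacenter with a storage overhead of 2, while k=9/m=3/l=4 gives 15 chunks, i.e. 5 per datacenter.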

The previously mentioned strange number for min_size for the pool created with this profile has vanished after the Quincy upgrade, as this parameter is no longer in the CRUSH map rule! And the `ceph osd pool get` command reports the expected number (10):

---------

> ceph osd pool get fink-z1.rgw.buckets.data min_size
min_size: 10
--------
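
A small sketch of how this check could be scripted, assuming the expectation used in this thread that min_size should be k+1 (10 for k=9). The function `parse_min_size` is a hypothetical helper that only parses the text printed above; it does not call Ceph.

```python
def parse_min_size(output):
    """Extract the integer from a line such as 'min_size: 10'."""
    key, _, value = output.strip().partition(":")
    if key.strip() != "min_size":
        raise ValueError(f"unexpected output: {output!r}")
    return int(value)


k = 9                                      # data chunks in the profile above
reported = parse_min_size("min_size: 10")  # text reported by `ceph osd pool get`
print(reported == k + 1)                   # assumption: the expected min_size is k+1
```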

Cheers,

Michel

On 29/04/2023 at 20:36, Curt wrote:
Hello,

What is your current setup, 1 server per data center with 12 OSDs each? What is your current crush rule and LRC crush rule?


On Fri, Apr 28, 2023, 12:29 Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx> wrote:

    Hi,

    I think I found a possible cause of my PG down but still don't
    understand why.
    As explained in a previous mail, I set up a 15-chunk/OSD EC pool (k=9,
    m=6) but I have only 12 OSD servers in the cluster. To work around the
    problem I defined the failure domain as 'osd', with the reasoning that,
    as I was using the LRC plugin, I had the guarantee that I could lose
    a site without impact, and thus could also lose 1 OSD server. Am I
    wrong?

    Best regards,

    Michel

    On 24/04/2023 at 13:24, Michel Jouvin wrote:
    > Hi,
    >
    > I'm still interested in getting feedback from those using the LRC
    > plugin about the right way to configure it... Last week I upgraded
    > from Pacific to Quincy (17.2.6) with cephadm, which does the
    > upgrade host by host, checking if an OSD is ok to stop before actually
    > upgrading it. I was surprised to see 1 or 2 PGs down at some points
    > in the upgrade (it did not happen for all OSDs but it did for every
    > site/datacenter). Looking at the details with "ceph health detail", I
    > saw that for these PGs there were 3 OSDs down, but I was expecting the
    > pool to be resilient to 6 OSDs down (5 for R/W access), so I'm
    > wondering if there is something wrong in our pool configuration (k=9,
    > m=6, l=5).
    >
    > Cheers,
    >
    > Michel
    >
    > On 06/04/2023 at 08:51, Michel Jouvin wrote:
    >> Hi,
    >>
    >> Is anybody using the LRC plugin?
    >>
    >> I came to the conclusion that LRC k=9, m=3, l=4 is not the same as
    >> jerasure k=9, m=6 in terms of protection against failures and that I
    >> should use k=9, m=6, l=5 to get a level of resilience >= jerasure
    >> k=9, m=6. The example in the documentation (k=4, m=2, l=3) suggests
    >> that this LRC configuration gives something better than jerasure k=4,
    >> m=2 as it is resilient to 3 drive failures (but not 4 if I understood
    >> properly). So how many drives can fail in the k=9, m=6, l=5
    >> configuration, first without losing RW access and second without
    >> losing data?
    >>
    >> Another thing that I don't quite understand is that a pool created
    >> with this configuration (and failure domain=osd, locality=datacenter)
    >> has a min_size=3 (max_size=18 as expected). It seems wrong to me, I'd
    >> have expected something ~10 (depending on the answer to the previous
    >> question)...
    >>
    >> Thanks in advance if somebody could provide some sort of
    >> authoritative answer on these 2 questions. Best regards,
    >>
    >> Michel
    >>
    >> On 04/04/2023 at 15:53, Michel Jouvin wrote:
    >>> Answering myself, I found the reason for 2147483647: it's
    >>> documented as a failure to find enough OSDs (missing OSDs). And it is
    >>> normal, as I selected different hosts for the 15 OSDs but I have only
    >>> 12 hosts!
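
The explanation above can be checked against the acting set quoted further down in the thread. This is a minimal sketch: `CRUSH_ITEM_NONE` is the name Ceph's CRUSH code uses for the value 0x7fffffff (2147483647), while `missing_shards` is a hypothetical helper written for this example. The host counts come from the setup described at the top of the thread (4 OSD servers in each of 3 datacenters).

```python
CRUSH_ITEM_NONE = 0x7FFFFFFF  # 2147483647: placeholder for a shard with no OSD assigned


def missing_shards(acting):
    """Return the shard positions in an acting set that have no OSD."""
    return [i for i, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE]


# Acting set of pg 52.14 as printed by `ceph health detail`.
acting = [90, 113, 2147483647, 103, 64, 147, 164, 177, 2147483647,
          133, 58, 28, 8, 32, 2147483647]
print(missing_shards(acting))

# The rule asks for 5 distinct hosts in each of 3 datacenters,
# but each datacenter only has 4 OSD servers.
hosts_per_dc, wanted_per_dc, datacenters = 4, 5, 3
shortfall = datacenters * max(0, wanted_per_dc - hosts_per_dc)
print(shortfall)
```

The missing positions are the last shard of each group of 5, which is consistent with one host too few per datacenter and a shortfall of 3 shards in total.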
    >>>
    >>> I'm still interested in having an "expert" confirm that the LRC k=9, m=3,
    >>> l=4 configuration is equivalent, in terms of redundancy, to a
    >>> jerasure configuration with k=9, m=6.
    >>>
    >>> Michel
    >>>
    >>> On 04/04/2023 at 15:26, Michel Jouvin wrote:
    >>>> Hi,
    >>>>
    >>>> As discussed in another thread (Crushmap rule for multi-datacenter
    >>>> erasure coding), I'm trying to create an EC pool spanning 3
    >>>> datacenters (datacenters are present in the crushmap), with the
    >>>> objective of being resilient to 1 DC down, at least keeping
    >>>> read-only access to the pool and if possible read-write access,
    >>>> and of having a storage efficiency better than 3 replicas (let's say a
    >>>> storage overhead <= 2).
    >>>>
    >>>> In the discussion, somebody mentioned the LRC plugin as a possible
    >>>> jerasure alternative to implement this without tweaking the
    >>>> crushmap rule to implement the 2-step OSD allocation. I looked at
    >>>> the documentation
    >>>> (https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/)
    >>>> but I have some questions if someone has experience/expertise with
    >>>> this LRC plugin.
    >>>>
    >>>> I tried to create a rule for using 5 OSDs per datacenter (15 in
    >>>> total), with 3 (9 in total) being data chunks and the others being
    >>>> coding chunks. For this, based on my understanding of the examples, I
    >>>> used k=9, m=3, l=4. Is it right? Is this configuration equivalent,
    >>>> in terms of redundancy, to a jerasure configuration with k=9, m=6?
    >>>>
    >>>> The resulting rule, which looks correct to me, is:
    >>>>
    >>>> --------
    >>>>
    >>>> {
    >>>>     "rule_id": 6,
    >>>>     "rule_name": "test_lrc_2",
    >>>>     "ruleset": 6,
    >>>>     "type": 3,
    >>>>     "min_size": 3,
    >>>>     "max_size": 15,
    >>>>     "steps": [
    >>>>         {
    >>>>             "op": "set_chooseleaf_tries",
    >>>>             "num": 5
    >>>>         },
    >>>>         {
    >>>>             "op": "set_choose_tries",
    >>>>             "num": 100
    >>>>         },
    >>>>         {
    >>>>             "op": "take",
    >>>>             "item": -4,
    >>>>             "item_name": "default~hdd"
    >>>>         },
    >>>>         {
    >>>>             "op": "choose_indep",
    >>>>             "num": 3,
    >>>>             "type": "datacenter"
    >>>>         },
    >>>>         {
    >>>>             "op": "chooseleaf_indep",
    >>>>             "num": 5,
    >>>>             "type": "host"
    >>>>         },
    >>>>         {
    >>>>             "op": "emit"
    >>>>         }
    >>>>     ]
    >>>> }
    >>>>
    >>>> ------------
    >>>>
    >>>> Unfortunately, it doesn't work as expected: a pool created with
    >>>> this rule ends up with its PGs active+undersized, which is
    >>>> unexpected for me. Looking at the `ceph health detail` output, I see
    >>>> for each PG something like:
    >>>>
    >>>> pg 52.14 is stuck undersized for 27m, current state active+undersized, last acting
    >>>> [90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647]
    >>>>
    >>>> For each PG, there are 3 '2147483647' entries and I guess this is the
    >>>> reason for the problem. What are these entries about? Clearly they are
    >>>> not OSD entries... It looks like a negative number, -1, which in terms
    >>>> of crushmap ID is the crushmap root (named "default" in our
    >>>> configuration). Any trivial mistake I would have made?
    >>>>
    >>>> Thanks in advance for any help or for sharing any successful
    >>>> configuration.
    >>>>
    >>>> Best regards,
    >>>>
    >>>> Michel
    >>>> _______________________________________________
    >>>> ceph-users mailing list -- ceph-users@xxxxxxx
    >>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx