Re: Help needed to configure erasure coding LRC plugin

Hi,

I don't think you've shared your osd tree yet, could you do that? Apparently nobody else but us reads this thread or nobody reading this uses the LRC plugin. ;-)

Thanks,
Eugen

Quoting Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:

Hi,

I had to restart one of my OSD servers today and the problem showed up again. This time I managed to capture the "ceph health detail" output showing the problem with the 2 PGs:

[WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2 pgs down
    pg 56.1 is down, acting [208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
    pg 56.12 is down, acting [NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]

I still don't understand why, if I am supposed to survive a datacenter failure, I cannot survive 3 OSDs down on the same host that hold shards for the PG. In the second case only 2 OSDs are down, but I'm surprised that they don't seem to be in the same "group" of OSDs (I'd expected all the OSDs of one datacenter to be in the same group of 5, if the order given really reflects the allocation done).
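To make the grouping question concrete, here is a minimal sketch that splits the two acting sets above into consecutive groups and counts the missing entries in each. It assumes (this is not confirmed anywhere in the thread) that the acting set lists the chunks datacenter by datacenter, and that each group holds l+1 = 6 chunks (5 plus one local parity chunk) rather than 5. The helper name `missing_per_group` is made up for this example.

```python
# Assumption: the acting set is laid out as consecutive per-datacenter
# groups of l + 1 = 6 chunks. This layout is inferred, not confirmed.
GROUP_SIZE = 6

NONE = None  # stands for the "NONE" entries in the health output


def missing_per_group(acting, group_size=GROUP_SIZE):
    """Count the missing (NONE) entries in each consecutive group."""
    groups = [acting[i:i + group_size] for i in range(0, len(acting), group_size)]
    return [sum(1 for osd in group if osd is None) for group in groups]


pg_56_1 = [208, 65, 73, 206, 197, 193, 144, 155, 178, 182, 183, 133,
           17, NONE, 36, NONE, 230, NONE]
pg_56_12 = [NONE, 236, 28, 228, 218, NONE, 215, 117, 203, 213, 204, 115,
            136, 181, 171, 162, 137, 128]

print(missing_per_group(pg_56_1))   # missing entries per group for pg 56.1
print(missing_per_group(pg_56_12))  # missing entries per group for pg 56.12
```

If that assumption holds, the two missing OSDs of pg 56.12 (positions 0 and 5) do fall into the same group of 6, and all three missing OSDs of pg 56.1 sit in the last group.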

Still interested in an explanation of what I'm doing wrong! Best regards,

Michel

On 03/05/2023 at 10:21, Eugen Block wrote:
I think I got it wrong with the locality setting. I'm still limited by the number of hosts available in my test cluster, but as far as I got with failure-domain=osd, I believe k=6, m=3, l=3 with locality=datacenter could fit your requirement, at least with regard to recovery bandwidth usage between DCs. The resiliency would not match your requirement (one DC failure), though. That profile creates 3 groups of 4 chunks (3 data/coding chunks and one parity chunk) across three DCs, 12 chunks in total. The min_size=7 would not allow an entire DC to go down, I'm afraid; you'd have to reduce it to 6 to allow reads/writes in a disaster scenario. I'm still not sure I got it right this time, but with the limited number of hosts maybe you're better off without the LRC plugin. Instead you could use the jerasure plugin with a profile like k=4 m=5, which allows an entire DC to fail without losing data access (we have one customer using that).
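The arithmetic behind the jerasure k=4 m=5 suggestion can be sketched as follows. Two assumptions are not stated in the message above: the 9 chunks are spread evenly over the 3 datacenters (3 per DC, which needs a matching CRUSH rule), and the pool uses min_size = k+1. The function name is made up for this example.

```python
def chunks_left_after_dc_loss(k, m, datacenters=3):
    """Chunks remaining after one datacenter is lost.

    Assumption: the k + m chunks are spread evenly over the datacenters.
    """
    total = k + m
    if total % datacenters != 0:
        raise ValueError("chunks cannot be spread evenly over the datacenters")
    per_dc = total // datacenters
    return total - per_dc


k, m = 4, 5
remaining = chunks_left_after_dc_loss(k, m)
min_size = k + 1  # assumption: min_size is k + 1 for this pool

print(f"remaining chunks: {remaining}")
print(f"data still decodable (remaining >= k): {remaining >= k}")
print(f"I/O allowed (remaining >= min_size): {remaining >= min_size}")
```

Under these assumptions 6 of the 9 chunks survive a DC failure, which is above both k=4 and the assumed min_size of 5.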

Quoting Eugen Block <eblock@xxxxxx>:

Hi,

disclaimer: I haven't used LRC in a real setup yet, so there might be some misunderstandings on my side. But I tried to play around with one of my test clusters (Nautilus). Because I'm limited in the number of hosts (6 across 3 virtual DCs) I tried two different profiles with lower numbers to get a feeling for how that works.

# first attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=4 m=2 l=3 crush-failure-domain=host

For every third OSD one parity chunk is added, so 2 more chunks to store ==> 8 chunks in total. Since my failure-domain is host and I only have 6 hosts, I get incomplete PGs.

# second attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=2 m=2 l=2 crush-failure-domain=host

This gives me 6 chunks in total to store across 6 hosts which works:

ceph:~ # ceph pg ls-by-pool lrcpool
PG   OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE        SINCE VERSION REPORTED UP                    ACTING                SCRUB_STAMP                DEEP_SCRUB_STAMP
50.0       1        0         0       0   619           0          0   1 active+clean   72s 18410'1 18415:54 [27,13,0,2,25,7]p27   [27,13,0,2,25,7]p27   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
50.1       0        0         0       0     0           0          0   0 active+clean    6m     0'0 18414:26 [27,33,22,6,13,34]p27 [27,33,22,6,13,34]p27 2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
50.2       0        0         0       0     0           0          0   0 active+clean    6m     0'0 18413:25 [1,28,14,4,31,21]p1   [1,28,14,4,31,21]p1   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
50.3       0        0         0       0     0           0          0   0 active+clean    6m     0'0 18413:24 [8,16,26,33,7,25]p8   [8,16,26,33,7,25]p8   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
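The chunk counts from the two attempts above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic for the simple k/m/l form of the profile (one extra local parity chunk per group of l chunks); the function name `lrc_total_chunks` is made up for this example and is not part of any Ceph tooling.

```python
def lrc_total_chunks(k, m, l):
    """Total chunks for a simple LRC profile.

    k data chunks + m coding chunks, plus one local parity chunk
    for every group of l chunks.
    """
    if (k + m) % l != 0:
        raise ValueError("k + m must be a multiple of l")
    return k + m + (k + m) // l


hosts = 6  # hosts in the test cluster, with crush-failure-domain=host

for k, m, l in [(4, 2, 3), (2, 2, 2)]:
    total = lrc_total_chunks(k, m, l)
    print(f"k={k} m={m} l={l}: {total} chunks, fits on {hosts} hosts: {total <= hosts}")
```

The first profile needs 8 chunks and does not fit on 6 hosts; the second needs exactly 6.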

After stopping all OSDs on one host I was still able to read and write into the pool, but after stopping a second host one PG from that pool went "down". That I don't fully understand yet, but I just started to look into it. With your setup (12 hosts) I would recommend not utilizing all of them so you have capacity to recover, let's say one "spare" host per DC, leaving 9 hosts in total. A profile with k=3 m=3 l=2 could make sense here, resulting in 9 total chunks (one more parity chunk for every other OSD), min_size 4. But as I wrote, it probably doesn't have the resiliency for a DC failure, so that needs some further investigation.

Regards,
Eugen

Quoting Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:

Hi,

No... our current setup is 3 datacenters with the same configuration, i.e. 1 mon/mgr + 4 OSD servers with 16 OSDs each, thus a total of 12 OSD servers. As k+m must be a multiple of l with the LRC plugin, I found that k=9/m=6/l=5 with crush-locality=datacenter was achieving my goal of being resilient to a datacenter failure. Because of this, I considered that lowering the crush failure domain to osd was not a major issue in my case (it would not be worse than a datacenter failure if all the shards are on the same server in a datacenter), and it worked around the lack of hosts for k=9/m=6 (15 OSDs).

Maybe it helps if I give the erasure code profile used:

crush-device-class=hdd
crush-failure-domain=osd
crush-locality=datacenter
crush-root=default
k=9
l=5
m=6
plugin=lrc
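As a rough cross-check of this profile, the sketch below parses the key=value lines and derives the group layout, assuming the simple k/m/l form of the LRC plugin adds one local parity chunk per group of l chunks. The variable names are made up for this example.

```python
profile_text = """
crush-device-class=hdd
crush-failure-domain=osd
crush-locality=datacenter
crush-root=default
k=9
l=5
m=6
plugin=lrc
"""

# Turn the key=value lines into a dictionary.
profile = dict(line.split("=", 1) for line in profile_text.split())
k, m, l = (int(profile[key]) for key in ("k", "m", "l"))

if (k + m) % l != 0:
    raise ValueError("k + m must be a multiple of l")

groups = (k + m) // l        # one group per crush-locality bucket
chunks_per_group = l + 1     # l chunks plus one local parity chunk (assumption)
total_chunks = groups * chunks_per_group

print(f"groups: {groups}")
print(f"chunks per group: {chunks_per_group}")
print(f"total chunks: {total_chunks}")
```

This gives 3 groups of 6 chunks, 18 in total, which is consistent with the max_size=18 reported for the pool in the earlier message quoted further down.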

The strange min_size value mentioned previously for the pool created with this profile has vanished after the Quincy upgrade, as this parameter is no longer in the CRUSH map rule, and the `ceph osd pool get` command reports the expected number (10):

---------

ceph osd pool get fink-z1.rgw.buckets.data min_size
min_size: 10
--------

Cheers,

Michel

On 29/04/2023 at 20:36, Curt wrote:
Hello,

What is your current setup, 1 server per data center with 12 OSDs each? What is your current crush rule and LRC crush rule?


On Fri, Apr 28, 2023, 12:29 Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx> wrote:

  Hi,

  I think I found a possible cause of my PG down, but I still don't
  understand why.
  As explained in a previous mail, I set up a 15-chunk/OSD EC pool (k=9,
  m=6) but I have only 12 OSD servers in the cluster. To work around the
  problem I defined the failure domain as 'osd', with the reasoning that,
  as I was using the LRC plugin, I had the guarantee that I could lose a
  site without impact, and thus could also lose 1 OSD server. Am I wrong?

  Best regards,

  Michel

  On 24/04/2023 at 13:24, Michel Jouvin wrote:
  > Hi,
  >
  > I'm still interested in getting feedback from those using the LRC
  > plugin about the right way to configure it... Last week I upgraded
  > from Pacific to Quincy (17.2.6) with cephadm, which does the
  > upgrade host by host, checking if an OSD is ok to stop before
  > actually upgrading it. I was surprised to see 1 or 2 PGs down at
  > some points in the upgrade (it did not happen for all OSDs, but it
  > did for every site/datacenter). Looking at the details with "ceph
  > health detail", I saw that for these PGs there were 3 OSDs down, but
  > I was expecting the pool to be resilient to 6 OSDs down (5 for R/W
  > access), so I'm wondering if there is something wrong in our pool
  > configuration (k=9, m=6, l=5).
  >
  > Cheers,
  >
  > Michel
  >
  > On 06/04/2023 at 08:51, Michel Jouvin wrote:
  >> Hi,
  >>
  >> Is anybody using the LRC plugin?
  >>
  >> I came to the conclusion that LRC k=9, m=3, l=4 is not the same as
  >> jerasure k=9, m=6 in terms of protection against failures and that I
  >> should use k=9, m=6, l=5 to get a level of resilience >= jerasure
  >> k=9, m=6. The example in the documentation (k=4, m=2, l=3) suggests
  >> that this LRC configuration gives something better than jerasure k=4,
  >> m=2 as it is resilient to 3 drive failures (but not 4 if I understood
  >> properly). So how many drives can fail in the k=9, m=6, l=5
  >> configuration, first without losing RW access and second without
  >> losing data?
  >>
  >> Another thing that I don't quite understand is that a pool created
  >> with this configuration (and failure domain=osd, locality=datacenter)
  >> has a min_size=3 (max_size=18 as expected). It seems wrong to me, I'd
  >> expected something ~10 (depending on the answer to the previous
  >> question)...
  >>
  >> Thanks in advance if somebody could provide some sort of
  >> authoritative answer on these 2 questions. Best regards,
  >>
  >> Michel
  >>
  >> On 04/04/2023 at 15:53, Michel Jouvin wrote:
  >>> Answering myself, I found the reason for 2147483647: it's
  >>> documented as a failure to find enough OSDs (missing OSDs). And it
  >>> is normal, as I selected different hosts for the 15 OSDs but I have
  >>> only 12 hosts!
  >>>
  >>> I'm still interested in an "expert" confirming that the LRC k=9, m=3,
  >>> l=4 configuration is equivalent, in terms of redundancy, to a
  >>> jerasure configuration with k=9, m=6.
  >>>
  >>> Michel
  >>>
  >>> On 04/04/2023 at 15:26, Michel Jouvin wrote:
  >>>> Hi,
  >>>>
  >>>> As discussed in another thread (Crushmap rule for multi-datacenter
  >>>> erasure coding), I'm trying to create an EC pool spanning 3
  >>>> datacenters (datacenters are present in the crushmap), with the
  >>>> objective of being resilient to 1 DC down, at least keeping
  >>>> read-only access to the pool and if possible read-write access,
  >>>> and having a storage efficiency better than 3 replicas (let's say a
  >>>> storage overhead <= 2).
  >>>>
  >>>> In the discussion, somebody mentioned the LRC plugin as a possible
  >>>> jerasure alternative to implement this without tweaking the
  >>>> crushmap rule to implement the 2-step OSD allocation. I looked at
  >>>> the documentation
  >>>> (https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/)
  >>>> but I have some questions if someone has experience/expertise with
  >>>> this LRC plugin.
  >>>>
  >>>> I tried to create a rule for using 5 OSDs per datacenter (15 in
  >>>> total), with 3 (9 in total) being data chunks and the others being
  >>>> coding chunks. For this, based on my understanding of the examples, I
  >>>> used k=9, m=3, l=4. Is it right? Is this configuration equivalent,
  >>>> in terms of redundancy, to a jerasure configuration with k=9, m=6?
  >>>>
  >>>> The resulting rule, which looks correct to me, is:
  >>>>
  >>>> --------
  >>>>
  >>>> {
  >>>>     "rule_id": 6,
  >>>>     "rule_name": "test_lrc_2",
  >>>>     "ruleset": 6,
  >>>>     "type": 3,
  >>>>     "min_size": 3,
  >>>>     "max_size": 15,
  >>>>     "steps": [
  >>>>         {
  >>>>             "op": "set_chooseleaf_tries",
  >>>>             "num": 5
  >>>>         },
  >>>>         {
  >>>>             "op": "set_choose_tries",
  >>>>             "num": 100
  >>>>         },
  >>>>         {
  >>>>             "op": "take",
  >>>>             "item": -4,
  >>>>             "item_name": "default~hdd"
  >>>>         },
  >>>>         {
  >>>>             "op": "choose_indep",
  >>>>             "num": 3,
  >>>>             "type": "datacenter"
  >>>>         },
  >>>>         {
  >>>>             "op": "chooseleaf_indep",
  >>>>             "num": 5,
  >>>>             "type": "host"
  >>>>         },
  >>>>         {
  >>>>             "op": "emit"
  >>>>         }
  >>>>     ]
  >>>> }
  >>>>
  >>>> ------------
  >>>>
  >>>> Unfortunately, it doesn't work as expected: a pool created with
  >>>> this rule ends up with its PGs active+undersized, which is
  >>>> unexpected for me. Looking at the 'ceph health detail' output, I see
  >>>> for each PG something like:
  >>>>
  >>>> pg 52.14 is stuck undersized for 27m, current state
  >>>> active+undersized, last acting
  >>>> [90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647]
  >>>>
  >>>> For each PG, there are 3 '2147483647' entries and I guess this is the
  >>>> reason for the problem. What are these entries about? Clearly they
  >>>> are not OSD entries... It looks like a negative number, -1, which in
  >>>> terms of crushmap ID is the crushmap root (named "default" in our
  >>>> configuration). Any trivial mistake I would have made?
  >>>>
  >>>> Thanks in advance for any help or for sharing any successful
  >>>> configuration.
  >>>>
  >>>> Best regards,
  >>>>
  >>>> Michel
  >>>> _______________________________________________
  >>>> ceph-users mailing list -- ceph-users@xxxxxxx
  >>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx