Re: is LRC plugin still maintained/supposed to work in Reef?

Hi,

I haven't seen any updates in the tracker issue [0]. I'm still convinced that LRC doesn't work as expected, but I'd like to get confirmation from the devs. The last response via email was that it doesn't have priority, although according to telemetry data the plugin does appear to be in use. I'll ping Radek again.

[0] https://tracker.ceph.com/issues/61861

Quoting Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:

Hi,

I am resurrecting this old thread that I started 18 months ago, after some new tests. I had stopped my initial tests as the cluster I was using did not have enough OSD servers to use 'host' as the failure domain. Thus I was using 'osd' as the failure domain, and I understood that this was unusual and probably not expected to work...

Recently, in another cluster with 3 datacenters and 6 servers per datacenter (with 18 to 24 OSDs per server), I gave the LRC plugin another try. And the same thing happened again when one of the datacenters went down: all PGs from the EC pool using the LRC plugin went down. I don't really understand the reason, but I was wondering if this plugin, which is still documented, is really supported and supposed to work in Reef? If not, I would like to avoid spending too much time troubleshooting it... If somebody is successfully using it, I'm interested to hear about it!

My erasure code profile definition is:

crush-device-class=hdd
crush-failure-domain=host
crush-locality=datacenter
crush-root=default
k=9
l=5
m=6
plugin=lrc
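
In case it helps, such a profile can be created in one go with something like this (just a sketch, the profile name is an arbitrary example):

ceph osd erasure-code-profile set lrc_dc_k9m6l5 \
    plugin=lrc k=9 m=6 l=5 \
    crush-root=default crush-device-class=hdd \
    crush-failure-domain=host crush-locality=datacenter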

Best regards,

Michel

On 04/05/2023 at 12:51, Michel Jouvin wrote:
Hi,

I had to restart one of my OSD servers today and the problem showed up again. This time I managed to capture the "ceph health detail" output showing the problem with the 2 PGs:

[WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2 pgs down
    pg 56.1 is down, acting [208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
    pg 56.12 is down, acting [NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]

I still don't understand why, if I am supposed to survive a datacenter failure, I cannot survive 3 OSDs down on the same host hosting shards for the PG. In the second case only 2 OSDs are down, but I'm surprised they don't seem to be in the same "group" of OSDs (I'd have expected all the OSDs of one datacenter to be in the same group of 5, if the order given really reflects the allocation done)...
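
In case it is useful, this is roughly how I check which datacenter and host each OSD of the acting set maps to (rough sketch, assuming jq is installed and that `ceph osd find` reports the datacenter in its crush_location; the NONE placeholders obviously have to be skipped):

for osd in 208 65 73 206 197 193 144 155 178 182 183 133 17 36 230; do
    ceph osd find $osd | jq -r '"osd.\(.osd) -> \(.crush_location.datacenter)/\(.crush_location.host)"'
done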

Still interested in an explanation of what I'm doing wrong! Best regards,

Michel

On 03/05/2023 at 10:21, Eugen Block wrote:
I think I got it wrong with the locality setting. I'm still limited by the number of hosts I have available in my test cluster, but as far as I got with failure-domain=osd, I believe k=6, m=3, l=3 with locality=datacenter could fit your requirement, at least with regard to the recovery bandwidth usage between DCs; the resiliency would not match your requirement (one DC failure), though. That profile creates 3 groups of 4 chunks (3 data/coding chunks plus one local parity chunk) across three DCs, 12 chunks in total. The min_size=7 would not allow an entire DC to go down, I'm afraid; you'd have to reduce it to 6 to allow reads/writes in a disaster scenario. I'm still not sure if I got it right this time, but maybe you're better off without the LRC plugin given the limited number of hosts. Instead you could use the jerasure plugin with a profile like k=4 m=5, allowing an entire DC to fail without losing data access (we have one customer using that).
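
Just to sketch that alternative (untested here, profile/rule names are only examples): a jerasure profile with k=4 m=5 plus a crush rule placing 3 chunks per datacenter would look roughly like this:

ceph osd erasure-code-profile set jerasure-k4m5 plugin=jerasure k=4 m=5 \
    crush-failure-domain=host crush-device-class=hdd

# and, in the decompiled crushmap, a rule along these lines:
rule ec_3dc {
        id 7
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 3 type datacenter
        step chooseleaf indep 3 type host
        step emit
}

Losing one DC then means losing 3 of the 9 chunks, which m=5 tolerates, and with the usual min_size of k+1=5 the pool should stay writable on the remaining 6 chunks.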

Quoting Eugen Block <eblock@xxxxxx>:

Hi,

disclaimer: I haven't used LRC in a real setup yet, so there might be some misunderstandings on my side. But I tried to play around with one of my test clusters (Nautilus). Because I'm limited in the number of hosts (6 across 3 virtual DCs) I tried two different profiles with lower numbers to get a feeling for how that works.

# first attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=4 m=2 l=3 crush-failure-domain=host

For every l=3 chunks one additional local parity chunk is added, so 2 more chunks to store ==> 8 chunks in total. Since my failure domain is host and I only have 6 hosts, I get incomplete PGs.
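
If I read the LRC docs correctly, the total chunk count is k + m plus one local parity chunk per group of l, i.e. k + m + (k+m)/l: here 4 + 2 + 6/3 = 8, for the second attempt below 2 + 2 + 4/2 = 6, and for your k=9/m=6/l=5 profile 9 + 6 + 15/5 = 18.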

# second attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=2 m=2 l=2 crush-failure-domain=host

This gives me 6 chunks in total to store across 6 hosts which works:

ceph:~ # ceph pg ls-by-pool lrcpool
PG   OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE        SINCE VERSION REPORTED UP                    ACTING                SCRUB_STAMP                DEEP_SCRUB_STAMP
50.0       1        0         0       0   619           0          0   1 active+clean   72s 18410'1 18415:54 [27,13,0,2,25,7]p27   [27,13,0,2,25,7]p27   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
50.1       0        0         0       0     0           0          0   0 active+clean    6m     0'0 18414:26 [27,33,22,6,13,34]p27 [27,33,22,6,13,34]p27 2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
50.2       0        0         0       0     0           0          0   0 active+clean    6m     0'0 18413:25 [1,28,14,4,31,21]p1   [1,28,14,4,31,21]p1   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135
50.3       0        0         0       0     0           0          0   0 active+clean    6m     0'0 18413:24 [8,16,26,33,7,25]p8   [8,16,26,33,7,25]p8   2023-05-02 14:53:54.322135 2023-05-02 14:53:54.322135

After stopping all OSDs on one host I was still able to read and write into the pool, but after stopping a second host one PG from that pool went "down". That I don't fully understand yet, but I have just started to look into it. With your setup (12 hosts) I would recommend not utilizing all of them, so you have capacity to recover, let's say one "spare" host per DC, leaving 9 hosts in total. A profile with k=3 m=3 l=2 could make sense here, resulting in 9 total chunks (one additional local parity chunk for every two chunks), min_size 4. But as I wrote, it probably doesn't have the resiliency for a DC failure, so that needs some further investigation.
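
As an untested sketch of what I mean (profile and pool names are just examples):

ceph:~ # ceph osd erasure-code-profile set lrc-k3m3l2 plugin=lrc k=3 m=3 l=2 crush-failure-domain=host crush-locality=datacenter
ceph:~ # ceph osd pool create lrcpool2 32 32 erasure lrc-k3m3l2
ceph:~ # ceph osd pool get lrcpool2 min_size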

Regards,
Eugen

Quoting Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:

Hi,

No... our current setup is 3 datacenters with the same configuration, i.e. 1 mon/mgr + 4 OSD servers with 16 OSDs each, thus a total of 12 OSD servers. As with the LRC plugin k+m must be a multiple of l, I found that k=9/m=6/l=5 with crush-locality=datacenter was achieving my goal of being resilient to a datacenter failure. Because of this, I considered that lowering the crush failure domain to osd was not a major issue in my case (as it would not be worse than a datacenter failure if all the shards are on the same server in a datacenter) and it worked around the lack of hosts for k=9/m=6 (15 OSDs).

Maybe it helps if I give the erasure code profile used:

crush-device-class=hdd
crush-failure-domain=osd
crush-locality=datacenter
crush-root=default
k=9
l=5
m=6
plugin=lrc

The previously mentioned strange min_size for the pool created with this profile has vanished after the Quincy upgrade, as this parameter is no longer in the CRUSH map rule, and the `ceph osd pool get` command reports the expected number (10, i.e. k+1):

---------

ceph osd pool get fink-z1.rgw.buckets.data min_size
min_size: 10
--------

Cheers,

Michel

On 29/04/2023 at 20:36, Curt wrote:
Hello,

What is your current setup, 1 server per data center with 12 OSDs each? What are your current crush rule and LRC crush rule?


On Fri, Apr 28, 2023, 12:29 Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx> wrote:

  Hi,

  I think I found a possible cause of my PG down, but I still don't
  understand why.
  As explained in a previous mail, I set up a 15-chunk EC pool (k=9,
  m=6) but I have only 12 OSD servers in the cluster. To work around
  the problem I defined the failure domain as 'osd', with the
  reasoning that as I was using the LRC plugin, I had the guarantee
  that I could lose a site without impact, and thus the ability to
  lose 1 OSD server. Am I wrong?

  Best regards,

  Michel

  On 24/04/2023 at 13:24, Michel Jouvin wrote:
  > Hi,
  >
  > I'm still interested in getting feedback from those using the LRC
  > plugin about the right way to configure it... Last week I upgraded
  > from Pacific to Quincy (17.2.6) with cephadm, which does the
  > upgrade host by host, checking if an OSD is ok to stop before
  > actually upgrading it. I was surprised to see 1 or 2 PGs down at
  > some points during the upgrade (it happened not for all OSDs but
  > in every site/datacenter). Looking at the details with "ceph
  > health detail", I saw that for these PGs there were 3 OSDs down,
  > but I was expecting the pool to be resilient to 6 OSDs down (5 for
  > R/W access), so I'm wondering if there is something wrong in our
  > pool configuration (k=9, m=6, l=5).
  >
  > Cheers,
  >
  > Michel
  >
  > On 06/04/2023 at 08:51, Michel Jouvin wrote:
  >> Hi,
  >>
  >> Is somebody using the LRC plugin?
  >>
  >> I came to the conclusion that LRC k=9, m=3, l=4 is not the same
  >> as jerasure k=9, m=6 in terms of protection against failures, and
  >> that I should use k=9, m=6, l=5 to get a level of resilience >=
  >> jerasure k=9, m=6. The example in the documentation (k=4, m=2,
  >> l=3) suggests that this LRC configuration gives something better
  >> than jerasure k=4, m=2, as it is resilient to 3 drive failures
  >> (but not 4 if I understood properly). So how many drives can fail
  >> in the k=9, m=6, l=5 configuration, first without losing RW
  >> access and second without losing data?
  >>
  >> Another thing that I don't quite understand is that a pool
  >> created with this configuration (and failure domain=osd,
  >> locality=datacenter) has a min_size=3 (max_size=18 as expected).
  >> It seems wrong to me, I'd have expected something ~10 (depending
  >> on the answer to the previous question)...
  >>
  >> Thanks in advance if somebody could provide some sort of
  >> authoritative answer on these 2 questions. Best regards,
  >>
  >> Michel
  >>
  >>> On 04/04/2023 at 15:53, Michel Jouvin wrote:
  >>> Answering myself, I found the reason for 2147483647: it's
  >>> documented as a failure to find enough OSDs (missing OSDs). And
  >>> it is normal, as I selected different hosts for the 15 OSDs but
  >>> I have only 12 hosts!
  >>>
  >>> I'm still interested in an "expert" confirming that the LRC
  >>> k=9, m=3, l=4 configuration is equivalent, in terms of
  >>> redundancy, to a jerasure configuration with k=9, m=6.
  >>>
  >>> Michel
  >>>
  >>> On 04/04/2023 at 15:26, Michel Jouvin wrote:
  >>>> Hi,
  >>>>
  >>>> As discussed in another thread (Crushmap rule for
  >>>> multi-datacenter erasure coding), I'm trying to create an EC
  >>>> pool spanning 3 datacenters (the datacenters are present in the
  >>>> crushmap), with the objective of being resilient to 1 DC down,
  >>>> at least keeping read-only access to the pool and, if possible,
  >>>> read-write access, and having a storage efficiency better than
  >>>> 3 replicas (let's say a storage overhead <= 2).
  >>>>
  >>>> In the discussion, somebody mentioned the LRC plugin as a
  >>>> possible jerasure alternative to implement this without
  >>>> tweaking the crushmap rule to implement the 2-step OSD
  >>>> allocation. I looked at the documentation
  >>>> (https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/)
  >>>> but I have some questions, if someone has experience/expertise
  >>>> with this LRC plugin.
  >>>>
  >>>> I tried to create a rule for using 5 OSDs per datacenter (15 in
  >>>> total), with 3 per datacenter (9 in total) being data chunks
  >>>> and the others being coding chunks. For this, based on my
  >>>> understanding of the examples, I used k=9, m=3, l=4. Is that
  >>>> right? Is this configuration equivalent, in terms of
  >>>> redundancy, to a jerasure configuration with k=9, m=6?
  >>>>
  >>>> The resulting rule, which looks correct to me, is:
  >>>>
  >>>> --------
  >>>>
  >>>> {
  >>>>     "rule_id": 6,
  >>>>     "rule_name": "test_lrc_2",
  >>>>     "ruleset": 6,
  >>>>     "type": 3,
  >>>>     "min_size": 3,
  >>>>     "max_size": 15,
  >>>>     "steps": [
  >>>>         {
  >>>>             "op": "set_chooseleaf_tries",
  >>>>             "num": 5
  >>>>         },
  >>>>         {
  >>>>             "op": "set_choose_tries",
  >>>>             "num": 100
  >>>>         },
  >>>>         {
  >>>>             "op": "take",
  >>>>             "item": -4,
  >>>>             "item_name": "default~hdd"
  >>>>         },
  >>>>         {
  >>>>             "op": "choose_indep",
  >>>>             "num": 3,
  >>>>             "type": "datacenter"
  >>>>         },
  >>>>         {
  >>>>             "op": "chooseleaf_indep",
  >>>>             "num": 5,
  >>>>             "type": "host"
  >>>>         },
  >>>>         {
  >>>>             "op": "emit"
  >>>>         }
  >>>>     ]
  >>>> }
  >>>>
  >>>> ------------
  >>>>
  >>>> Unfortunately, it doesn't work as expected: a pool created with
  >>>> this rule ends up with its PGs active+undersized, which is
  >>>> unexpected to me. Looking at "ceph health detail" output, I see
  >>>> for each PG something like:
  >>>>
  >>>> pg 52.14 is stuck undersized for 27m, current state
  >>>> active+undersized, last acting
  >>>> [90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647]
  >>>> For each PG, there are 3 '2147483647' entries and I guess that
  >>>> is the reason for the problem. What are these entries about?
  >>>> Clearly they are not OSD entries... It looks like a negative
  >>>> number, -1, which in terms of crushmap IDs is the crushmap root
  >>>> (named "default" in our configuration). Any trivial mistake I
  >>>> might have made?
  >>>>
  >>>> Thanks in advance for any help or for sharing any successful
  >>>> configuration!
  >>>>
  >>>> Best regards,
  >>>>
  >>>> Michel





_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



