Hi Eugen,

Thank you very much for these detailed tests, which match what I observed and reported earlier. I'm happy to see that we have the same understanding of how it should work (based on the documentation). Is there any other way than this list to get in contact with the plugin developers, as they do not seem to follow this (very high volume) list? Or could somebody forward this email thread to one of them?
Help would be really appreciated.

Cheers,
Michel

On 19/06/2023 at 14:09, Eugen Block wrote:
Hi,

I have a real hardware cluster available for testing now. I'm not sure whether I'm completely misunderstanding how it's supposed to work or whether it's a bug in the LRC plugin. This cluster has 18 HDD nodes available across 3 rooms (or DCs); I intend to use 15 nodes so that I can still recover if one node fails. Given that I need one additional locality chunk per DC, I need a profile with k + m = 12. So I chose k=9, m=3, l=4, which creates 15 chunks in total across those 3 DCs, one chunk per host. I checked the chunk placement and it is correct. This is the profile I created:

ceph osd erasure-code-profile set lrc1 plugin=lrc k=9 m=3 l=4 crush-failure-domain=host crush-locality=room crush-device-class=hdd

I created a pool with only one PG to make the output more readable. This profile should allow the cluster to sustain the loss of three chunks, and the results are interesting. This is what I tested:

1. I stopped all OSDs on one host and the PG was still active with one missing chunk, everything's good.
2. Stopping a second host in the same DC resulted in the PG being marked as "down". That was unexpected, since with m=3 I expected the PG to still be active but degraded. Before test #3 I started all OSDs again to get the PG back to active+clean.
3. I stopped one host per DC, so in total 3 chunks were missing, and the PG was still active.

Apparently, this profile is able to sustain the loss of m chunks, but not the loss of an entire DC. My impression (which I also discussed with a colleague) is that either LRC in this implementation is designed only for losing single OSDs, which can then be recovered more quickly from fewer surviving OSDs while saving bandwidth, or this is a bug, because according to the low-level description [1] the algorithm works its way up in reverse order through the configured layers, as in this example (not matching my k, m, l requirements, just for reference):

chunk nr    01234567
step 1      _cDD_cDD
step 2      cDDD____
step 3      ____cDDD

So if a whole DC fails and the chunks from step 3 cannot be recovered, and maybe step 2 fails as well, step 1 still contains the actual k and m chunks, which should sustain the loss of an entire DC. My impression is that the algorithm somehow doesn't arrive at step 1, and therefore the PG stays down although there are enough surviving chunks. I'm not sure if my observations and conclusion are correct; I'd love to have a comment from the developers on this topic. But in this state I would not recommend using the LRC plugin when the resiliency requirement is to sustain the loss of an entire DC.

Thanks,
Eugen

[1] https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/#low-level-plugin-configuration
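For reference, the three steps quoted above correspond to the low-level example in [1]; a minimal sketch of how such layers can be declared explicitly instead of via k/m/l (the profile and pool names are placeholders, and the mapping/layers strings are the documented 8-chunk example, not the k=9/m=3/l=4 case tested here):

--------
ceph osd erasure-code-profile set LRCprofile \
    plugin=lrc \
    mapping=__DD__DD \
    layers='[
        [ "_cDD_cDD", "" ],
        [ "cDDD____", "" ],
        [ "____cDDD", "" ]
    ]'
# single-PG pool, as in the test above, to keep the output readable
ceph osd pool create lrcpool 1 1 erasure LRCprofile
--------

Declaring the layers by hand like this might help narrow down whether decoding really falls back to the first (global) layer when a whole locality group is lost.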
Quoting Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:

Hi,

I realize that the crushmap I attached to one of my emails, probably required to understand the discussion here, has been stripped by mailman. To avoid polluting the thread with a long output, I put it at https://box.in2p3.fr/index.php/s/J4fcm7orfNE87CX. Download it if you are interested.

Best regards,
Michel

On 21/05/2023 at 16:07, Michel Jouvin wrote:

Hi Eugen,

My LRC pool is also somewhat experimental, so nothing really urgent. If you manage to do some tests that help me understand the problem, I remain interested. I propose to keep this thread for that. I shared my crush map in the email you answered, if the attachment was not suppressed by mailman.

Cheers,
Michel

Sent from my mobile

On 18 May 2023 at 11:19:35, Eugen Block <eblock@xxxxxx> wrote:

Hi,

I don't have a good explanation for this yet, but I'll soon get the opportunity to play around with a decommissioned cluster. I'll try to get a better understanding of the LRC plugin, but it might take some time, especially since my vacation is coming up. :-) I have some thoughts about the down PGs with failure domain OSD, but I don't have anything to confirm it yet.

Quoting Curt <lightspd@xxxxxxxxx>:

Hi,

I've been following this thread with interest as it seems like a unique use case to expand my knowledge. I don't use LRC or anything outside basic erasure coding. What is your current crush steps rule? I know you made changes since your first post, and I had some thoughts I wanted to share, but I wanted to see your rule first so I could try to visualize the distribution better. The only way I can currently visualize it working is with more servers, I'm thinking 6 or 9 per data center minimum, but that could be my lack of knowledge on some of the step rules.

Thanks,
Curt

On Tue, May 16, 2023 at 11:09 AM Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx> wrote:

Hi Eugen,

Yes, sure, no problem to share it. I attach it to this email (as it may clutter the discussion if inline). If somebody on the list has some clue on the LRC plugin, I'm still interested in understanding what I'm doing wrong!

Cheers,
Michel

On 04/05/2023 at 15:07, Eugen Block wrote:

Hi,

I don't think you've shared your osd tree yet, could you do that? Apparently nobody else but us reads this thread, or nobody reading this uses the LRC plugin. ;-)

Thanks,
Eugen

Quoting Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:

Hi,

I had to restart one of my OSD servers today and the problem showed up again. This time I managed to capture "ceph health detail" output showing the problem with the 2 PGs:

[WRN] PG_AVAILABILITY: Reduced data availability: 2 pgs inactive, 2 pgs down
    pg 56.1 is down, acting [208,65,73,206,197,193,144,155,178,182,183,133,17,NONE,36,NONE,230,NONE]
    pg 56.12 is down, acting [NONE,236,28,228,218,NONE,215,117,203,213,204,115,136,181,171,162,137,128]

I still don't understand why, if I am supposed to survive a datacenter failure, I cannot survive 3 OSDs down on the same host hosting shards for the PG. In the second case it is only 2 OSDs down, but I'm surprised they don't seem to be in the same "group" of OSDs (I'd have expected all the OSDs of one datacenter to be in the same group of 5, if the order given really reflects the allocation done)...

Still interested in some explanation of what I'm doing wrong!

Best regards,
Michel

On 03/05/2023 at 10:21, Eugen Block wrote:

I think I got it wrong with the locality setting. I'm still limited by the number of hosts I have available in my test cluster, but as far as I got with failure-domain=osd, I believe k=6, m=3, l=3 with locality=datacenter could fit your requirement, at least with regard to the recovery bandwidth usage between DCs, but the resiliency would not match your requirement (one DC failure). That profile creates 3 groups of 4 chunks (3 data/coding chunks and one parity chunk) across three DCs, 12 chunks in total. The min_size=7 would not allow an entire DC to go down, I'm afraid; you'd have to reduce it to 6 to allow reads/writes in a disaster scenario. I'm still not sure if I got it right this time, but maybe you're better off without the LRC plugin, given the limited number of hosts. Instead you could use the jerasure plugin with a profile like k=4 m=5, allowing an entire DC to fail without losing data access (we have one customer using that).
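A rough sketch of that jerasure alternative, with a made-up profile name and assuming a custom CRUSH rule on top (the 2-step rule shown later in this thread, adjusted to 3 chunks per datacenter), since the plain profile alone does not enforce the per-DC spread:

--------
ceph osd erasure-code-profile set jerasure_k4m5 plugin=jerasure k=4 m=5 \
    crush-failure-domain=host crush-device-class=hdd
# placement would still need a rule along the lines of:
#   step choose indep 3 type datacenter
#   step chooseleaf indep 3 type host
# so each DC holds 3 of the 9 chunks and a full DC failure still
# leaves 6 chunks, which is >= k=4.
--------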
Quoting Eugen Block <eblock@xxxxxx>:

Hi,

Disclaimer: I haven't used LRC in a real setup yet, so there might be some misunderstandings on my side. But I tried to play around with one of my test clusters (Nautilus). Because I'm limited in the number of hosts (6 across 3 virtual DCs), I tried two different profiles with lower numbers to get a feeling for how it works.

# first attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=4 m=2 l=3 crush-failure-domain=host

For every third OSD one parity chunk is added, so 2 more chunks to store ==> 8 chunks in total. Since my failure domain is host and I only have 6, I get incomplete PGs.

# second attempt
ceph:~ # ceph osd erasure-code-profile set LRCprofile plugin=lrc k=2 m=2 l=2 crush-failure-domain=host

This gives me 6 chunks in total to store across 6 hosts, which works:

ceph:~ # ceph pg ls-by-pool lrcpool
PG   OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  STATE         SINCE  VERSION  REPORTED  UP                     ACTING                 SCRUB_STAMP                 DEEP_SCRUB_STAMP
50.0       1         0          0        0    619            0           0    1  active+clean    72s  18410'1  18415:54  [27,13,0,2,25,7]p27    [27,13,0,2,25,7]p27    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
50.1       0         0          0        0      0            0           0    0  active+clean     6m      0'0  18414:26  [27,33,22,6,13,34]p27  [27,33,22,6,13,34]p27  2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
50.2       0         0          0        0      0            0           0    0  active+clean     6m      0'0  18413:25  [1,28,14,4,31,21]p1    [1,28,14,4,31,21]p1    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135
50.3       0         0          0        0      0            0           0    0  active+clean     6m      0'0  18413:24  [8,16,26,33,7,25]p8    [8,16,26,33,7,25]p8    2023-05-02 14:53:54.322135  2023-05-02 14:53:54.322135

After stopping all OSDs on one host I was still able to read from and write to the pool, but after stopping a second host one PG from that pool went "down". I don't fully understand that yet, but I have just started to look into it.

With your setup (12 hosts) I would recommend not utilizing all of them, so you have capacity to recover, let's say one "spare" host per DC, leaving 9 hosts in total. A profile with k=3 m=3 l=2 could make sense here, resulting in 9 chunks in total (one more parity chunk for every other OSD), min_size 4. But as I wrote, it probably doesn't have the resiliency for a DC failure, so that needs some further investigation.

Regards,
Eugen
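A possible way to try that k=3/m=3/l=2 suggestion on 9 hosts (the profile and pool names are placeholders, and crush-locality=datacenter is an assumption added here, not part of the suggestion above):

--------
ceph osd erasure-code-profile set lrc_k3m3l2 plugin=lrc k=3 m=3 l=2 \
    crush-failure-domain=host crush-locality=datacenter
ceph osd pool create lrcpool_test 32 32 erasure lrc_k3m3l2
# verify that the 9 chunks land on 9 distinct hosts, 3 per datacenter
ceph pg ls-by-pool lrcpool_test
--------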
Quoting Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx>:

Hi,

No... our current setup is 3 datacenters with the same configuration, i.e. 1 mon/mgr + 4 OSD servers with 16 OSDs each, thus the total of 12 OSD servers. As k+m must be a multiple of l with the LRC plugin, I found that k=9/m=6/l=5 with crush-locality=datacenter was achieving my goal of being resilient to a datacenter failure. Because of this, I considered that lowering the crush failure domain to osd was not a major issue in my case (as it would not be worse than a datacenter failure if all the shards are on the same server in a datacenter) and was working around the lack of hosts for k=9/m=6 (15 OSDs).

Maybe it helps if I give the erasure code profile used:

crush-device-class=hdd
crush-failure-domain=osd
crush-locality=datacenter
crush-root=default
k=9
l=5
m=6
plugin=lrc

The previously mentioned strange number for min_size for the pool created with this profile has vanished after the Quincy upgrade, as this parameter is no longer in the CRUSH map rule, and the `ceph osd pool get` command reports the expected number (10):

--------
ceph osd pool get fink-z1.rgw.buckets.data min_size
min_size: 10
--------

Cheers,
Michel

On 29/04/2023 at 20:36, Curt wrote:

Hello,

What is your current setup, 1 server per data center with 12 OSDs each? What is your current crush rule and LRC crush rule?

On Fri, Apr 28, 2023, 12:29 Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx> wrote:

Hi,

I think I found a possible cause of my PG down, but I still don't understand why. As explained in a previous mail, I set up a 15-chunk/OSD EC pool (k=9, m=6) but I have only 12 OSD servers in the cluster. To work around the problem I defined the failure domain as 'osd', with the reasoning that as I was using the LRC plugin, I had the guarantee that I could lose a site without impact, thus the possibility to lose 1 OSD server. Am I wrong?

Best regards,
Michel

On 24/04/2023 at 13:24, Michel Jouvin wrote:
> Hi,
>
> I'm still interested in getting feedback from those using the LRC
> plugin about the right way to configure it... Last week I upgraded
> from Pacific to Quincy (17.2.6) with cephadm, which does the
> upgrade host by host, checking if an OSD is ok to stop before actually
> upgrading it. I had the surprise to see 1 or 2 PGs down at some points
> in the upgrade (it happened not for all OSDs, but for every
> site/datacenter). Looking at the details with "ceph health detail", I
> saw that for these PGs there were 3 OSDs down, but I was expecting the
> pool to be resilient to 6 OSDs down (5 for R/W access), so I'm
> wondering if there is something wrong in our pool configuration (k=9,
> m=6, l=5).
>
> Cheers,
>
> Michel
>
> On 06/04/2023 at 08:51, Michel Jouvin wrote:
>> Hi,
>>
>> Is somebody using the LRC plugin?
>>
>> I came to the conclusion that LRC k=9, m=3, l=4 is not the same as
>> jerasure k=9, m=6 in terms of protection against failures, and that I
>> should use k=9, m=6, l=5 to get a level of resilience >= jerasure
>> k=9, m=6. The example in the documentation (k=4, m=2, l=3) suggests
>> that this LRC configuration gives something better than jerasure k=4,
>> m=2, as it is resilient to 3 drive failures (but not 4, if I understood
>> properly). So how many drives can fail in the k=9, m=6, l=5
>> configuration, first without losing R/W access and second without
>> losing data?
>>
>> Another thing that I don't quite understand is that a pool created
>> with this configuration (and failure domain=osd, locality=datacenter)
>> has a min_size=3 (max_size=18 as expected). It seems wrong to me; I'd
>> have expected something ~10 (depending on the answer to the previous
>> question)...
>>
>> Thanks in advance if somebody could provide some sort of
>> authoritative answer on these 2 questions. Best regards,
>>
>> Michel
>>
>> On 04/04/2023 at 15:53, Michel Jouvin wrote:
>>> Answering to myself, I found the reason for 2147483647: it's
>>> documented as a failure to find enough OSDs (missing OSDs). And it is
>>> normal, as I selected different hosts for the 15 OSDs but I have only
>>> 12 hosts!
>>>
>>> I'm still interested in an "expert" confirming that the LRC k=9, m=3,
>>> l=4 configuration is equivalent, in terms of redundancy, to a
>>> jerasure configuration with k=9, m=6.
>>>
>>> Michel
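One way to check this kind of "2147483647 / not enough OSDs" situation offline (a sketch; it assumes the crushmap has been exported to a file, and uses rule id 6 with 15 chunks as in the rule quoted below):

--------
ceph osd getcrushmap -o crushmap.bin
# print every trial mapping for which CRUSH could not place all 15 chunks
crushtool -i crushmap.bin --test --rule 6 --num-rep 15 --show-bad-mappings
--------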
>>> On 04/04/2023 at 15:26, Michel Jouvin wrote:
>>>> Hi,
>>>>
>>>> As discussed in another thread (Crushmap rule for multi-datacenter
>>>> erasure coding), I'm trying to create an EC pool spanning 3
>>>> datacenters (datacenters are present in the crushmap), with the
>>>> objective to be resilient to 1 DC down, at least keeping
>>>> read-only access to the pool and if possible read-write access,
>>>> and to have a storage efficiency better than 3-replica (let's say a
>>>> storage overhead <= 2).
>>>>
>>>> In the discussion, somebody mentioned the LRC plugin as a possible
>>>> jerasure alternative to implement this without tweaking the
>>>> crushmap rule to implement the 2-step OSD allocation. I looked at
>>>> the documentation
>>>> (https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/)
>>>> but I have some questions, if someone has experience/expertise with
>>>> this LRC plugin.
>>>>
>>>> I tried to create a rule for using 5 OSDs per datacenter (15 in
>>>> total), with 3 per datacenter (9 in total) being data chunks and the
>>>> others being coding chunks. For this, based on my understanding of
>>>> the examples, I used k=9, m=3, l=4. Is it right? Is this configuration
>>>> equivalent, in terms of redundancy, to a jerasure configuration with
>>>> k=9, m=6?
>>>>
>>>> The resulting rule, which looks correct to me, is:
>>>>
>>>> --------
>>>>
>>>> {
>>>>     "rule_id": 6,
>>>>     "rule_name": "test_lrc_2",
>>>>     "ruleset": 6,
>>>>     "type": 3,
>>>>     "min_size": 3,
>>>>     "max_size": 15,
>>>>     "steps": [
>>>>         {
>>>>             "op": "set_chooseleaf_tries",
>>>>             "num": 5
>>>>         },
>>>>         {
>>>>             "op": "set_choose_tries",
>>>>             "num": 100
>>>>         },
>>>>         {
>>>>             "op": "take",
>>>>             "item": -4,
>>>>             "item_name": "default~hdd"
>>>>         },
>>>>         {
>>>>             "op": "choose_indep",
>>>>             "num": 3,
>>>>             "type": "datacenter"
>>>>         },
>>>>         {
>>>>             "op": "chooseleaf_indep",
>>>>             "num": 5,
>>>>             "type": "host"
>>>>         },
>>>>         {
>>>>             "op": "emit"
>>>>         }
>>>>     ]
>>>> }
>>>>
>>>> ------------
>>>>
>>>> Unfortunately, it doesn't work as expected: a pool created with
>>>> this rule ends up with its PGs active+undersized, which is
>>>> unexpected to me. Looking at `ceph health detail` output, I see
>>>> for each PG something like:
>>>>
>>>> pg 52.14 is stuck undersized for 27m, current state
>>>> active+undersized, last acting
>>>> [90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647]
>>>>
>>>> For each PG, there are 3 '2147483647' entries and I guess this is the
>>>> reason for the problem. What are these entries about? Clearly they are
>>>> not OSD entries... It looks like a negative number, -1, which in terms
>>>> of crushmap IDs is the crushmap root (named "default" in our
>>>> configuration). Any trivial mistake I would have made?
>>>>
>>>> Thanks in advance for any help, or for sharing any successful
>>>> configuration.
>>>>
>>>> Best regards,
>>>>
>>>> Michel
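For completeness, a hypothetical way (not from this thread) to see which host and datacenter each member of a PG's acting set sits in; pg 52.14 is the PG quoted above, and slots reported as 2147483647 (NONE) are missing chunks for which `ceph osd find` will simply fail:

--------
for osd in $(ceph pg map 52.14 -f json | jq -r '.acting[]'); do
    echo -n "osd.$osd -> "
    # crush_location lists the buckets (host, datacenter, ...) the OSD sits in
    ceph osd find "$osd" | jq -c '.crush_location'
done
--------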
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx