Re: Is there any way to obtain the maximum number of node failure in ceph without data loss?

Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx> · Mon, 26 Jul 2021 16:22:29 -0600

Hi Jerry,

I think this is one of those "there must be something else going on
here" situations; marking any OSD out should affect only that one
"slot" in the acting set, at least until backfill completes (and in my
experience that has always been the case). It might be worth inspecting the
cluster log on your mons to see if any additional OSDs are flapping
(going down briefly) during this process, as that could cause them to
drop out of the acting set until backfills complete.
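
On your question of whether the acting sets can tell you how much
margin is left: a minimal sketch in Python, using the T0 to T2 acting
sets from your latest experiment. The helper names (surviving_chunks,
failure_margin) are hypothetical and not part of any Ceph tooling, and
None stands in for the NONE entries. It only counts slots that still
map to an OSD; it says nothing about whether the chunks still exist on
OSDs that dropped out of the acting set.

```python
from typing import Optional, Sequence

K = 8  # data chunks (k=8 in the quoted jec_8_3 profile)

def surviving_chunks(acting: Sequence[Optional[int]]) -> int:
    """Count the slots of an acting set that still map to an OSD."""
    return sum(1 for osd in acting if osd is not None)

def failure_margin(acting: Sequence[Optional[int]], k: int = K) -> int:
    """Slots that can still be lost before fewer than k remain.

    A negative value means the acting set is already short of k.
    """
    return surviving_chunks(acting) - k

NONE = None  # stand-in for the NONE entries in the quoted acting sets
t0 = [3, 9, 10, 5, 16, 14, 8, 11, 2, 4, 15]
t1 = [NONE, NONE, 10, 5, 16, 14, 8, 11, 2, 4, NONE]
t2 = [NONE, NONE, 10, 5, 16, 14, 8, NONE, 2, 4, NONE]

for name, acting in (("T0", t0), ("T1", t1), ("T2", t2)):
    print(name, surviving_chunks(acting), failure_margin(acting))
```

With those inputs the margin is 3 at T0, 0 at T1 and -1 at T2, which
lines up with the PG going down at T2.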

There is quite a bit of shuffling of data going on when you fail an
OSD, and that might just be because of the width of your EC profile
given your cluster size and CRUSH rules (I believe that the
'chooseleaf' bit there is involved with that reshuffling, since EC
chunks will be moved around across hosts when an OSD is marked out in
your current configuration).
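
To see how much of that shuffling touches a given PG, a small sketch
(Python again; changed_slots is a hypothetical helper, not a Ceph
command) that compares two acting sets slot by slot, using the T0 and
T4 sets from your first mail:

```python
def changed_slots(before, after):
    """Return (slot, old_osd, new_osd) for every position whose OSD differs."""
    if len(before) != len(after):
        raise ValueError("acting sets must have the same width")
    return [
        (slot, old, new)
        for slot, (old, new) in enumerate(zip(before, after))
        if old != new
    ]

t0 = [15, 31, 11, 34, 28, 1, 8, 26, 14, 19, 5]
t4 = [15, 31, 11, 34, 21, 1, 8, 26, 19, 29, 5]

moves = changed_slots(t0, t4)
for slot, old, new in moves:
    print(f"slot {slot}: osd.{old} -> osd.{new}")
```

For that PG it reports three changed slots (28->21, 14->19, 19->29),
the same ones you listed, even though only osd.14 failed.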

Unfortunately, that's probably as far as I can help you, both because
I'm close to the limit of my understanding of CRUSH rules in this sort
of configuration and because I'll be OOO for a bit soon. :) Hopefully
others can chime in with further ideas.

Josh

On Mon, Jul 26, 2021 at 2:45 AM Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
>
> After doing more experiments, the outcomes answer some of my questions:
>
> The environment is somewhat different from the one described in my
> previous mail.
> 1) the `ceph osd tree` output
>  -2         2.06516  root perf_osd
>  -5         0.67868      host jceph-n2-perf_osd
>   2    ssd  0.17331          osd.2                          up  1.00000  1.00000
>   3    ssd  0.15875          osd.3                          up  1.00000  1.00000
>   4    ssd  0.17331          osd.4                          up  1.00000  1.00000
>   5    ssd  0.17331          osd.5                          up  1.00000  1.00000
> -25         0.69324      host Jceph-n1-perf_osd
>   8    ssd  0.17331          osd.8                          up  1.00000  1.00000
>   9    ssd  0.17331          osd.9                          up  1.00000  1.00000
>  10    ssd  0.17331          osd.10                         up  1.00000  1.00000
>  11    ssd  0.17331          osd.11                         up  1.00000  1.00000
> -37         0.69324      host Jceph-n3-perf_osd
>  14    ssd  0.17331          osd.14                         up  1.00000  1.00000
>  15    ssd  0.17331          osd.15                         up  1.00000  1.00000
>  16    ssd  0.17331          osd.16                         up  1.00000  1.00000
>  17    ssd  0.17331          osd.17                         up  1.00000  1.00000
>
> 2) the CRUSH rule used for the EC8+3 pool, which now selects OSDs by
> 'osd' instead of by host.
> # ceph osd crush rule dump erasure_ruleset_by_osd
> {
>     "rule_id": 9,
>     "rule_name": "erasure_ruleset_by_osd",
>     "ruleset": 9,
>     "type": 3,
>     "min_size": 1,
>     "max_size": 16,
>     "steps": [
>         {
>             "op": "take",
>             "item": -2,
>             "item_name": "perf_osd"
>         },
>         {
>             "op": "choose_indep",
>             "num": 0,
>             "type": "osd"
>         },
>         {
>             "op": "emit"
>         }
>     ]
> }
>
> 3) the erasure-code-profile used to create the EC8+3 pool (min_size = 8)
> # ceph osd erasure-code-profile get jec_8_3
> crush-device-class=ssd
> crush-failure-domain=osd
> crush-root=perf_ssd
> k=8
> m=3
> plugin=isa
> technique=reed_sol_van
>
> The following is the sequence of acting sets after taking only 2 OSDs out:
>
> T0:
> [3,9,10,5,16,14,8,11,2,4,15]
>
> T1: after issuing `ceph osd out 17`
> [NONE,NONE,10,5,16,14,8,11,2,4,NONE]
> state of this PG: "active+recovery_wait+undersized+degraded+remapped"
>
> T2: before recovery finishes, issuing `ceph osd out 11`
> [NONE,NONE,10,5,16,14,8,NONE,2,4,NONE]
> state of this PG: "down+remapped"
> comment: "not enough up instances of this PG to go active"
>
> With only 2 OSDs out, a PG of the EC8+3 pool enters the "down+remapped"
> state.  So it seems that the min_size of an erasure-coded K+M pool
> should be set to K+1, which ensures that the data stays intact even if
> one more OSD fails during recovery, although the pool may not
> serve IO.
>
> Any feedback and ideas are welcome and appreciated!
>
> - Jerry
>
> On Mon, 26 Jul 2021 at 11:33, Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
> >
> > Hello Josh,
> >
> > I simulated the osd.14 failure by the following steps:
> >    1. hot unplug the disk
> >    2. systemctl stop ceph-osd@14
> >    3. ceph osd out 14
> >
> > The CRUSH rule used to create the EC8+3 pool is shown below:
> > # ceph osd crush rule dump erasure_hdd_mhosts
> > {
> >     "rule_id": 8,
> >     "rule_name": "erasure_hdd_mhosts",
> >     "ruleset": 8,
> >     "type": 3,
> >     "min_size": 1,
> >     "max_size": 16,
> >     "steps": [
> >         {
> >             "op": "take",
> >             "item": -1,
> >             "item_name": "default"
> >         },
> >         {
> >             "op": "chooseleaf_indep",
> >             "num": 0,
> >             "type": "host"
> >         },
> >         {
> >             "op": "emit"
> >         }
> >     ]
> > }
> >
> > And the output of `ceph osd tree` is also attached:
> > [~] # ceph osd tree
> > ID   CLASS  WEIGHT   TYPE NAME                          STATUS  REWEIGHT  PRI-AFF
> >  -1        32.36148  root default
> > -13         2.69679      host jceph-n01
> >   0    hdd  0.89893          osd.0                          up  1.00000  1.00000
> >   1    hdd  0.89893          osd.1                          up  1.00000  1.00000
> >   2    hdd  0.89893          osd.2                          up  1.00000  1.00000
> > -17         2.69679      host jceph-n02
> >   3    hdd  0.89893          osd.3                          up  1.00000  1.00000
> >   4    hdd  0.89893          osd.4                          up  1.00000  1.00000
> >   5    hdd  0.89893          osd.5                          up  1.00000  1.00000
> > -21         2.69679      host jceph-n03
> >   6    hdd  0.89893          osd.6                          up  1.00000  1.00000
> >   7    hdd  0.89893          osd.7                          up  1.00000  1.00000
> >   8    hdd  0.89893          osd.8                          up  1.00000  1.00000
> > -25         2.69679      host jceph-n04
> >   9    hdd  0.89893          osd.9                          up  1.00000  1.00000
> >  10    hdd  0.89893          osd.10                         up  1.00000  1.00000
> >  11    hdd  0.89893          osd.11                         up  1.00000  1.00000
> > -29         2.69679      host jceph-n05
> >  12    hdd  0.89893          osd.12                         up  1.00000  1.00000
> >  13    hdd  0.89893          osd.13                         up  1.00000  1.00000
> >  14    hdd  0.89893          osd.14                         up  1.00000  1.00000
> > -33         2.69679      host jceph-n06
> >  15    hdd  0.89893          osd.15                         up  1.00000  1.00000
> >  16    hdd  0.89893          osd.16                         up  1.00000  1.00000
> >  17    hdd  0.89893          osd.17                         up  1.00000  1.00000
> > -37         2.69679      host jceph-n07
> >  18    hdd  0.89893          osd.18                         up  1.00000  1.00000
> >  19    hdd  0.89893          osd.19                         up  1.00000  1.00000
> >  20    hdd  0.89893          osd.20                         up  1.00000  1.00000
> > -41         2.69679      host jceph-n08
> >  21    hdd  0.89893          osd.21                         up  1.00000  1.00000
> >  22    hdd  0.89893          osd.22                         up  1.00000  1.00000
> >  23    hdd  0.89893          osd.23                         up  1.00000  1.00000
> > -45         2.69679      host jceph-n09
> >  24    hdd  0.89893          osd.24                         up  1.00000  1.00000
> >  25    hdd  0.89893          osd.25                         up  1.00000  1.00000
> >  26    hdd  0.89893          osd.26                         up  1.00000  1.00000
> > -49         2.69679      host jceph-n10
> >  27    hdd  0.89893          osd.27                         up  1.00000  1.00000
> >  28    hdd  0.89893          osd.28                         up  1.00000  1.00000
> >  29    hdd  0.89893          osd.29                         up  1.00000  1.00000
> > -53         2.69679      host jceph-n11
> >  30    hdd  0.89893          osd.30                         up  1.00000  1.00000
> >  31    hdd  0.89893          osd.31                         up  1.00000  1.00000
> >  32    hdd  0.89893          osd.32                         up  1.00000  1.00000
> > -57         2.69679      host jceph-n12
> >  33    hdd  0.89893          osd.33                         up  1.00000  1.00000
> >  34    hdd  0.89893          osd.34                         up  1.00000  1.00000
> >  35    hdd  0.89893          osd.35                         up  1.00000  1.00000
> >
> > Thanks for your help.
> >
> > - Jerry
> >
> > On Fri, 23 Jul 2021 at 22:40, Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx> wrote:
> > >
> > > Hi Jerry,
> > >
> > > In general, your CRUSH rules should define the behaviour you're
> > > looking for. Based on what you've stated about your configuration,
> > > after failing a single node or an OSD on a single node, then you
> > > should still be able to tolerate two more failures in the system
> > > without losing data (or losing access to data, given that min_size=k,
> > > though I believe it's recommended to set min_size=k+1).
> > >
> > > However, that sequence of acting sets doesn't make a whole lot of
> > > sense to me for a single OSD failure (though perhaps I'm misreading
> > > them). Can you clarify exactly how you simulated the osd.14 failure?
> > > It might also be helpful to post your CRUSH rule and "ceph osd tree".
> > >
> > > Josh
> > >
> > > On Fri, Jul 23, 2021 at 1:42 AM Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
> > > >
> > > > Hello,
> > > >
> > > > > I would like to know the maximum number of node failures for an EC8+3
> > > > > pool in a 12-node cluster with 3 OSDs in each node.  The size and
> > > > > min_size of the EC8+3 pool are configured as 11 and 8, and the OSDs of
> > > > > each PG are selected by host.  When there is no node failure, the
> > > > > maximum number of tolerable node failures is 3, right?  After
> > > > > unplugging an OSD (osd.14) in the cluster, I checked how the PG acting
> > > > > sets changed, and one of the results is shown below:
> > > >
> > > > T0:
> > > > [15,31,11,34,28,1,8,26,14,19,5]
> > > >
> > > > > T1: after unplugging an OSD (osd.14) and recovery started
> > > > [15,31,11,34,28,1,8,26,NONE,19,5]
> > > >
> > > > T2:
> > > > [15,31,11,34,21,1,8,26,19,29,5]
> > > >
> > > > T3:
> > > > [15,31,11,34,NONE,1,8,26,NONE,NONE,5]
> > > >
> > > > T4: recovery was done
> > > > [15,31,11,34,21,1,8,26,19,29,5]
> > > >
> > > > > For this PG, 3 OSD peers changed during the recovery process
> > > > > ([_,_,_,_,28->21,_,_,_,14->19,19->29,_]).  It seems that at least
> > > > > min_size (8) chunks of the EC8+3 pool are kept during recovery.  Does
> > > > > it mean that no more node failures can be tolerated from T3 to T4?
> > > > > Can we calculate the maximum number of node failures by examining all
> > > > > the acting sets of the PGs?  Is there some simple way to obtain such
> > > > > information?  Any ideas and feedback are appreciated, thanks!
> > > >
> > > > - Jerry
> > > > _______________________________________________
> > > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > > To unsubscribe send an email to ceph-users-leave@xxxxxxx