Hello Josh,

I simulated the osd.14 failure with the following steps:

1. hot unplug the disk
2. systemctl stop ceph-osd@14
3. ceph osd out 14

The CRUSH rule used to create the EC8+3 pool is shown below:

# ceph osd crush rule dump erasure_hdd_mhosts
{
    "rule_id": 8,
    "rule_name": "erasure_hdd_mhosts",
    "ruleset": 8,
    "type": 3,
    "min_size": 1,
    "max_size": 16,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "chooseleaf_indep",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

The output of `ceph osd tree` is also attached:

[~] # ceph osd tree
ID  CLASS WEIGHT   TYPE NAME           STATUS REWEIGHT PRI-AFF
 -1       32.36148 root default
-13        2.69679     host jceph-n01
  0   hdd  0.89893         osd.0           up  1.00000 1.00000
  1   hdd  0.89893         osd.1           up  1.00000 1.00000
  2   hdd  0.89893         osd.2           up  1.00000 1.00000
-17        2.69679     host jceph-n02
  3   hdd  0.89893         osd.3           up  1.00000 1.00000
  4   hdd  0.89893         osd.4           up  1.00000 1.00000
  5   hdd  0.89893         osd.5           up  1.00000 1.00000
-21        2.69679     host jceph-n03
  6   hdd  0.89893         osd.6           up  1.00000 1.00000
  7   hdd  0.89893         osd.7           up  1.00000 1.00000
  8   hdd  0.89893         osd.8           up  1.00000 1.00000
-25        2.69679     host jceph-n04
  9   hdd  0.89893         osd.9           up  1.00000 1.00000
 10   hdd  0.89893         osd.10          up  1.00000 1.00000
 11   hdd  0.89893         osd.11          up  1.00000 1.00000
-29        2.69679     host jceph-n05
 12   hdd  0.89893         osd.12          up  1.00000 1.00000
 13   hdd  0.89893         osd.13          up  1.00000 1.00000
 14   hdd  0.89893         osd.14          up  1.00000 1.00000
-33        2.69679     host jceph-n06
 15   hdd  0.89893         osd.15          up  1.00000 1.00000
 16   hdd  0.89893         osd.16          up  1.00000 1.00000
 17   hdd  0.89893         osd.17          up  1.00000 1.00000
-37        2.69679     host jceph-n07
 18   hdd  0.89893         osd.18          up  1.00000 1.00000
 19   hdd  0.89893         osd.19          up  1.00000 1.00000
 20   hdd  0.89893         osd.20          up  1.00000 1.00000
-41        2.69679     host jceph-n08
 21   hdd  0.89893         osd.21          up  1.00000 1.00000
 22   hdd  0.89893         osd.22          up  1.00000 1.00000
 23   hdd  0.89893         osd.23          up  1.00000 1.00000
-45        2.69679     host jceph-n09
 24   hdd  0.89893         osd.24          up  1.00000 1.00000
 25   hdd  0.89893         osd.25          up  1.00000 1.00000
 26   hdd  0.89893         osd.26          up  1.00000 1.00000
-49        2.69679     host jceph-n10
 27   hdd  0.89893         osd.27          up  1.00000 1.00000
 28   hdd  0.89893         osd.28          up  1.00000 1.00000
 29   hdd  0.89893         osd.29          up  1.00000 1.00000
-53        2.69679     host jceph-n11
 30   hdd  0.89893         osd.30          up  1.00000 1.00000
 31   hdd  0.89893         osd.31          up  1.00000 1.00000
 32   hdd  0.89893         osd.32          up  1.00000 1.00000
-57        2.69679     host jceph-n12
 33   hdd  0.89893         osd.33          up  1.00000 1.00000
 34   hdd  0.89893         osd.34          up  1.00000 1.00000
 35   hdd  0.89893         osd.35          up  1.00000 1.00000
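The rule can also be replayed offline with crushtool to double-check
that every mapping spreads the 11 chunks across 11 different hosts.
Something along these lines (crushmap.bin is just a scratch file name):

# ceph osd getcrushmap -o crushmap.bin
# crushtool -i crushmap.bin --test --rule 8 --num-rep 11 --show-mappings

Each emitted mapping should list 11 OSDs belonging to 11 distinct
hosts; adding --show-bad-mappings should print only the inputs for
which the rule cannot produce a complete mapping.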
Thanks for your help.

- Jerry

On Fri, 23 Jul 2021 at 22:40, Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx> wrote:
>
> Hi Jerry,
>
> In general, your CRUSH rules should define the behaviour you're
> looking for. Based on what you've stated about your configuration,
> after failing a single node or an OSD on a single node, you should
> still be able to tolerate two more failures in the system without
> losing data (or losing access to data, given that min_size=k, though
> I believe it's recommended to set min_size=k+1).
>
> However, that sequence of acting sets doesn't make a whole lot of
> sense to me for a single OSD failure (though perhaps I'm misreading
> them). Can you clarify exactly how you simulated the osd.14 failure?
> It might also be helpful to post your CRUSH rule and "ceph osd tree".
>
> Josh
>
> On Fri, Jul 23, 2021 at 1:42 AM Jerry Lee <leisurelysw24@xxxxxxxxx> wrote:
> >
> > Hello,
> >
> > I would like to know the maximum number of node failures for an
> > EC8+3 pool in a 12-node cluster with 3 OSDs in each node. The size
> > and min_size of the EC8+3 pool are configured as 11 and 8, and the
> > OSDs of each PG are selected by host. When there is no node
> > failure, the maximum number of node failures is 3, right? After
> > unplugging an OSD (osd.14) in the cluster, I checked the PG acting
> > set changes, and one of the results is shown below:
> >
> > T0:
> > [15,31,11,34,28,1,8,26,14,19,5]
> >
> > T1: after osd.14 was unplugged and recovery started
> > [15,31,11,34,28,1,8,26,NONE,19,5]
> >
> > T2:
> > [15,31,11,34,21,1,8,26,19,29,5]
> >
> > T3:
> > [15,31,11,34,NONE,1,8,26,NONE,NONE,5]
> >
> > T4: recovery was done
> > [15,31,11,34,21,1,8,26,19,29,5]
> >
> > For this PG, 3 OSD peers changed during the recovery process
> > ([_,_,_,_,28->21,_,_,_,14->19,19->29,_]). It seems that only
> > min_size (8) chunks of the EC8+3 pool are kept during recovery.
> > Does it mean that no more node failures can be tolerated between
> > T3 and T4? Can we calculate the maximum number of node failures by
> > examining all the acting sets of the PGs? Is there some simple way
> > to obtain such information? Any ideas and feedback are appreciated,
> > thanks!
> >
> > - Jerry
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
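A short footnote on the question above about reading the remaining
failure tolerance off the PG acting sets: newer Ceph releases can be
asked directly whether a given set of OSDs could be stopped without
reducing immediate data availability, which gives much the same answer
without parsing acting sets by hand. A rough sketch (the OSD ids and
pool name are only examples/placeholders):

# ceph osd ok-to-stop 15 16 17
# ceph pg ls-by-pool <ec-pool-name> undersized

The first command asks whether the three OSDs of jceph-n06 (i.e. one
more whole host) could go down right now without, as far as I
understand it, any PG dropping below min_size; the second lists the
PGs of the pool that are currently undersized, whose acting sets are
the ones to watch during recovery.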