EC pool 4+2 - failed to guarantee a failure domain

Hello,

I have created a small 16-PG EC pool with k=4, m=2.
Then I applied the following crush rule to it:

rule test_ec {
        id 99
        type erasure
        min_size 5
        max_size 6
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step choose indep 3 type host
        step chooseleaf indep 2 type osd
        step emit
}
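
(For reference, the placements produced by this rule should also be reproducible offline with crushtool; this is only a sketch, assuming rule id 99 as above and an arbitrary output file name:)

# export the current crush map and simulate placements for rule 99 with 6 shards
ceph osd getcrushmap -o /tmp/crushmap
crushtool -i /tmp/crushmap --test --rule 99 --num-rep 6 --show-mappings
# --show-bad-mappings only reports sets that could not be filled at all,
# so the per-host distribution still has to be inspected by hand
crushtool -i /tmp/crushmap --test --rule 99 --num-rep 6 --show-bad-mappings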

The OSD tree looks as follows:
-1       43.38448 root default
 -9       43.38448     region lab1
 -7       43.38448         room dc1.lab1
 -5       43.38448             rack r1.dc1.lab1
 -3       14.44896                 host host1.r1.dc1.lab1
  6   hdd  3.63689                     osd.6     up  1.00000 1.00000
  8   hdd  3.63689                     osd.8     up  1.00000 1.00000
  7   hdd  3.63689                     osd.7     up  1.00000 1.00000
 11   hdd  3.53830                     osd.11    up  1.00000 1.00000
-11       14.44896                 host host2.r1.dc1.lab1
  4   hdd  3.63689                     osd.4     up  1.00000 1.00000
  9   hdd  3.63689                     osd.9     up  1.00000 1.00000
  5   hdd  3.63689                     osd.5     up  1.00000 1.00000
 10   hdd  3.53830                     osd.10    up  1.00000 1.00000
-13       14.48656                 host host3.r1.dc1.lab1
  0   hdd  3.57590                     osd.0     up  1.00000 1.00000
  1   hdd  3.63689                     osd.1     up  1.00000 1.00000
  2   hdd  3.63689                     osd.2     up  1.00000 1.00000
  3   hdd  3.63689                     osd.3     up  1.00000 1.00000

My expectation was that each host would hold exactly 2 shards of every PG of the pool.

When I dumped the PGs this was true for all but one group: it has three shards placed on OSDs 0, 2 and 3, which all belong to host3. Since the pool only tolerates the loss of m=2 shards, a host3 failure would take that PG down (a quick per-host check is sketched after the listing below).
root@host1:~/mkw # ceph pg dump|grep "^66\."|awk '{print $17}'
dumped all
[4,5,7,6,1,2]

[8,11,9,3,0,2]  <<< - this one is problematic

[6,7,10,9,2,0]
[2,3,7,6,5,9]
[7,8,10,5,3,1]
[4,5,8,6,0,2]
[7,11,9,4,1,2]
[5,9,0,2,7,11]
[9,5,3,1,7,8]
[8,11,2,0,5,9]
[2,0,8,6,10,9]
[3,2,5,9,7,11]
[6,7,9,5,1,2]
[10,5,1,3,11,8]
[4,5,7,8,2,0]
[7,8,3,2,9,10]
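
(To make such violations easier to spot, a rough check along these lines should count shards per host for every up set. It reuses the column-17 pipeline from above and assumes that "ceph osd find" reports the host under crush_location and that jq is available:)

ceph pg dump 2>/dev/null | grep "^66\." | awk '{print $17}' | while read upset; do
    echo "$upset" | tr -d '[]' | tr ',' '\n' | while read osd; do
        # assumption: crush_location.host is populated in the "ceph osd find" output
        ceph osd find "$osd" | jq -r '.crush_location.host'
    done | sort | uniq -c | awk -v s="$upset" '$1 > 2 {print s, "has", $1, "shards on", $2}'
done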

Is there a way to ensure that a host failure is not disruptive to the cluster?

During the experiment I used info from this thread:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030227.html

Kind regards,

Maks Kowalik
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
