Hello,
I have created a small 16-PG EC pool with k=4, m=2.
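For context, the pool was created along these lines (the profile and
pool names here are just placeholders, not necessarily what I typed):

    ceph osd erasure-code-profile set ec_4_2 k=4 m=2
    ceph osd pool create ecpool 16 16 erasure ec_4_2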
Then I applied the following CRUSH rule to it:
rule test_ec {
        id 99
        type erasure
        min_size 5
        max_size 6
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step choose indep 3 type host
        step chooseleaf indep 2 type osd
        step emit
}
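For completeness, the rule was injected and can be sanity-checked
offline roughly like this (sketched from memory; the pool name is a
placeholder):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # add the rule above to crushmap.txt, then recompile and inject it
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new
    ceph osd pool set <poolname> crush_rule test_ec
    # simulate the placements the rule produces for a 6-shard pool
    crushtool -i crushmap.new --test --rule 99 --num-rep 6 --show-mappings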
The OSD tree looks as follows:
ID  CLASS WEIGHT   TYPE NAME                               STATUS REWEIGHT PRI-AFF
 -1       43.38448 root default
 -9       43.38448     region lab1
 -7       43.38448         room dc1.lab1
 -5       43.38448             rack r1.dc1.lab1
 -3       14.44896                 host host1.r1.dc1.lab1
  6   hdd  3.63689                     osd.6                  up  1.00000 1.00000
  8   hdd  3.63689                     osd.8                  up  1.00000 1.00000
  7   hdd  3.63689                     osd.7                  up  1.00000 1.00000
 11   hdd  3.53830                     osd.11                 up  1.00000 1.00000
-11       14.44896                 host host2.r1.dc1.lab1
  4   hdd  3.63689                     osd.4                  up  1.00000 1.00000
  9   hdd  3.63689                     osd.9                  up  1.00000 1.00000
  5   hdd  3.63689                     osd.5                  up  1.00000 1.00000
 10   hdd  3.53830                     osd.10                 up  1.00000 1.00000
-13       14.48656                 host host3.r1.dc1.lab1
  0   hdd  3.57590                     osd.0                  up  1.00000 1.00000
  1   hdd  3.63689                     osd.1                  up  1.00000 1.00000
  2   hdd  3.63689                     osd.2                  up  1.00000 1.00000
  3   hdd  3.63689                     osd.3                  up  1.00000 1.00000
My expectation was that each host would hold exactly two shards of
every PG in the pool. When I dumped the PGs this was mostly true, but
one group has shards on OSDs 3, 0 and 2, which all sit on host3. That
puts three of its six shards on a single host, and since k=4, m=2 only
tolerates the loss of two shards, a failure of host3 would make that
PG unavailable.
root@host1:~/mkw # ceph pg dump|grep "^66\."|awk '{print $17}'
dumped all
[4,5,7,6,1,2]
[8,11,9,3,0,2]   <<< this one is problematic
[6,7,10,9,2,0]
[2,3,7,6,5,9]
[7,8,10,5,3,1]
[4,5,8,6,0,2]
[7,11,9,4,1,2]
[5,9,0,2,7,11]
[9,5,3,1,7,8]
[8,11,2,0,5,9]
[2,0,8,6,10,9]
[3,2,5,9,7,11]
[6,7,9,5,1,2]
[10,5,1,3,11,8]
[4,5,7,8,2,0]
[7,8,3,2,9,10]
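To confirm which host each shard of the problematic PG lives on, a
quick loop like this should do (assuming jq is available):

    for osd in 8 11 9 3 0 2; do
        echo -n "osd.$osd -> "
        ceph osd find $osd | jq -r '.crush_location.host'
    done

Consistent with the tree above, that shows host1 twice, host2 once and
host3 three times.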
Is there a way to ensure that a host failure is not disruptive to the cluster?
During the experiment I used info from this thread:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030227.html
Kind regards,
Maks Kowalik