Re: Requests blocked in degraded erasure coded pool

Gregory Farnum <gfarnum@xxxxxxxxxx> · Wed, 07 Jun 2017 18:29:29 +0000

Whoops, sent that too early. Let me try again.

On Wed, Jun 7, 2017 at 3:24 AM Jonas Jaszkowic <jonasjaszkowic@xxxxxxxxxxxxxx> wrote:
Thank you for your feedback! Do you have more information on why at least k+1 nodes need to be active in order for the cluster to work at this point?

Actually, I misread your email and my earlier diagnosis was too specific. In your case, you've got a 2+3 EC pool and killed 3 OSDs.

Roughly:
We prevent PGs from going active (and serving writes or reads) when they have fewer than "min_size" OSDs participating. This is generally set so that we have enough redundancy to recover from at least one more OSD failing.

In your case, you have 2 OSDs left, and the failure of either one of them would result in the loss of all written data. So we don't let the PGs go active, as it's not safe.
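
The rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Ceph's actual peering code: the function name pg_status is made up for this example, and the default min_size = k + 1 is an assumption based on the behaviour described in tracker issue 18749.

```python
def pg_status(k, m, failed, min_size=None):
    """Classify a PG in a k+m erasure coded pool after `failed` OSDs are lost.

    Hypothetical helper for illustration only; it is not part of Ceph.
    """
    if min_size is None:
        # Assumed default: one chunk more than the bare minimum k.
        min_size = k + 1
    surviving = k + m - failed
    if surviving < k:
        return "data lost"   # fewer than k chunks left, nothing can be decoded
    if surviving < min_size:
        return "inactive"    # still decodable, but below min_size
    return "active"


# The 2+3 pool from this thread, losing 0 to 3 OSDs.
for failed in range(4):
    print(failed, pg_status(2, 3, failed))
```

With k=2, m=3 and three OSDs down, two chunks survive. The data is still decodable, but the PG stays inactive because one more failure would lose it.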

I am particularly interested in any material on the erasure coding implementations
in Ceph and how they work in depth. Sometimes the official documentation doesn’t
supply the needed information on problems beyond the point of a default cluster
setup. Is there any technical documentation on the implementation or something
similar?

http://docs.ceph.com/docs/master/dev/osd_internals/erasure_coding/ and the pages it links to
-Greg

Any help is appreciated.

Best regards, 
Jonas

Am 07.06.2017 um 08:00 schrieb Gregory Farnum <gfarnum@xxxxxxxxxx>:

On Tue, Jun 6, 2017 at 10:12 AM, Jonas Jaszkowic
<jonasjaszkowic@xxxxxxxxxxxxxx> wrote:
I set up a simple Ceph cluster with 5 OSD nodes and 1 monitor node. Each OSD
is on a different host.
The erasure coded pool has 64 PGs and an initial state of HEALTH_OK.

The goal is to deliberately break as many OSDs as possible up to the number
of coding chunks m in order to
evaluate the read performance when these chunks are missing. By definition
of Reed-Solomon coding, any
m chunks out of the n=k+m total chunks can be missing. To simulate the loss of
an OSD I’m doing the following:

ceph osd set noup
ceph osd down <ID>
ceph osd out <ID>
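
For killing several OSDs in one go, the three commands above could be generated by a small script. This is a minimal sketch: the helper name osd_kill_commands and the example OSD IDs are made up for illustration, and the script only prints the commands (a dry run) rather than running them.

```python
def osd_kill_commands(osd_ids):
    """Build the command sequence from the procedure above for several OSDs.

    Hypothetical helper; the OSD IDs passed in are placeholders.
    """
    # Set noup once so that OSDs marked down cannot rejoin the cluster.
    cmds = [["ceph", "osd", "set", "noup"]]
    for osd_id in osd_ids:
        cmds.append(["ceph", "osd", "down", str(osd_id)])
        cmds.append(["ceph", "osd", "out", str(osd_id)])
    return cmds


# Dry run: print the commands instead of executing them.
for cmd in osd_kill_commands([0, 1, 2]):
    print(" ".join(cmd))
```

To actually execute them, each list could be passed to subprocess.run(cmd, check=True) on a node with admin credentials.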

With the above procedure I should be able to kill up to m = 3 OSDs without
losing any data. However, when I kill m = 3 randomly selected OSDs,
all requests to the cluster are blocked and HEALTH_ERR is showing. The OSD
on which the requests are blocked is working properly and [in,up] in the
cluster.

My question: Why is it not possible to kill m = 3 OSDs and still operate the
cluster? Isn’t that equivalent to losing data which
shouldn’t happen in this particular configuration? Is my cluster set up
properly or am I missing something?

Sounds like http://tracker.ceph.com/issues/18749, which, yeah, we need
to fix that. By default, with a k+m EC code, it currently insists on
at least one chunk more than the minimum k to go active.
-Greg

Thank you for your help!

I have attached all relevant information about the cluster and status
outputs:

Erasure coding profile:

jerasure-per-chunk-alignment=false
k=2
m=3
plugin=jerasure
ruleset-failure-domain=host
ruleset-root=default
technique=reed_sol_van
w=8
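
As a rough illustration of what this profile implies, the sketch below parses the k and m values and derives the chunk count, the number of chunks that may be lost before data becomes undecodable, and the raw-to-logical storage ratio (k+m)/k. The function name summarize and the dictionary keys are invented for this example; only the key=value lines come from the profile above.

```python
PROFILE = """\
k=2
m=3
plugin=jerasure
technique=reed_sol_van
"""


def summarize(profile_text):
    """Derive basic numbers from an erasure coding profile (illustrative only)."""
    fields = dict(
        line.split("=", 1) for line in profile_text.splitlines() if "=" in line
    )
    k, m = int(fields["k"]), int(fields["m"])
    return {
        "chunks": k + m,                      # n = k + m chunks per object
        "max_lost_chunks": m,                 # any m chunks may be missing
        "raw_per_logical_byte": (k + m) / k,  # storage overhead factor
    }


print(summarize(PROFILE))
```

With five chunks per object and a failure domain of host, each PG in this 5-host cluster places one chunk on every host, so any OSD failure affects all 64 PGs.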

Content of ceph.conf:

[global]
fsid = 6353b831-22c3-424c-a8f1-495788e6b4e2
mon_initial_members = ip-172-31-27-142
mon_host = 172.31.27.142
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_pool_default_min_size = 2
osd_pool_default_size = 2
mon_allow_pool_delete = true

Crush rule:

rule ecpool {
ruleset 1
type erasure
min_size 2
max_size 5
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default
step chooseleaf indep 0 type host
step emit
}

Output of 'ceph -s' while cluster is degraded:

   cluster 6353b831-22c3-424c-a8f1-495788e6b4e2
    health HEALTH_ERR
           38 pgs are stuck inactive for more than 300 seconds
           26 pgs degraded
           38 pgs incomplete
           26 pgs stuck degraded
           38 pgs stuck inactive
           64 pgs stuck unclean
           26 pgs stuck undersized
           26 pgs undersized
           2 requests are blocked > 32 sec
           recovery 3/5 objects degraded (60.000%)
           recovery 1/5 objects misplaced (20.000%)
           noup flag(s) set
    monmap e2: 1 mons at {ip-172-31-27-142=172.31.27.142:6789/0}
           election epoch 6, quorum 0 ip-172-31-27-142
       mgr no daemons active
    osdmap e194: 5 osds: 2 up, 2 in; 64 remapped pgs
           flags noup,sortbitwise,require_jewel_osds,require_kraken_osds
     pgmap v970: 64 pgs, 1 pools, 592 bytes data, 1 objects
           79668 kB used, 22428 MB / 22505 MB avail
           3/5 objects degraded (60.000%)
           1/5 objects misplaced (20.000%)
                 38 incomplete
                 15 active+undersized+degraded
                 11 active+undersized+degraded+remapped

Output of 'ceph health' while cluster is degraded:

HEALTH_ERR 38 pgs are stuck inactive for more than 300 seconds; 26 pgs
degraded; 38 pgs incomplete; 26 pgs stuck degraded; 38 pgs stuck inactive;
64 pgs stuck unclean; 26 pgs stuck undersized; 26 pgs undersized; 2 requests
are blocked > 32 sec; recovery 3/5 objects degraded (60.000%); recovery 1/5
objects misplaced (20.000%); noup flag(s) set

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
