Re: incomplete PG for erasure coding pool after OSD failure

Hi Anton,

With erasure coding, the min_size of a pool (the minimum number of shards/replicas needed to allow IO) is K+1, in your case 4, so a single OSD failure already triggers an IO freeze (because k=3 m=1). If you had 5 equal hosts, Ceph 'should' get back to HEALTH_OK automatically (backfilling/recovering from the 3 remaining OSDs while client IO stays frozen). But I think that because your fifth node is not equal in weight, Ceph can't find any valid place to recover/backfill to, leaving you with incomplete/inactive PGs.
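
If you want to double-check this on your cluster, something along these lines should show the pool's min_size and its EC profile (I'm assuming the pool is called 'ecpool' here; substitute your own pool and profile names):

    ceph osd pool get ecpool min_size
    ceph osd pool get ecpool erasure_code_profile
    ceph osd erasure-code-profile get myprofile

With k=3 and min_size=4 the pool stops serving IO as soon as a single shard is unavailable, which matches what you are seeing.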

You will probably need to add another OSD to each node so CRUSH can find a place to store the data.
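
To see exactly why the PGs are stuck, querying one of the incomplete PGs usually shows which OSDs the PG wants and why peering can't complete (1.2f is just an example id; take a real one from 'ceph health detail'):

    ceph health detail
    ceph pg dump_stuck inactive
    ceph pg 1.2f query

In the query output, the 'recovery_state' section (fields like 'probing_osds' and 'down_osds_we_would_probe') shows what CRUSH is still looking for.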

Another option is the approach in the following blog post on using erasure coding on small clusters:

http://cephnotes.ksperis.com/blog/2017/01/27/erasure-code-on-small-clusters/
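
The gist of that post is a CRUSH rule that first picks hosts and then more than one OSD within each host, so all k+m shards can be placed even when there are fewer hosts than shards. A rough, untested sketch (adapt 'chooseleaf indep 2' and the min/max sizes to your own k+m and host count):

    rule ecpool_rule {
            id 1
            type erasure
            min_size 3
            max_size 6
            step set_chooseleaf_tries 5
            step set_choose_tries 100
            step take default
            step choose indep 0 type host
            step chooseleaf indep 2 type osd
            step emit
    }

Keep in mind that with such a rule a single host failure can take out multiple shards at once, so choose k and m accordingly.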

Kind regards,
Caspar




Kind regards,

Caspar Smit
System Engineer
SuperNAS
Dorsvlegelstraat 13
1445 PA Purmerend

t: (+31) 299 410 414
e: casparsmit@xxxxxxxxxxx
w: www.supernas.eu

2018-06-26 16:12 GMT+02:00 Anton Aleksandrov <anton@xxxxxxxxxxxxxx>:

Hello,

We have a small cluster, initially on 4 hosts (1 OSD per host, 8 TB each), with erasure coding for the data pool (k=3 m=1).
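
The pool was created roughly like this (profile and pool names are from memory, not exact):

    ceph osd erasure-code-profile set ec31 k=3 m=1 crush-failure-domain=host
    ceph osd pool create ecdata 64 64 erasure ec31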

After some time I added one more small host (1 OSD, 2 TB). Ceph synced fine.

Then I powered off one of the first 8 TB hosts and terminated it. I also removed it from the crush map, basically simulating that the OSD had died. But no matter what, Ceph stays in HEALTH_WARN and reports an incomplete PG, reduced data availability, PGs inactive and incomplete, and also slow requests (even though we are not writing there right now).
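
By "removed from crush map" I mean roughly the standard removal sequence (osd.3 as an example id):

    ceph osd out osd.3
    ceph osd crush remove osd.3
    ceph auth del osd.3
    ceph osd rm osd.3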

Used disk space is small, just several gigabytes. With this test scenario I would expect Ceph to recalculate the missing data from the removed OSD and become healthy again after some time.

This did not happen automatically. Is there any special command for this? Is there any specific procedure to recalculate the data?

We are testing on Luminous with BlueStore and CephFS.

Anton.

