Re: incomplete PG for erasure coding pool after OSD failure

Hi Anton,

With erasure coding, the min_size of a pool (the minimum number of shards/replicas needed to allow IO) is K+1, in your case 4, so a single OSD failure already triggers an IO freeze (because k=3 m=1). If you had 5 equal hosts, Ceph 'should' get back to HEALTH_OK automatically (backfilling/recovering from the 3 remaining OSDs while client IO stays frozen). But I think that because your fifth node is not equal in weight, Ceph can't find any valid place to recover/backfill to, leaving you with incomplete/inactive PGs.
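
If you want to double-check this on your cluster, something along these lines should show the pool's min_size and its EC profile (I'm assuming the pool is called 'ecpool' here; substitute your own pool and profile names):

    ceph osd pool get ecpool min_size
    ceph osd pool get ecpool erasure_code_profile
    ceph osd erasure-code-profile get myprofile

With k=3 and min_size=4 the pool stops serving IO as soon as a single shard is unavailable, which matches what you are seeing.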

You will probably need to add another OSD to each node so CRUSH can find a place to store the data.
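
To see exactly why the PGs are stuck, querying one of the incomplete PGs usually shows which OSDs the PG wants and why peering can't complete (1.2f is just an example id; take a real one from 'ceph health detail'):

    ceph health detail
    ceph pg dump_stuck inactive
    ceph pg 1.2f query

In the query output, the 'recovery_state' section (fields like 'probing_osds' and 'down_osds_we_would_probe') shows what CRUSH is still looking for.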

Another option is the approach in the following blog post on using erasure coding on small clusters:

http://cephnotes.ksperis.com/blog/2017/01/27/erasure-code-on-small-clusters/
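
The gist of that post is a CRUSH rule that first picks hosts and then more than one OSD within each host, so all k+m shards can be placed even when there are fewer hosts than shards. A rough, untested sketch (adapt 'chooseleaf indep 2' and the min/max sizes to your own k+m and host count):

    rule ecpool_rule {
            id 1
            type erasure
            min_size 3
            max_size 6
            step set_chooseleaf_tries 5
            step set_choose_tries 100
            step take default
            step choose indep 0 type host
            step chooseleaf indep 2 type osd
            step emit
    }

Keep in mind that with such a rule a single host failure can take out multiple shards at once, so choose k and m accordingly.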

Kind regards,
Caspar




Kind regards,

Caspar Smit
System Engineer
SuperNAS
Dorsvlegelstraat 13
1445 PA Purmerend

t: (+31) 299 410 414
e: casparsmit@xxxxxxxxxxx
w: www.supernas.eu

2018-06-26 16:12 GMT+02:00 Anton Aleksandrov <anton@xxxxxxxxxxxxxx>:

Hello,

We have a small cluster, initially on 4 hosts (1 OSD per host, 8 TB each), with erasure coding for the data pool (k=3 m=1).
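
The pool was created roughly like this (profile and pool names are from memory, not exact):

    ceph osd erasure-code-profile set ec31 k=3 m=1 crush-failure-domain=host
    ceph osd pool create ecdata 64 64 erasure ec31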

After some time I added one more small host (1 OSD, 2 TB). Ceph synced fine.

Then I powered off one of the first 8 TB hosts and terminated it. I also removed it from the crush map, basically simulating that the OSD had died. But no matter what, Ceph stays in HEALTH_WARN and reports an incomplete PG, reduced data availability, PGs inactive and incomplete, and also slow requests (even though we are not writing there right now).
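
By "removed from crush map" I mean roughly the standard removal sequence (osd.3 as an example id):

    ceph osd out osd.3
    ceph osd crush remove osd.3
    ceph auth del osd.3
    ceph osd rm osd.3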

Used disk space is small, just several gigabytes. With this test scenario I would expect Ceph to recalculate the missing data from the removed OSD and become healthy again after some time.

This did not happen automatically. Is there any special command for this? Is there any specific procedure to recalculate the data?

We are testing on Luminous with BlueStore and CephFS.

Anton.

