Re: Can't create erasure coded pools with k+m greater than hosts?

Frank Schilder <frans@xxxxxx> · Thu, 24 Oct 2019 07:24:13 +0000

I have some experience with an EC set-up with 2 shards per host and failure-domain host, and also with some multi-site wishful thinking from users. What I learned is the following:

1) Avoid this work-around for too few hosts (a custom EC rule) at all costs. There are two types of resiliency in Ceph. One is against hardware failures and the other is against admin mistakes. Using a non-standard crush set-up to accommodate a lack of hosts dramatically reduces resiliency against admin mistakes. You will have downtime due to simple mistakes. You will also need to adjust other defaults, like min_size, to be able to do anything on this cluster without downtime, sweating every time and praying that nothing goes wrong. Use this only if there is a short-term horizon after which the work-around will no longer be needed.

2) Do not use EC 2+1. It does not offer anything interesting for production. Use 4+2 (or 8+2, 8+3 if you have the hosts). With 4+2 you can operate with non-zero redundancy while doing maintenance (min_size=5).

3) If you have no prospect of getting at least 7 servers in the long run (4+2=6 for the EC profile, +1 for automatic fail-over rebuild), do not go for EC. If this helps in your negotiations, tell everyone that they either give you more servers now and get low-cost storage, or have to pay for expensive replicated storage forever.

4) Before you start thinking about replicating to a second site, you should have a primary site running solidly first. I was in exactly the same situation, with people expecting wonders while giving me only half of what I needed. Simply do not do it. I wasted a lot of time on impossible requests. With the hardware you have, I would ditch the second DC and rather start building up a solid first DC, to be mirrored later when people hand over bags of money. You have 6 servers. That's a good start for a 4+2 EC pool. You will not have fail-over capacity, but at least you don't have to work around too many exceptions. The one you should be aware of, though, is this one: https://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-pg/?highlight=erasure%20code%20pgs#crush-gives-up-too-soon . If you had 7 servers, you would be out of trouble.
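
The numbers in points 2) and 3) can be reproduced with a small Python sketch. It assumes min_size = k + 1 (which matches the min_size=5 quoted for 4+2) and one spare host for automatic rebuild; the function name and the dictionary keys are hypothetical and only serve the illustration.

```python
def ec_summary(k: int, m: int, spare_hosts: int = 1) -> dict:
    """Summarise an EC profile with k data shards and m coding shards."""
    min_size = k + 1  # assumption: min_size is set to k + 1
    return {
        "profile": f"{k}+{m}",
        "min_size": min_size,
        # shards that may be offline while the pool still accepts IO
        "shards_down_with_io": (k + m) - min_size,
        # hosts needed for one shard per host (failure-domain=host)
        "hosts_one_shard_each": k + m,
        # the same, plus spare host(s) to rebuild onto after a host failure
        "hosts_with_rebuild_spare": k + m + spare_hosts,
    }


for k, m in [(2, 1), (4, 2), (8, 2), (8, 3)]:
    print(ec_summary(k, m))
```

For 4+2 this gives min_size 5, one shard that can be offline during maintenance, and 6 hosts (7 with a rebuild spare). For 2+1 no shard can be offline without blocking IO.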

This is collected from my experience. I would do things differently now, and maybe it helps you with deciding how to proceed. It's basically about what resources you can expect in the foreseeable future and what compromises you are willing to make with regard to sleep and sanity.

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Salsa <salsa@xxxxxxxxxxxxxx>
Sent: 21 October 2019 17:31
To: Martin Verges
Cc: ceph-users
Subject: Re:  Can't create erasure coded pools with k+m greater than hosts?

Just to clarify my situation: we have 2 datacenters with 3 hosts each and 12 4TB disks per host (2 are in RAID with the OS installed and the remaining 10 are used for Ceph). Right now I'm trying a single-DC installation and intend to migrate to multi-site, mirroring DC1 to DC2, so if we lose DC1 we can activate DC2. (NOTE: I have no idea how this is set up and have not planned it at all; I thought of getting DC1 to work first and setting up the mirroring later.)

I don't think I'll be able to change the setup in any way, so my next question is: Should I go with a replica 3 or would an erasure 2,1 be ok?

There's a very small chance we get 2 extra hosts for each DC in the near future, but we'll probably use all the available storage space even sooner.

We're trying to use as much space as possible.
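
For orientation, the usable capacity of the two options can be estimated from the figures above (3 hosts, 10 Ceph disks each, 4 TB per disk). This is a rough Python sketch that ignores filesystem overhead, full ratios and the fact that disk vendors count in TB while Ceph reports TiB; the function names are hypothetical.

```python
RAW_TB = 3 * 10 * 4  # 3 hosts x 10 Ceph disks x 4 TB = 120 TB raw


def usable_replicated(raw_tb: float, size: int) -> float:
    """Usable space of a replicated pool keeping `size` copies."""
    return raw_tb / size


def usable_ec(raw_tb: float, k: int, m: int) -> float:
    """Usable space of an EC pool with k data and m coding shards."""
    return raw_tb * k / (k + m)


print(usable_replicated(RAW_TB, 3))  # 40.0
print(usable_ec(RAW_TB, 2, 1))       # 80.0
print(usable_ec(RAW_TB, 4, 2))       # 80.0
```

EC 2+1 and EC 4+2 have the same space efficiency; they differ in how many failures they tolerate and in how many hosts they need.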

Thanks;

--
Salsa

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Monday, October 21, 2019 2:53 AM, Martin Verges <martin.verges@xxxxxxxx> wrote:

Just don't do such setups for production. It will be a lot of pain and trouble, and it will cause you problems.

Just take a cheap system, put some of the disks in it, and do a way, way better deployment than something like 4+2 on 3 hosts. Whatever you do with that cluster (for example a kernel update, reboot, PSU failure, ...) causes all attached clients to stop any IO or even crash completely, which is especially bad with VMs on that Ceph cluster.
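
A rough way to see why such a layout stops IO is to count the shards that survive a host outage. The Python sketch below assumes shards are spread evenly over the hosts (2 per host for 4+2 on 3 hosts) and that min_size is k + 1 unless overridden; the function names and state strings are hypothetical simplifications, not Ceph's own PG states.

```python
def shards_left(k: int, m: int, hosts: int, hosts_down: int = 1) -> int:
    """Shards still reachable when shards are spread evenly over hosts."""
    total = k + m
    if total % hosts != 0:
        raise ValueError("sketch assumes an even spread of shards over hosts")
    per_host = total // hosts
    return total - per_host * hosts_down


def pg_state(k, m, hosts, hosts_down=1, min_size=None):
    """Simplified state of a placement group after losing hosts."""
    if min_size is None:
        min_size = k + 1  # assumption: min_size = k + 1
    left = shards_left(k, m, hosts, hosts_down)
    if left < k:
        return "data unavailable"
    if left < min_size:
        return "IO blocked (below min_size)"
    return "active"


print(pg_state(4, 2, hosts=3))              # IO blocked (below min_size)
print(pg_state(4, 2, hosts=6))              # active
print(pg_state(4, 2, hosts=3, min_size=4))  # active
```

With 3 hosts, a single host going down removes 2 of the 6 shards, so only k=4 remain. Lowering min_size to 4 keeps IO running, but then every write happens with zero redundancy.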

--
Martin Verges

On Sat, 19 Oct 2019 at 01:51, Chris Taylor <ctaylor@xxxxxxxxxx> wrote:
Full disclosure - I have not created an erasure code pool yet!

I have been wanting to do the same thing that you are attempting and
have these links saved. I believe this is what you are looking for.

This link is for decompiling the CRUSH rules and recompiling:

https://docs.ceph.com/docs/luminous/rados/operations/crush-map-edits/
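
As a rough sketch of the decompile/edit/recompile cycle that page describes, the Python snippet below only builds and prints the command lines; it does not run them. The file names are hypothetical, and the commands should be checked against the documentation for your Ceph release before use.

```python
import shlex
from typing import List


def crush_edit_commands(workdir: str = ".") -> List[List[str]]:
    """Return the command lines for a manual CRUSH map edit cycle."""
    binary = f"{workdir}/crushmap.bin"    # hypothetical file names
    text = f"{workdir}/crushmap.txt"
    rebuilt = f"{workdir}/crushmap.new"
    return [
        ["ceph", "osd", "getcrushmap", "-o", binary],   # export compiled map
        ["crushtool", "-d", binary, "-o", text],        # decompile to text
        # ... edit the text file by hand at this point ...
        ["crushtool", "-c", text, "-o", rebuilt],       # recompile
        ["ceph", "osd", "setcrushmap", "-i", rebuilt],  # inject the new map
    ]


for argv in crush_edit_commands():
    print(shlex.join(argv))
```

Rules are edited as part of the whole map in this workflow, rather than being created from a single dumped rule file.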

This link is for creating the EC rules for 4+2 with only 3 hosts:

https://ceph.io/planet/erasure-code-on-small-clusters/
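
For illustration only, a rule of the kind discussed in this thread (pick 3 hosts, then 2 OSDs on each) might look like the text rendered by the Python sketch below. The rule name, id, tries values and min_size/max_size fields are assumptions and vary between Ceph releases, so compare the result with a rule decompiled from your own cluster rather than pasting it in as-is.

```python
def render_ec_rule(name: str, rule_id: int, hosts: int, osds_per_host: int,
                   root: str = "default") -> str:
    """Render CRUSH rule text that places osds_per_host shards on each host."""
    total = hosts * osds_per_host
    lines = [
        f"rule {name} {{",
        f"    id {rule_id}",
        "    type erasure",
        "    min_size 3",            # assumed; field is absent in newer releases
        f"    max_size {total}",     # assumed; field is absent in newer releases
        "    step set_chooseleaf_tries 5",
        "    step set_choose_tries 100",
        f"    step take {root}",
        f"    step choose indep {hosts} type host",
        f"    step choose indep {osds_per_host} type osd",
        "    step emit",
        "}",
    ]
    return "\n".join(lines)


print(render_ec_rule("ec4x2_3hosts", rule_id=2, hosts=3, osds_per_host=2))
```

The two `choose indep` steps are what spread 6 shards as 2 per host over 3 hosts, which is also why a single host outage takes away 2 shards at once.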

I hope that helps!

Chris

On 2019-10-18 2:55 pm, Salsa wrote:
> Ok, I'm lost here.
>
> How am I supposed to write a crush rule?
>
> So far I managed to run:
>
> #ceph osd crush rule dump test -o test.txt
>
> So I can edit the rule. Now I have two problems:
>
> 1. What are the functions and operations to use here? Is there
> documentation anywhere about this?
> 2. How may I create a crush rule using this file? 'ceph osd crush rule
> create ... -i test.txt' does not work.
>
> Am I taking the wrong approach here?
>
>
> --
> Salsa
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Friday, October 18, 2019 3:56 PM, Paul Emmerich
> <paul.emmerich@xxxxxxxx> wrote:
>
>> Default failure domain in Ceph is "host" (see ec profile), i.e., you
>> need at least k+m hosts (but at least k+m+1 is better for production
>> setups).
>> You can change that to OSD, but that's not a good idea for a
>> production setup for obvious reasons. It's slightly better to write a
>> crush rule that explicitly picks two disks on 3 different hosts
>>
>> Paul
>>
>> On Fri, Oct 18, 2019 at 8:45 PM Salsa <salsa@xxxxxxxxxxxxxx> wrote:
>>
>> > I have probably misunderstood how to create erasure coded pools, so I may need some theory and would appreciate it if you could point me to documentation that may clarify my doubts.
>> > I have so far 1 cluster with 3 hosts and 30 OSDs (10 each host).
>> > I tried to create an erasure code profile like so:
>> > "
>> > # ceph osd erasure-code-profile get ec4x2rs
>> > crush-device-class=
>> > crush-failure-domain=host
>> > crush-root=default
>> > jerasure-per-chunk-alignment=false
>> > k=4
>> > m=2
>> > plugin=jerasure
>> > technique=reed_sol_van
>> > w=8
>> > "
>> > If I create a pool using this profile, or any profile where K+M > hosts, then the pool gets stuck.
>> > "
>> > # ceph -s
>> >   cluster:
>> >     id:     eb4aea44-0c63-4202-b826-e16ea60ed54d
>> >     health: HEALTH_WARN
>> >             Reduced data availability: 16 pgs inactive, 16 pgs incomplete
>> >             2 pools have too many placement groups
>> >             too few PGs per OSD (4 < min 30)
>> >
>> >   services:
>> >     mon: 3 daemons, quorum ceph01,ceph02,ceph03 (age 11d)
>> >     mgr: ceph01(active, since 74m), standbys: ceph03, ceph02
>> >     osd: 30 osds: 30 up (since 2w), 30 in (since 2w)
>> >
>> >   data:
>> >     pools:   11 pools, 32 pgs
>> >     objects: 0 objects, 0 B
>> >     usage:   32 GiB used, 109 TiB / 109 TiB avail
>> >     pgs:     50.000% pgs not active
>> >              16 active+clean
>> >              16 creating+incomplete
>> >
>> > # ceph osd pool ls
>> > test_ec
>> > test_ec2
>> > "
>> > The pool will never leave this "creating+incomplete" state.
>> > The pools were created like this:
>> > "
>> > # ceph osd pool create test_ec2 16 16 erasure ec4x2rs
>> > # ceph osd pool create test_ec 16 16 erasure
>> > "
>> > The default profile pool is created correctly.
>> > My profiles are like this:
>> > "
>> > # ceph osd erasure-code-profile get default
>> > k=2
>> > m=1
>> > plugin=jerasure
>> > technique=reed_sol_van
>> >
>> > # ceph osd erasure-code-profile get ec4x2rs
>> > crush-device-class=
>> > crush-failure-domain=host
>> > crush-root=default
>> > jerasure-per-chunk-alignment=false
>> > k=4
>> > m=2
>> > plugin=jerasure
>> > technique=reed_sol_van
>> > w=8
>> > "
>> > From what I've read, it seems to be possible to create erasure coded pools with K+M higher than the number of hosts. Is this not so?
>> > What am I doing wrong? Do I have to create any special crush map rule?
>> > --
>> > Salsa
>> >
>> > ceph-users mailing list
>> > ceph-users@xxxxxxxxxxxxxx
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com