Re: Need some advice about Pools and Erasure Coding

Hi,

On 4/29/19 11:19 AM, Rainer Krienke wrote:
> I am planning to set up a ceph cluster and already implemented a test
> cluster where we are going to use RBD images for data storage (9
> hosts, each host has 16 OSDs, each OSD 4 TB).
> We would like to use erasure-coded (EC) pools here, and so all OSDs
> are bluestore. Since several projects are going to store data on this
> ceph cluster, I think it would make sense to use several EC pools for
> separation of the projects and access control.

> Now I have some questions I hope someone can help me with:
>
> - Do I still (nautilus) need two pools for EC-based RBD images, one
> EC data pool and a second replicated pool for metadata?
AFAIK EC pools cannot store the (omap) metadata at all, so you probably still need a separate replicated pool for the image metadata and attach the EC pool as the data pool; see the sketch below.
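A minimal sketch of such a two-pool setup (pool, profile and image
names are placeholders, the pg_num values are just examples):

  # EC profile; failure domain "host" keeps shards on distinct hosts
  ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host

  # EC data pool; RBD needs partial overwrites (bluestore only)
  ceph osd pool create rbd_data 128 128 erasure ec42
  ceph osd pool set rbd_data allow_ec_overwrites true

  # small replicated pool that holds the image headers / metadata
  ceph osd pool create rbd_meta 32 32 replicated
  ceph osd pool application enable rbd_meta rbd
  ceph osd pool application enable rbd_data rbd

  # the image lives in the replicated pool, its data objects in the EC pool
  rbd create --size 10G --data-pool rbd_data rbd_meta/testimage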

> - If I do need two pools for RBD images and I want to separate the
> data of different projects by using different pools with EC coding,
> then how should I handle the metadata pool, which probably contains
> only a small amount of data compared to the data pool? Does it make
> sense to have *one* replicated metadata pool (e.g. the default rbd
> pool) for all projects and one EC pool per project, or would it be
> better to create one replicated and one EC pool for each project?

An alternative concept is RADOS namespaces: each project uses its own namespace in a single replicated pool, and access is restricted via cephx caps. Whether this works in your setup depends on the clients and whether they support namespaces (RBD itself gained namespace support in Nautilus).
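A rough sketch of the namespace variant (pool, namespace and client
names are placeholders):

  # one namespace per project inside a shared pool
  rbd namespace create --pool rbd --namespace project-a

  # cephx key that is confined to that namespace
  ceph auth get-or-create client.project-a \
      mon 'profile rbd' \
      osd 'profile rbd pool=rbd namespace=project-a'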

On the other hand, the PG autoscaler in Nautilus can keep the number of PGs low, so additional replicated pools won't hurt as much as they did pre-Nautilus.
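Enabling the autoscaler is basically a one-liner per pool in Nautilus
(pool name again just a placeholder):

  ceph mgr module enable pg_autoscaler
  ceph osd pool set rbd_meta pg_autoscale_mode on
  ceph osd pool autoscale-status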


> - I also thought about the different k+m settings for an EC pool, for
> example k=4, m=2 compared to k=8, m=2. Both settings allow for two
> OSDs to fail without any data loss, but I asked myself which of the
> two settings would be more performant? On the one hand, distributing
> data to more OSDs allows more parallel access to the data, which
> should result in faster access. On the other hand, each OSD has a
> latency until it can deliver its data shard. So is there a
> recommendation which of my two k+m examples should be preferred?

I cannot comment on speed (interesting question, since we are about to set up a new cluster, too)... but I wouldn't use k=8, m=2 in a setup with only 9 hosts. You should have at least k+m+m hosts to handle host failures gracefully, so with nine hosts even k=6, m=2 might (and will) be a problem; see the numbers below.
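Spelled out (assuming crush-failure-domain=host, i.e. every shard of a
PG lands on a different host):

  # rule of thumb: hosts >= k + m (to place shards) + m (spares for recovery)
  #   k=8, m=2 -> 12 hosts;  k=6, m=2 -> 10 hosts;  k=4, m=2 -> 8 hosts
  # so on 9 hosts only something like k=4, m=2 (the ec42 profile above) fits
  ceph osd erasure-code-profile get ec42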


Regards,

Burkhard




