Expanding PGs of an erasure-coded pool


 



On May 28, 2014, at 5:31 AM, Gregory Farnum <greg at inktank.com> wrote:

> On Sun, May 25, 2014 at 6:24 PM, Guang Yang <yguang11 at yahoo.com> wrote:
>> On May 21, 2014, at 1:33 AM, Gregory Farnum <greg at inktank.com> wrote:
>> 
>>> This failure means the messenger subsystem is trying to create a
>>> thread and is getting an error code back, probably due to a process
>>> or system thread limit that you can turn up with ulimit.
>>> 
>>> This is happening because a replicated PG primary needs a connection
>>> to only its replicas (generally 1 or 2 connections), but with an
>>> erasure-coded PG the primary requires a connection to m+n-1 replicas
>>> (everybody who's in the erasure-coding set, including itself). Right
>>> now our messenger requires a thread for each connection, so kerblam.
>>> (And it actually requires a couple such connections because we have
>>> separate heartbeat, cluster data, and client data systems.)
>> Hi Greg,
>> Is there any plan to refactor the messenger component to reduce the number of threads? For example, by using an event-driven model.
> 
> We've discussed it in very broad terms, but there are no concrete
> designs and it's not on the schedule yet. If anybody has conclusive
> evidence that it's causing them trouble they can't work around, that
> would be good to know!
Thanks for the response!

We used to have a cluster in which each OSD host had 11 disks (daemons); on each host there were around 15K threads. The system was stable, but during cluster-wide changes (e.g. OSD down/out, recovery) we observed the system load increasing, though there was no cascading failure.

Most recently we have been evaluating Ceph on high-density hardware with each OSD host having 33 disks (daemons); on each host there are around 40K-50K threads. With some OSD hosts going down/out, we started seeing the load increase sharply along with a large volume of thread creation and joining.
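
As a rough back-of-envelope check (a sketch only; the per-connection thread count, messenger count, and EC profile below are assumptions for illustration, not measurements), the fan-out difference Greg described looks roughly like this:

# Back-of-envelope comparison of per-PG connection fan-out, following Greg's
# description: a replicated primary talks only to its replicas, an EC primary
# talks to every other member of the erasure-coding set, and each connection
# costs threads in each of the separate heartbeat/cluster/client messengers.
# All parameters are illustrative assumptions, not measured values.

def primary_connections(pool_size):
    """Connections a PG primary holds to the other members of its acting set."""
    return pool_size - 1

REPLICATED_SIZE = 3      # a size=3 replicated pool
EC_K, EC_M = 8, 3        # an 8+3 erasure-coded pool (assumed profile)
MESSENGERS = 3           # heartbeat, cluster data, client data
THREADS_PER_CONN = 2     # assume a reader and a writer thread per connection

for name, size in [("replicated 3x", REPLICATED_SIZE), ("EC 8+3", EC_K + EC_M)]:
    conns = primary_connections(size)
    print(f"{name}: {conns} peer connections, "
          f"~{conns * MESSENGERS * THREADS_PER_CONN} threads per primary PG "
          f"(before connection sharing between PGs on the same OSD pair)")

The per-PG fan-out multiplier here is roughly (k+m-1)/(size-1), before accounting for connection sharing between PGs that land on the same pair of OSDs.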

We don't have strong evidence that the messenger thread model is the problem, nor of how an event-driven approach would help, but I think that as we move to high-density hardware (for cost-saving purposes), the issue could be amplified.
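
For reference, the limits Greg mentioned and the actual thread usage on an OSD host can be inspected roughly like this (a minimal sketch, assuming a Linux host with Python 3; the ceph-osd process name and the /proc paths are the usual defaults):

# Minimal sketch: inspect the thread-related limits that `ulimit -u` and the
# kernel impose, and count the threads the ceph-osd daemons currently use.
import resource
from pathlib import Path

# Per-user/process limit adjusted by `ulimit -u`.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"RLIMIT_NPROC soft={soft} hard={hard}")

# System-wide ceilings on the number of tasks/threads.
for knob in ("threads-max", "pid_max"):
    print(knob, Path(f"/proc/sys/kernel/{knob}").read_text().strip())

# Total threads across all ceph-osd processes, read from /proc/<pid>/status.
total = 0
for status in Path("/proc").glob("[0-9]*/status"):
    try:
        fields = dict(line.split(":\t", 1)
                      for line in status.read_text().splitlines() if ":\t" in line)
    except OSError:
        continue  # the process exited while we were reading it
    if fields.get("Name", "").strip() == "ceph-osd":
        total += int(fields.get("Threads", "0"))
print("ceph-osd threads on this host:", total)

If the daemon thread count sits close to any of these limits, raising them with ulimit or sysctl is the short-term workaround Greg suggested.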

If there is any plan, it would be good to know, and we would be very interested in getting involved.

Thanks,
Guang

> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com


