Re: librados pthread_create failure

On Mon, Aug 26, 2013 at 9:24 AM, Greg Poirier <greg.poirier@xxxxxxxxxx> wrote:
> So, in doing some testing last week, I believe I managed to exhaust the
> number of threads available to nova-compute. After some investigation, I
> found the pthread_create failure and increased nproc for our Nova user to
> what I considered a ridiculous 120,000 threads, after reading that librados
> requires a thread per OSD, plus a few for overhead, per VM on our compute
> nodes.
>
> This made me wonder: how many threads could Ceph possibly need on one of
> our compute nodes?
>
> 32 cores * an overcommit ratio of 16 (assuming every VM is booted from a
> Ceph volume) * 300 OSDs (the approximate number of disks in our
> soon-to-go-live Ceph cluster) = 153,600 threads.
>
> So this is where I started to put the truck in reverse. Am I right? What
> about when we triple the size of our Ceph cluster? I could easily see a
> future where we have 1,000 disks, if not many more, in our cluster. How do
> people scale this? Do you RAID to increase the density of your Ceph
> cluster? I can only imagine that this will also drastically increase the
> amount of resources required on my data nodes.
>
> So... suggestions? Reading?
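
Running your numbers quickly, plus the tripled case, as a sanity check;
this is plain arithmetic on the figures above, nothing librados-specific,
and it ignores the small per-VM overhead threads:

# Rough thread estimate: about one librados thread per OSD for every VM
# booted from a Ceph volume on a compute node.
cores = 32
overcommit = 16
vms_per_node = cores * overcommit             # 512 VMs per compute node

for osds in (300, 1000):                      # today vs. a ~tripled cluster
    threads = vms_per_node * osds             # per-VM overhead ignored
    print("%4d OSDs -> %6d threads per compute node" % (osds, threads))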

Your math looks right to me. So far, though, it hasn't caused anybody any
trouble; Linux threads are much cheaper than people imagine when they're
inactive. At some point we will certainly need to reduce the thread count
of our messenger (using epoll across many sockets instead of two threads
per socket), but that hasn't happened yet.
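
To illustrate the epoll point, here is a minimal sketch (plain Python, not
Ceph's messenger; the port and buffer size are arbitrary) of one event loop
servicing many sockets instead of a reader/writer thread pair per
connection:

import select
import socket

# One listening socket plus an epoll instance; every accepted connection
# is registered with the same event loop rather than getting its own
# threads.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("0.0.0.0", 7777))          # arbitrary port for the sketch
listener.listen(128)
listener.setblocking(False)

ep = select.epoll()
ep.register(listener.fileno(), select.EPOLLIN)
conns = {}

while True:
    for fd, events in ep.poll(1.0):       # one thread waits on all sockets
        if fd == listener.fileno():
            conn, _ = listener.accept()
            conn.setblocking(False)
            ep.register(conn.fileno(), select.EPOLLIN)
            conns[conn.fileno()] = conn
        elif events & select.EPOLLIN:
            data = conns[fd].recv(4096)
            if not data:                  # peer hung up
                ep.unregister(fd)
                conns.pop(fd).close()
            # a real messenger would parse and dispatch the data here
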
In terms of things you can do if this does become a problem, the most
prominent option is probably to (sigh) partition your cluster into pods on
a per-rack basis or something similar. This is not as bad as it sounds:
your network design would probably prefer not to send all writes through
your core router anyway, so if you create a pool for each rack and spread
its replicas something like "this rack, next rack, next row", you get
better network traffic patterns.
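
If you go that route, the CRUSH side could look roughly like the sketch
below. This is only an illustration: the bucket and rule names, IDs, and
weights are invented, and you'd want one such bucket, rule, and pool per
pod.

# Sketch only: one "pod" root per rack grouping (this rack, the next
# rack, and a rack in the next row), and one rule per pod.
root pod-r1 {
        id -101                        # made-up bucket id
        alg straw
        hash 0  # rjenkins1
        item rack-a1 weight 36.000     # this rack
        item rack-a2 weight 36.000     # next rack
        item rack-b1 weight 36.000     # next row
}

rule pod-r1-rule {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take pod-r1
        step chooseleaf firstn 0 type rack   # one replica per rack in the pod
        step emit
}

You'd then create a pool for that pod, point it at ruleset 3 (the pool's
crush_ruleset setting), and boot the VMs in that rack from volumes in that
pool.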
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com