Gregs are awesome, apparently. Thanks for the confirmation.
I know that threads are lightweight; it's just the first time I've ever run into something that uses them... so liberally. ^_^
On Mon, Aug 26, 2013 at 10:07 AM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
On Mon, Aug 26, 2013 at 9:24 AM, Greg Poirier <greg.poirier@xxxxxxxxxx> wrote:
> So, in doing some testing last week, I believe I managed to exhaust the
> number of threads available to nova-compute. After some investigation, I
> found the pthread_create failure and increased nproc for our Nova user to
> what I considered a ridiculous 120,000 threads, after reading that librados
> will require a thread per OSD, plus a few for overhead, per VM on our
> compute nodes.
>
> This made me wonder: how many threads could Ceph possibly need on one of
> our compute nodes?
>
> 32 cores * an overcommit ratio of 16 (assuming each VM is booted from a
> Ceph volume) * 300 (the approximate number of disks in our soon-to-go-live
> Ceph cluster) = 153,600 threads.
>
> So this is where I started to put the truck in reverse. Am I right? What
> about when we triple the size of our Ceph cluster? I could easily see a
> future where we have 1,000 disks, if not many, many more, in our cluster.
> How do people scale this? Do you RAID to increase the density of your Ceph
> cluster? I can only imagine that this would also drastically increase the
> amount of resources required on my data nodes.
>
> So... suggestions? Reading?
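
For reference, the back-of-the-envelope estimate quoted above works out as in
the short Python sketch below. The core, overcommit, and OSD counts are the
numbers from the mail; the per-client overhead is an assumed constant, and the
limit check only reads the nproc limit of whatever process runs it.

# Back-of-the-envelope version of the estimate above. The core, overcommit
# and OSD counts come from the mail; the per-client overhead is an assumed
# constant standing in for "plus a few for overhead" per VM.
import resource

cores = 32
overcommit_ratio = 16
osds = 300                  # roughly one OSD per disk in the cluster
overhead_per_vm = 10        # assumed extra threads per librados client

vms_per_node = cores * overcommit_ratio              # 512 Ceph-backed VMs
threads_needed = vms_per_node * (osds + overhead_per_vm)

# nproc limit as seen by the process running this check; the limit that
# actually matters is the one inherited by nova-compute for the Nova user.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)

print(f"estimated threads per compute node: {threads_needed}")   # ~158,720
print(f"current RLIMIT_NPROC (soft/hard): {soft}/{hard}")

On those assumptions the estimate lands near the 153,600 figure in the mail,
which is already above the 120,000-thread limit that was set.
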
Your math looks right to me. So far, though, it hasn't caused anybody any
trouble; Linux threads are much cheaper than people imagine when they're
inactive. At some point we will certainly need to reduce the thread count
of our messenger (using epoll across many sockets instead of two threads
per socket), but that hasn't happened yet.
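
For anyone unfamiliar with the pattern being referred to, below is a minimal
sketch of one thread multiplexing many sockets with epoll (via Python's
selectors module). It only illustrates the general technique; it says nothing
about how the Ceph messenger is actually structured, and the address and port
are arbitrary.

# Minimal illustration of servicing many sockets from a single thread with
# epoll (Python's selectors module picks epoll on Linux), instead of
# dedicating one or two threads to every open socket.
import selectors
import socket

sel = selectors.DefaultSelector()

def accept(server_sock):
    conn, _ = server_sock.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, handle)

def handle(conn):
    data = conn.recv(4096)
    if data:
        conn.send(data)             # echo the data back
    else:                           # peer closed the connection
        sel.unregister(conn)
        conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 5000))
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

while True:                         # one thread services every socket
    for key, _events in sel.select():
        key.data(key.fileobj)       # dispatch to accept() or handle()

The point is simply that the thread count stays flat no matter how many
sockets are open, which is the direction described above for the messenger.
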
In terms of things you can do if this does become a problem, the most
prominent is probably to (sigh) partition your cluster into pods on a
per-rack basis or something. This is actually not as bad as it sounds,
since your network design would probably prefer not to send all writes
through your core router; if you create a pool for each rack and do
something like "this rack, next rack, next row" for your replication, you
get better network traffic patterns.
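
To make the traffic argument concrete, here is a toy sketch of the "this rack,
next rack, next row" idea in plain Python. The rack and row names are invented
for the example, and in a real cluster this kind of policy would be expressed
in the CRUSH map rather than in application code.

# Toy illustration of the "this rack, next rack, next row" replication
# pattern: the first copy stays local, the second copy only crosses a rack
# boundary within the same row, and only the third copy crosses a row
# boundary, so most replication traffic avoids the core router.
ROWS = [
    ["row1-rack1", "row1-rack2", "row1-rack3"],
    ["row2-rack1", "row2-rack2", "row2-rack3"],
]

def replica_racks(local_rack):
    """Return (this rack, next rack in the row, a rack in the next row)."""
    for row_idx, row in enumerate(ROWS):
        if local_rack in row:
            rack_idx = row.index(local_rack)
            next_rack = row[(rack_idx + 1) % len(row)]
            next_row = ROWS[(row_idx + 1) % len(ROWS)]
            return local_rack, next_rack, next_row[rack_idx]
    raise ValueError(f"unknown rack: {local_rack}")

print(replica_racks("row1-rack2"))
# ('row1-rack2', 'row1-rack3', 'row2-rack2')
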
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com