On Thu, Jul 1, 2010 at 10:51 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> Hi Paul,
>
> On Thu, 1 Jul 2010, Paul wrote:
>> Follow up on the discussion on IRC late last night:
>>
>> On x86_64 2.6 kernels, pthread_create seems to allocate by default an
>> 8192KB stack (plus a 4KB guard page) for each newly created thread.
>> Since there can be potentially a large number of
>> SimpleMessenger::Pipe instances (for example, when there are many
>> OSDs and they need to heartbeat each other) and each instance has a
>> reader and a writer thread, a system can quickly run out of available
>> memory to create new threads.
>
> This is surprising to me. My understanding is that each thread is
> allocated a big chunk of _virtual_ memory for its stack, but no
> physical pages are allocated until that memory is actually touched.
> That, at least, is what I take away from e.g.
>
> http://www.kegel.com/stackcheck/
>
> Also, looking at the memory map for a random cmon process, I see the
> 8MB stack, but Rss is only 8 KB:
>
> $ cat /proc/$pid/smaps
> [...]
> 7f18b27fe000-7f18b2ffe000 rw-p 00000000 00:00 0
> Size:               8192 kB
> Rss:                   8 kB
> Pss:                   8 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:         8 kB
> Referenced:            8 kB
> Swap:                  0 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> 7f18b2ffe000-7f18b2fff000 ---p 00000000 00:00 0
> Size:                  4 kB
> Rss:                   0 kB
> Pss:                   0 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:         0 kB
> Referenced:            0 kB
> Swap:                  0 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
>
> Do you see a large Rss in your environment? Maybe it's a library
> behavior thing?
>
> Or maybe the problem is just that the virtual memory reserved for
> thread stacks is exhausted. Maybe there is some way to make the
> process initialization reserve a larger area of memory for thread
> stacks?

Yes, this seems to be what is happening. Taking a look, our systems had
ulimit -v set to equal total physical memory. After setting ulimit -v
to unlimited, 32748 threads can be created regardless of the stack size
allocated to each thread; at that point pthread_create fails with
ENOMEM rather than EAGAIN. I'd still prefer to manage the allocated
size though, since those settings might not be totally within our
control.

>> A short term solution would be to decrease the amount of stack space
>> allocated for the reader and writer threads. I guess something along
>> the lines of:
>> http://github.com/tcloud/ceph/commit/39ffa236f3de2082c475a5ea5edc8afa09941bd6
>> and
>> http://github.com/tcloud/ceph/commit/1dbd42a5c4b064c581ddc152d41b9553f346df8a
>
> This seems reasonable as a workaround.
>
>> Yehudasa suggested a stacksize of 512KB, and it seems to work fine.
>
> Looking at the Rss value for stack threads in /proc/$pid/smaps would
> be a pretty good way to see what kind of stack utilization those
> threads are seeing. I suspect something much smaller than 512KB would
> be safe (16 KB?).

Here's an excerpt from a random OSD's smaps:

7fecf7974000-7fecf79f4000 rw-p 00000000 00:00 0
Size:                512 kB
Rss:                  12 kB
Pss:                  12 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:        12 kB
Referenced:           12 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB

This is great info (and much more precise than trial and error!) -- it
would appear 16KB is a safe minimum.
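For reference, the mechanics behind those commits come down to a
pthread_attr_setstacksize() call before pthread_create(). Below is a
minimal standalone sketch of the idea -- this is not the actual Ceph
Thread wrapper, spawn_with_stack()/worker() are made-up names, and the
512KB value is simply the figure discussed above:

// build: g++ -pthread stacksize.cc
#include <pthread.h>
#include <cstdio>
#include <cstring>

// Stand-in for a Pipe reader/writer loop; the only requirement is that
// it never touches more stack than we reserve below.
static void *worker(void *arg)
{
  (void)arg;
  return NULL;
}

// Create a thread with an explicit stack size instead of the default
// (the RLIMIT_STACK soft limit, typically 8192KB on x86_64).
static int spawn_with_stack(pthread_t *tid, size_t stacksize)
{
  pthread_attr_t attr;
  pthread_attr_init(&attr);

  // Must be at least PTHREAD_STACK_MIN; 512KB is the value suggested
  // above, and the smaps numbers suggest far less is actually touched.
  int r = pthread_attr_setstacksize(&attr, stacksize);
  if (r == 0)
    r = pthread_create(tid, &attr, worker, NULL);
  pthread_attr_destroy(&attr);
  return r;   // 0 on success, otherwise an errno value (EAGAIN, ...)
}

int main()
{
  pthread_t tid;
  int r = spawn_with_stack(&tid, 512 * 1024);   // 512KB instead of 8MB
  if (r) {
    fprintf(stderr, "pthread_create failed: %s\n", strerror(r));
    return 1;
  }
  pthread_join(tid, NULL);
  return 0;
}

Note that pthread_create returns an errno-style value directly
(EAGAIN, ENOMEM, ...), which is what we were seeing above.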
>> However, as the cluster grows, there will eventually be some point
>> where we hit a hard limit on either the number of concurrent threads
>> or the number of concurrent tcp connections. Is it possible to
>> redesign SimpleMessenger and/or the heartbeat mechanism so that only
>> a constant number of connections are established?
>
> Well, the number of peers an OSD has is generally bounded (it's
> related to the number of PGs each OSD gets). The number of clients is
> not, though. The messenger should put the Pipes in some sort of LRU so
> that it can close out old, idle connections. For the MDS the
> connection state needs to stick around, but it shouldn't be hard to
> make the reader/writer threads stop when it goes into a STANDBY state
> (if they don't already).
>
> Adding a hard limit is also doable, although I would worry about that
> just slowing things down in large clusters when peers keep having to
> reconnect.
>
> sage

Thanks,
Paul C
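P.S. To make the LRU idea above a bit more concrete, here is a rough
sketch of the bookkeeping only -- this is not the actual SimpleMessenger
code, and FakePipe/PipeLRU are invented names. In the real messenger the
eviction step would also have to stop the Pipe's reader/writer threads
and close the socket (and, as Sage says, keep MDS connection state
around):

#include <ctime>
#include <iostream>
#include <list>
#include <map>
#include <string>

// Placeholder for a messenger connection; a real Pipe carries a socket,
// message queues and the reader/writer threads.
struct FakePipe {
  std::string peer;
  time_t last_used;
  explicit FakePipe(const std::string &p) : peer(p), last_used(time(NULL)) {}
};

// Keep open connections in most-recently-used order and close the
// coldest ones once a cap is exceeded.
class PipeLRU {
public:
  explicit PipeLRU(size_t max) : max_open(max) {}

  // Called whenever we send to or hear from a peer.
  void touch(const std::string &peer) {
    std::map<std::string, std::list<FakePipe>::iterator>::iterator p =
      by_peer.find(peer);
    if (p != by_peer.end())
      lru.erase(p->second);               // drop the old position
    lru.push_front(FakePipe(peer));       // now the most recently used
    by_peer[peer] = lru.begin();

    while (lru.size() > max_open) {       // evict least recently used
      std::cout << "closing idle pipe to " << lru.back().peer << std::endl;
      by_peer.erase(lru.back().peer);
      lru.pop_back();
    }
  }

private:
  size_t max_open;
  std::list<FakePipe> lru;                // front = most recently used
  std::map<std::string, std::list<FakePipe>::iterator> by_peer;
};

int main() {
  PipeLRU pipes(2);
  pipes.touch("osd0");
  pipes.touch("osd1");
  pipes.touch("client.1234");   // evicts osd0, the least recently used
  return 0;
}

A hard cap like max_open is exactly the kind of limit Sage is wary of;
an idle-timeout sweep over the same list would avoid forcing busy peers
to reconnect.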