Re: OSD scalability & thread stacksize

On Thu, Jul 1, 2010 at 8:10 PM, Paul <paul_chiang@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Thu, Jul 1, 2010 at 10:51 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > Hi Paul,
> >
> > On Thu, 1 Jul 2010, Paul wrote:
> >> Follow up on the discussion on IRC late last night:
> >>
> >> On x86_64 2.6 kernels, pthread_create seems to allocate an 8192 KB
> >> stack by default for each newly created thread. Since there can
> >> potentially be a large number of SimpleMessenger::Pipe instances (for
> >> example, when there are many OSDs and they need to heartbeat each
> >> other) and each instance has a reader and a writer thread, a system
> >> can quickly run out of available memory to create new threads.
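For a rough sense of scale: under NPTL the default thread stack size is taken
from the RLIMIT_STACK soft limit (typically 8 MB), so the address space
reserved for stacks grows as peers x 2 threads x stack size. A minimal sketch
of that arithmetic -- the 1000-peer figure is just an illustrative assumption:

// rough_stack_budget.cc: back-of-the-envelope address-space cost of one
// reader + one writer thread per peer at the default stack size.
#include <cstdio>
#include <sys/resource.h>

int main() {
  struct rlimit rl;
  if (getrlimit(RLIMIT_STACK, &rl) != 0 || rl.rlim_cur == RLIM_INFINITY)
    rl.rlim_cur = 8 * 1024 * 1024;   // fall back to the common 8 MB default
  size_t stack = rl.rlim_cur;        // NPTL uses this as the default thread stack
  size_t peers = 1000;               // hypothetical peer count
  size_t threads = peers * 2;        // each Pipe has a reader and a writer
  printf("per-thread stack %zu KB, %zu threads -> %zu MB of address space\n",
         stack / 1024, threads, threads * stack / (1024 * 1024));
  return 0;
}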
> >
> > This is surprising to me.  My understanding is that each thread is
> > allocated a big chunk of _virtual_ memory for its stack, but no physical
> > pages are allocated until that memory is actually touched.  That, at
> > least, is what I take away from e.g.
> >
> > http://www.kegel.com/stackcheck/
> >
> > Also, looking at the memory map for a random cmon process, I see the 8MB
> > stack, but Rss is only 8 KB:
> >
> > $ cat /proc/$pid/smaps
> > [...]
> > 7f18b27fe000-7f18b2ffe000 rw-p 00000000 00:00 0
> > Size:               8192 kB
> > Rss:                   8 kB
> > Pss:                   8 kB
> > Shared_Clean:          0 kB
> > Shared_Dirty:          0 kB
> > Private_Clean:         0 kB
> > Private_Dirty:         8 kB
> > Referenced:            8 kB
> > Swap:                  0 kB
> > KernelPageSize:        4 kB
> > MMUPageSize:           4 kB
> > 7f18b2ffe000-7f18b2fff000 ---p 00000000 00:00 0
> > Size:                  4 kB
> > Rss:                   0 kB
> > Pss:                   0 kB
> > Shared_Clean:          0 kB
> > Shared_Dirty:          0 kB
> > Private_Clean:         0 kB
> > Private_Dirty:         0 kB
> > Referenced:            0 kB
> > Swap:                  0 kB
> > KernelPageSize:        4 kB
> > MMUPageSize:           4 kB
> >
> > Do you see a large Rss in your environment?  Maybe it's a library behavior
> > thing?
> >
> > Or maybe the problem is just that the virtual memory reserved for thread
> > stacks is exhausted.  Maybe there is some way to make the process
> > initialization reserve a larger area of memory for thread stacks?
> >
> Yes, this seems to be what is happening. Taking a look, our systems had
> ulimit -v set equal to total physical memory. After setting ulimit -v
> to unlimited, 32748 threads can be created regardless of the stack size
> allocated to each thread, at which point pthread_create returns ENOMEM
> rather than EAGAIN. I'd still prefer to manage the allocated size
> though, since kernel settings might not be totally within our control.
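If raising the limit from inside the daemon is acceptable, the address-space
cap that ulimit -v imposes can also be lifted programmatically at startup. A
minimal sketch, assuming the soft RLIMIT_AS limit sits below the hard limit
(an unprivileged process can only raise the soft limit up to the hard one):

#include <cstdio>
#include <sys/resource.h>

// Best-effort: lift the virtual address-space cap so that thread-stack
// reservations (virtual memory, mostly never touched) don't trip 'ulimit -v'.
static void raise_address_space_limit() {
  struct rlimit rl;
  if (getrlimit(RLIMIT_AS, &rl) == 0 && rl.rlim_cur != rl.rlim_max) {
    rl.rlim_cur = rl.rlim_max;   // raising the hard limit needs CAP_SYS_RESOURCE
    if (setrlimit(RLIMIT_AS, &rl) != 0)
      perror("setrlimit(RLIMIT_AS)");
  }
}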
>
> >> A short term solution would be to decrease the amount of stack space
> >> allocated for the reader and writer threads. I guess something along
> >> the lines of:
> >> http://github.com/tcloud/ceph/commit/39ffa236f3de2082c475a5ea5edc8afa09941bd6
> >> and
> >> http://github.com/tcloud/ceph/commit/1dbd42a5c4b064c581ddc152d41b9553f346df8a
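Without digging into those commits, the general shape of that kind of
workaround is to pass an attribute with a smaller stack when the Pipe reader
and writer threads are spawned. A minimal sketch -- the helper name and the
idea of pulling stack_bytes from a config option are illustrative, not the
actual SimpleMessenger code:

#include <cstddef>
#include <pthread.h>

// Create a thread with an explicit (smaller) stack size instead of the
// ~8 MB default; stack_bytes == 0 keeps the system default.
int spawn_with_stack(pthread_t *tid, void *(*fn)(void *), void *arg,
                     size_t stack_bytes) {
  pthread_attr_t attr;
  pthread_attr_init(&attr);
  if (stack_bytes)
    pthread_attr_setstacksize(&attr, stack_bytes);   // e.g. 512 * 1024
  int r = pthread_create(tid, &attr, fn, arg);
  pthread_attr_destroy(&attr);
  return r;
}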
> >
> > This seems reasonable as a workaround.
> >
> >> Yehudasa suggested a stacksize of 512KB, and it seems to work fine.
> >
> > Looking at the Rss value for stack threads in /proc/$pid/smaps would be a
> > pretty good way to see what kind of stack utilization those threads are
> > seeing.  I suspect something much smaller than 512KB would be safe (16
> > KB?).
> >
> Here's an excerpt from a random OSD's smaps:
> 7fecf7974000-7fecf79f4000 rw-p 00000000 00:00 0
> Size:                512 kB
> Rss:                  12 kB
> Pss:                  12 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:        12 kB
> Referenced:           12 kB
> Swap:                  0 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
>
> This is great info (and much more precise than trial and error!) -- it
> would appear 16KB is a safe minimum.
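One practical wrinkle with reading stack Rss out of smaps is knowing which
anonymous mapping belongs to which thread. A thread can report its own stack
range (glibc-specific, via pthread_getattr_np), which can then be matched
against the smaps entries; a minimal sketch:

#include <cstdio>
#include <pthread.h>

// Print the calling thread's stack base and size (glibc extension); the
// range can be matched against /proc/$pid/smaps to read that stack's Rss.
static void print_my_stack_range() {
  pthread_attr_t attr;
  if (pthread_getattr_np(pthread_self(), &attr) == 0) {
    void *base = 0;
    size_t size = 0;
    pthread_attr_getstack(&attr, &base, &size);
    fprintf(stderr, "stack: %p - %p (%zu KB)\n",
            base, (void *)((char *)base + size), size / 1024);
    pthread_attr_destroy(&attr);
  }
}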

(resending reply due to some technical problem)

I'd be really careful about that. 16 KB is pretty low for a userspace
application, and there are a few places where we allocate things
dynamically on the stack, so this test might not reflect the true
requirements. We do need to make sure our stack allocation is
conservative.
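To make that concrete, a single largish local buffer or alloca anywhere down
the call chain is enough to blow through a 16 KB stack. A contrived
illustration (the 32 KB buffer is not a real code path):

#include <cstring>

// With a 16 KB thread stack, merely entering this function overflows it:
// the local buffer alone is twice the size of the whole stack.
static void touch_big_local() {
  char scratch[32 * 1024];
  memset(scratch, 0, sizeof(scratch));   // actually touch the pages
}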


Yehuda

