No hard feelings, we are all learning here :)
I would have sworn that the first version that gives multiple core files was 2.4.3, but perhaps I am wrong.
I do know that doing post-mortems of multithreaded apps is no longer the hellish ordeal that it used to be. It may be a slight pain to go through a large number of core files, in search of one that was actually not sleeping, but once the correct one is located, debugging is straightforward and reliable.
Glad you found your problem, though.
Don
Usman S. Ansari writes:
I hope there are no hard feelings; I only said that to clear up my own understanding!
My configuration is running Linux 2.4.4 for the PowerPC 8xx series. There are 15 threads running, including the master thread. The distribution is MontaVista; I can get the gdb / compiler / library versions tomorrow from work.
This problem was happening at two points.
One of the threads waits on a select, then touches the watchdog and goes back to the select again. The other thread is always waiting in a recv call for clients to send data.
These two cases were very easily reproducible. The core would always show select or recv. After two weeks of investigation, the cause was found in another thread.
Does the scenario listed below also apply to 2.4.4?
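For readers trying to picture the first thread, here is a minimal sketch of a select-then-touch-watchdog loop. It is not the code from the application discussed here: `touch_watchdog`, `wait_then_touch`, `watchdog_thread`, the one-second timeout and the counter are all hypothetical names and values chosen for illustration, and the real watchdog is assumed to be some board-specific operation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical stand-in for the real (board-specific) watchdog kick;
 * here it only counts how often it was called. */
int watchdog_touches = 0;

void touch_watchdog(void) { watchdog_touches++; }

/* Wait on fd for at most timeout_ms, then touch the watchdog.
 * Returns 1 if fd is readable, 0 on timeout or EINTR, -1 on other errors. */
int wait_then_touch(int fd, long timeout_ms)
{
    fd_set readfds;
    struct timeval tv;
    int ready;

    assert(fd >= 0 && fd < FD_SETSIZE);  /* select() cannot watch larger fds */

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    ready = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (ready < 0 && errno != EINTR)
        return -1;

    touch_watchdog();
    return ready > 0 ? 1 : 0;
}

/* Thread body: the thread spends nearly all of its time blocked in
 * select(), which is why a core taken from it shows select() on top
 * of the stack. */
void *watchdog_thread(void *arg)
{
    int fd = *(int *)arg;
    char byte;
    int rc;

    while ((rc = wait_then_touch(fd, 1000)) >= 0) {
        /* Drain one byte so a readable fd does not make the loop spin. */
        if (rc == 1 && read(fd, &byte, 1) <= 0)
            break;
    }
    return NULL;
}
```

A thread like this is blocked in the kernel almost all the time, so a backtrace that ends in select says very little about where the fault actually happened.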
Usman
Don Dade wrote:
As for your program's behavior, it has something to do with the fact that kernels before 2.4.3 did not generate a core file per thread associated with a process (e.g. core.<pid>) and further would not generate a core file for a process that is sharing virtual memory with another process. So when the app dumps core, the kernel goes about killing the processes in the group, but not generating core files until the last one, because the last thread is the first thread that isn't sharing part of its virtual address space (there's no one left with which to share). So an arbitrary thread gets the core file, which is most likely not the one that generated the signal. gdb 5.1.1 (I think) is fully thread aware, but there's just no way before 2.4.3 to know if the core file corresponds to the thread that caused the app to terminate. This is not the sort of knowledge I rely on every day, so my recollection might be incorrect; please correct me if I'm wrong.
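When the core file cannot be trusted to belong to the faulting thread, one possible workaround (not something proposed in this thread, just a sketch) is to have the application log which thread received the signal before it dies. A synchronous signal such as SIGSEGV is delivered to the thread that caused it, so the handler runs in the guilty thread. The names `format_fault_line`, `fault_handler` and `install_fault_reporter` are hypothetical. The sketch also assumes a kernel that provides the gettid system call, which is newer than the 2.4.4 discussed here; with LinuxThreads, where every thread has its own PID, getpid() would presumably serve the same purpose.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Build "fault in thread <tid>\n" in buf using only memcpy and integer
 * arithmetic (no printf), so it can be called from a signal handler.
 * Returns the number of bytes written, or 0 if buf is too small. */
size_t format_fault_line(char *buf, size_t size, unsigned long tid)
{
    static const char prefix[] = "fault in thread ";
    const size_t plen = sizeof(prefix) - 1;
    char digits[24];
    size_t ndigits = 0, len;

    /* Collect decimal digits, least significant first. */
    do {
        digits[ndigits++] = (char)('0' + tid % 10);
        tid /= 10;
    } while (tid != 0);

    if (size < plen + ndigits + 1)
        return 0;

    memcpy(buf, prefix, plen);
    len = plen;
    while (ndigits > 0)
        buf[len++] = digits[--ndigits];
    buf[len++] = '\n';
    return len;
}

/* Runs in the thread that received the signal: report its id on stderr,
 * then restore the default action and re-raise so the process still
 * terminates and dumps core as usual. */
void fault_handler(int sig)
{
    char line[64];
    size_t len = format_fault_line(line, sizeof line,
                                   (unsigned long)syscall(SYS_gettid));

    if (len > 0 && write(STDERR_FILENO, line, len) < 0) {
        /* Nothing useful can be done about a failed write here. */
    }
    signal(sig, SIG_DFL);
    raise(sig);
}

/* Install the reporter for SIGSEGV; returns 0 on success, -1 on error. */
int install_fault_reporter(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = fault_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Calling `install_fault_reporter()` once at startup would be enough, since signal dispositions are shared by all threads of a process. The logged id can then be compared against the threads seen in gdb to tell whether the core came from the thread that actually faulted.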
(2) When one of the threads gets a signal, say signal 11, then core is dumped. Recently, my
application was coring. Looking at the stack trace, it always showed that one of the threads,
which is sleeping most of the time in a select call, was getting the signal. In reality, the
cause of the problem was another thread, which I found out after some struggle.
My understanding is that binfmt_elf.c dumps core of all relevant pages, including the stacks of all
the threads; was it gdb that was not pthread aware?
Usman
Usman S. Ansari
-- Kernelnewbies: Help each other learn about the Linux kernel. Archive: http://mail.nl.linux.org/kernelnewbies/ FAQ: http://kernelnewbies.org/faq/