On Mon, 16 Jun 2003, James Olin Oden wrote:

> On Mon, 16 Jun 2003, Bill Nottingham wrote:
>
> > James Olin Oden (joden@xxxxxxxxxxxxxxxxxxxxx) said:
> > > On Mon, 16 Jun 2003, Bill Nottingham wrote:
> > >
> > > > James Olin Oden (joden@xxxxxxxxxxxxxxxxxxxxx) said:
> > > > > and looked at things.  The last syscall I see init in after
> > > > > running the init 6, is:
> > > > >
> > > > >     futex(0x4212f1f4, FUTEX_WAIT, -1, NULL
> > > >
> > > > What glibc are you running?
> > > >
> > > I am running:
> > >
> > >     glibc-2.3.2-27.9
> > >
> > > I think this is the latest errata...I just downloaded all the errata (well
> > > what I did not have) today, and it was the most recent one.  BTW, I was
> > > trying to recompile this version of glibc without stripping its symbols,
> > > and I get the following error:
> >
> > Are you running the errata kernel as well?
> >
> I am now running with 2.4.20-18.9bigmem, and the problem is still
> occuring.
>

Got it!  Here is what is happening when you run init 6 with the debug
output turned on in init:

1) init reads the fifo when it gets around to it.
2) It sees there is a request for a runlevel change (6), and begins
   killing the appropriate processes.
3) One of those processes will be a getty, inevitably.  The getty goes
   away, and inevitably some children are left behind.  They are given
   to init by the kernel, and the kernel sends SIGCHLD to init.
4) Meanwhile, back in init, it has been going through its init_main
   loop again, and is printing debug output to this effect and sending
   it to syslog.  When it sends the message via the syslog call, a
   futex-backed lock is taken so that nothing else can enter the syslog
   code until it is done.
5) While it is still in the glibc code, init receives the SIGCHLD, and
   in the child handler it calls log() again, set to send output to
   both syslog and the console.
6) When it tries to send the child handler's log message to syslog, it
   enters the glibc code that blocks waiting on the futex...and there
   it sits.

I patched init to block all signals while talking to syslog, and this
seems to have fixed it.  I will submit a patch via bugzilla in the
morning.  This probably only happens on our dual-processor machines
because there the SIGCHLD can truly be sent asynchronously from init.
That is my theory anyway.

This was problem number two, though, so I will be back on problem
number one soon (the internal buffer overflow).  I am pretty sure what
is happening in that scenario:

1) init goes to print "Entering runlevel 4", only the runlevel data is
   munged, causing a segfault in syslog.
2) The segv handler is kicked off and tries to log its message.  It
   can't, because the lock on the syslog code has not gone away.
3) init hangs waiting on the futex.

This corruption is much more infrequent, though (sometimes requiring
hundreds of reboots), but with the patch I did, I expect to still see
it happen, only this time we should get a core.

Cheers...james

> Cheers...james
> > Bill

_______________________________________________
Redhat-devel-list mailing list
Redhat-devel-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/redhat-devel-list