Re: core dump analysis, was Re: stack smashing detected

On Wed, 5 Apr 2023, Michael Schmitz wrote:

On 4/04/23 12:13, Finn Thain wrote:
It looks like I messed up. waitproc() appears to have been invoked
twice, which is why wait3 was invoked twice...

GNU gdb (Debian 13.1-2) 13.1
...
(gdb) set osabi GNU/Linux
(gdb) file /bin/dash
Reading symbols from /bin/dash...
Reading symbols from
/usr/lib/debug/.build-id/aa/4160f84f3eeee809c554cb9f3e1ef0686b8dcc.debug...
(gdb) b waitproc
Breakpoint 1 at 0xc346: file jobs.c, line 1168.
(gdb) b jobs.c:1180
Breakpoint 2 at 0xc390: file jobs.c, line 1180.
(gdb) run
Starting program: /usr/bin/dash
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/m68k-linux-gnu/libthread_db.so.1".
# x=$(:)
[Detaching after fork from child process 570]

Breakpoint 1, waitproc (status=0xeffff86a, block=1) at jobs.c:1168
1168    jobs.c: No such file or directory.
(gdb) c
Continuing.

Breakpoint 2, waitproc (status=0xeffff86a, block=1) at jobs.c:1180
1180    in jobs.c
(gdb) info locals
oldmask = {__val = {1997799424, 49154, 396623872, 184321, 3223896090, 53249,
     3836788738, 1049411610, 867225601, 3094609920, 0, 1048580, 2857693183,
     4184129547, 3435708442, 863764480, 184321, 3844141055, 4190425089,
     4127248385, 3094659084, 597610497, 4135112705, 3844079616, 131072,
     37355520, 184320, 3878473729, 3844132865, 3094663168, 3549089793,
     3844132865}}
flags = 2
err = 570
oldmask = <optimized out>
flags = <optimized out>
err = <optimized out>
(gdb) c
Continuing.

Breakpoint 1, waitproc (status=0xeffff86a, block=0) at jobs.c:1168
1168    in jobs.c
(gdb) c
Continuing.

Breakpoint 2, waitproc (status=0xeffff86a, block=0) at jobs.c:1180
1180    in jobs.c
(gdb) info locals
oldmask = {__val = {1997799424, 49154, 396623872, 184321, 3223896090, 53249,
     3836788738, 1049411610, 867225601, 3094609920, 0, 1048580, 2857693183,
     4184129547, 3435708442, 863764480, 184321, 3844141055, 4190425089,
     4127248385, 3094659084, 597610497, 4135112705, 3844079616, 131072,
     37355520, 184320, 3878473729, 3844132865, 3094663168, 3549089793,
     3844132865}}
flags = 3
err = -1
oldmask = <optimized out>
flags = <optimized out>
err = <optimized out>
(gdb) c
Continuing.
#

That means we may well see both signals delivered at the same time if the
parent shell wasn't scheduled to run until the second subshell terminated
(which answers the question I was about to ask in your other mail, the one
about the crashy script with multiple subshells).


How is that possible? If the parent does not get scheduled, the second 
fork will not take place.

Now does waitproc() handle that case correctly? The first signal 
delivered results in err == child PID so the break is taken, causing 
exit from waitproc().

I don't follow. Can you rephrase that perhaps?

For a single subshell, the SIGCHLD signal can be delivered before wait4 is 
called or after it returns. For example, $(sleep 5) seems to produce the 
latter whereas $(:) tends to produce the former.
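
Here's a minimal standalone sketch (ordinary C, not dash itself) that 
shows the two orderings -- the handler flag records whether SIGCHLD came 
in before the wait call:

/* sigchld-race.c -- a minimal sketch (not dash code): does SIGCHLD
 * arrive before the wait call, or only after it returns?  Run with
 * no argument for the $(:) case, with any argument for $(sleep 5).
 */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t gotsigchld;

static void handler(int sig)
{
	(void)sig;
	gotsigchld = 1;
}

int main(int argc, char *argv[])
{
	struct sigaction sa = { .sa_handler = handler };
	int status;
	pid_t pid;

	sigaction(SIGCHLD, &sa, NULL);

	pid = fork();
	if (pid == 0) {
		if (argc > 1)
			sleep(5);	/* the "sleep 5" child */
		_exit(0);		/* the ":" child */
	}

	usleep(100000);		/* give an instant child time to die */
	printf("before wait: gotsigchld=%d\n", (int)gotsigchld);

	while (waitpid(pid, &status, 0) < 0 && errno == EINTR)
		;		/* retry on EINTR, as dash does */

	printf("after wait:  gotsigchld=%d\n", (int)gotsigchld);
	return 0;
}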

Does waitproc() get called repeatedly until an error is returned?


It's complicated...
https://sources.debian.org/src/dash/0.5.12-2/src/jobs.c/?hl=1122#L1122
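
From memory, the loop is shaped roughly like this (a simplified 
paraphrase, not the actual source -- see the URL above; the names match 
the oldmask/flags/err locals in the gdb session):

	do {
		gotsigchld = 0;
		do
			err = wait3(status, flags, NULL);
		while (err < 0 && errno == EINTR);

		/* reaped a child, or non-blocking and none ready (err = -1) */
		if (err || (err = -!block))
			break;

		sigblockall(&oldmask);
		while (!gotsigchld && !pending_sig)
			sigsuspend(&oldmask);	/* sleep until SIGCHLD */
		sigclearmask();
	} while (gotsigchld);

So a blocking call parks in sigsuspend() until the SIGCHLD handler sets 
gotsigchld, then retries wait3(); whether waitproc() gets called again 
after it returns is up to the caller, dowait().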

I don't care that much what dash does as long as it isn't corrupting its 
own stack, which is a real possibility, and one which a gdb data 
watchpoint would normally resolve. And yet I have no way to tackle that.
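
(For the record, the sort of session I mean would look something like the 
below -- the canary slot offset is made up for illustration. And gdb on 
this hardware would likely have to fall back to software watchpoints, 
which single-step the target and would distort the timing even more.)

(gdb) b waitproc
(gdb) run
...
Breakpoint 1, waitproc (status=..., block=1) at jobs.c:1168
(gdb) watch -l *(unsigned long *)($sp + 44)
(gdb) c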

I've been running gdb under QEMU, where the failure is not reproducible. 
Running dash under gdb on real hardware is doable (RAM permitting). But 
the failure is intermittent even then -- it only happens during execution 
of certain init scripts, and I can't reproduce it by manually running 
those scripts.

(Even if I could reproduce the failure under gdb, instrumenting execution 
in gdb can alter timing in undesirable ways...)

So, again, the best avenue I can think of for such experiments is to 
modify the kernel to keep track of the timing of the wait4 syscalls and 
signal delivery, and/or to push the timing one way or the other, e.g. by 
delaying signal delivery, altering scheduler behaviour, etc. But I don't 
have code for that. I did try adding random delays around kernel_wait4() 
but it didn't have any effect...
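
For the record, the sort of thing I mean is an untested sketch along 
these lines: a pr_info() at the top of kernel_wait4() in kernel/exit.c, 
and a matching one where do_notify_parent() in kernel/signal.c signals 
the parent, so the two event streams can be correlated by timestamp (an 
mdelay() in the latter would be one way to push the timing around):

/* untested sketch -- at entry to kernel_wait4(), kernel/exit.c */
pr_info("wait4 enter: comm=%s pid=%d opts=%#x t=%llu\n",
	current->comm, task_pid_nr(current), options, ktime_get_ns());

/* untested sketch -- in do_notify_parent(), kernel/signal.c,
 * just before SIGCHLD is queued for the parent */
pr_info("SIGCHLD: child=%d parent=%s/%d t=%llu\n",
	task_pid_nr(tsk), tsk->parent->comm,
	task_pid_nr(tsk->parent), ktime_get_ns());
/* mdelay(10); */	/* needs <linux/delay.h>; skews the race */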


