On Fri, 7 Apr 2023, Michael Schmitz wrote:
On Wed, 5 Apr 2023 at 14:00, Finn Thain wrote:
On Wed, 5 Apr 2023, Michael Schmitz wrote:
That means we may well see both signals delivered at the same time if
the parent shell wasn't scheduled to run until the second subshell
terminated (answering the question I was about to ask on your other
mail, the one about the crashy script with multiple subshells).
How is that possible? If the parent does not get scheduled, the second
fork will not take place.
I assumed subshells could run asynchronously, and the parent shell
continue until it hits a statement that needs the result of one of the
subshells.
That would be nice but I don't think dash is so sophisticated as to keep
track of data dependencies between the various expressions and commands in
a shell script.
What is the point of subshells, if not to allow this?
$ x=$(exit 123)
$ echo $?
123
$ set -e
$ x=$(false)
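For comparison (a sketch, not taken from the init scripts in question): as
far as I can tell, dash's command substitution blocks until the child has
exited, and a subshell only runs concurrently with the parent if it is
explicitly backgrounded:

x=$(sleep 2; echo slow)    # parent blocks here until the subshell exits
echo "$x"                  # prints "slow" after ~2 seconds
( sleep 2; echo slow ) &   # only an explicit '&' lets the parent go on
echo meanwhile             # runs immediately
wait                       # parent blocks here to collect the child

So the wait4() happens as soon as the assignment needs the value, not at
some later statement that consumes it.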
Anyway, my gut says that we're barking up the wrong tree. My recent tests
show that the failure is not uniformly random: either the script fails
often or it doesn't fail at all. It's as if there were some unknown
variable that caused dash to corrupt its own stack.
Running dash under gdb on real hardware is doable (RAM permitting).
But the failure is intermittent even then -- it only happens during
execution of certain init scripts, and I can't reproduce it by
manually running those scripts.
(Even if I could reproduce the failure under gdb, instrumenting
execution in gdb can alter timing in undesirable ways...)
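One thing that would avoid gdb's effect on timing: a post-mortem look at
the core file that the abort already produces (a sketch; the dash path and
where the core ends up depend on this system's core_pattern):

ulimit -c unlimited                # make sure cores aren't truncated
sh /etc/init.d/mountdevsubfs.sh    # wait for a "stack smashing" abort
gdb /bin/dash core                 # load the executable and the core file
(gdb) bt full                      # stack may be damaged, but it's a start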
So, again, the best avenue I can think of for such experiments is to
modify the kernel to either keep track of the times of the wait4
syscalls and
The easiest way to do that is to log all wait and signal syscalls, as
well as process exit. That might alter timing if these log messages go
to the serial console though. Is that what you have in mind?
What I had in mind was collecting measurements in such a way that they
would not impact timing, perhaps by storing them somewhere they could be
retrieved from the process core dump.
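Something that might come close without touching the serial console:
enable a few tracepoints and read the ftrace ring buffer after the run. A
rough sketch, assuming event tracing is available in this kernel config (I
haven't checked what the m68k build actually offers):

cd /sys/kernel/tracing     # or /sys/kernel/debug/tracing on older setups
echo 1 > events/sched/sched_process_fork/enable
echo 1 > events/sched/sched_process_wait/enable
echo 1 > events/sched/sched_process_exit/enable
echo 1 > events/signal/signal_generate/enable
echo 1 > events/signal/signal_deliver/enable
echo 1 > tracing_on
sh /etc/init.d/mountdevsubfs.sh
echo 0 > tracing_on
cat trace                  # timestamped fork/wait/exit/signal events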
But stashing data where a core dump would capture it is probably not
realistic, and it's probably pointless anyway -- I don't expect to find an
old bug in common code like kernel/exit.c, or in a hot path like those in
arch/m68k/kernel/entry.S.
More likely is that some kind of bug in dash causes it to corrupt its own
stack when conditions are just right. I just need to figure out how to
recreate those conditions. :-/
When dash is feeling crashy, you can get results like this:
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
*** stack smashing detected ***: terminated
Aborted (core dumped)
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
*** stack smashing detected ***: terminated
Aborted (core dumped)
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
*** stack smashing detected ***: terminated
Aborted (core dumped)
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
*** stack smashing detected ***: terminated
Aborted (core dumped)
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~#
But when it's not feeling crashy, you can't:
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
Warning: mountdevsubfs should be called with the 'start' argument.
root@debian:~# sh /etc/init.d/mountdevsubfs.sh
Warning: mountdevsubfs should be called with the 'start' argument.
The only way I have found to alter dash's inclination to crash is to
reboot. (I said previously I was unable to reproduce this in a single user
mode shell but it turned out to be more subtle.)
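To put a number on "feeling crashy", something like this could count the
aborts over repeated runs (a sketch; it assumes the glibc message always
reaches stderr):

runs=0; crashes=0
while [ $runs -lt 50 ]; do
        sh /etc/init.d/mountdevsubfs.sh > /tmp/run.$runs 2>&1
        grep -q 'stack smashing' /tmp/run.$runs && crashes=$((crashes + 1))
        runs=$((runs + 1))
done
echo "$crashes stack-smashing aborts in $runs runs"

That would also show whether the crashy/non-crashy state ever flips within
a single boot.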
signal delivery and/or push the timing one way or the other e.g. by
delaying signal delivery, altering scheduler behaviour, etc. But I
don't have code for that. I did try adding random delays around
kernel_wait4() but it didn't have any effect...
I wonder whether it's possible to delay process exit (and parent process
signaling) by placing the exit syscall on a timer workqueue. But the
same effect could be had by inserting a sleep before subshell exit ...
And causing a half-dead task to schedule in order to delay signaling
doesn't seem safe to me ...
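The userspace variant would at least be cheap to try -- e.g. pad one of
the script's command substitutions (placeholder names, just to show the
shape):

x=$(some_command)             # original
x=$(some_command; sleep 1)    # delays the child's exit and its SIGCHLD
x=$(some_command); sleep 1    # or delays the parent before its next fork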