FPU Emulation and Signals - An Alternative Fix

"Kevin D. Kissell" <kevink@mips.com> · Sat, 4 Aug 2001 16:06:37 +0200

To recap, Pete found a hole in the FPU emulator code,
wherein a signal delivered between the setup and the
execution of the trampoline code for an instruction in
the delay slot of an emulated floating point branch
would trash the trampoline code and cause Bad
Things to happen.  Adding more space between
the user stack pointer and the trampoline code 
solves the problem in a particular case, but is 
provably still at risk, since signal handlers can
allocate arbitrary amounts of stack storage.
Carsten and I both developed patches to inhibit
the delivery of signals during the trampoline.
This had the defect of causing processes to block
indefinitely in cases where the instruction being
executed by the trampoline itself causes an exception,
as Carsten was able to demonstrate using "crashme".

My patch used relatively heavyweight high-level
mechanisms, which have the virtue of being easily
configurable to allow certain signals from which the
signal handler is unable or unlikely to return (such
as SIGILL, SIGSEGV, and SIGBUS) to be delivered.
That seems to be OK for both crashme and Pete's
program, but I believe it is still broken for things like
trap instructions in the branch delay slot, and for
ptrace() single-stepping.  So I've got another idea
that I think is much better, and I'm kicking myself for
not having thought of it earlier.

The idea is simply to ensure that signal stack
frames always leave a gap, large enough for the
trampoline, between the current user SP and the frame itself.  If no
signals fire, fine.  If a signal is delivered and 
caught, its frame will be beyond the FP emulator
trampoline.  This is a trivial hack to get_sigframe()
that should be completely harmless (aside from
increased memory consumption), since it's
already set up to accept signal frames that aren't
on the stack, as with POSIX alternate signal stacks.
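A minimal userspace sketch of the get_sigframe() change, assuming a downward-growing stack and a hypothetical fixed gap (DSEMUL_GAP, 32 bytes here, an assumed value) reserved below the user SP for the trampoline. sketch_get_sigframe() is an illustrative stand-in, not the kernel's function, and it omits the sigaltstack switch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size of the gap left below the user SP for the FPU
 * emulator's delay-slot trampoline; the value is an assumption. */
#define DSEMUL_GAP 32UL

/* Place a signal frame of frame_size bytes below the interrupted user
 * stack pointer, skipping the reserved gap first so that a trampoline
 * living just below sp is never overwritten by the frame. */
static uintptr_t sketch_get_sigframe(uintptr_t sp, size_t frame_size)
{
    sp -= DSEMUL_GAP;              /* skip over any active trampoline */
    sp -= frame_size;              /* room for the frame itself */
    return sp & ~(uintptr_t)7;     /* round down to 8-byte alignment */
}
```

If no emulation is in progress the gap is simply wasted stack, which is the "increased memory consumption" cost mentioned above.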

There are, however, a few flies in the ointment, as always.
Once we commit that original sin of allowing signals to 
"play through" during the trampoline sequence, we open 
the door to a signal handler containing FP branches itself, 
which would need to be emulated.  Yes, we could create 
another trampoline further up the stack, but the problem is
that the thread-specific data to allow recovery from the first 
trampoline will be overwritten by the second.  Furthermore,
the trampoline mechanism is terminated by catching the
next unaligned access fault for the thread.  A signal handler
could also generate an unaligned access fault.  The former
problem could probably be worked around acceptably by
simply having the FP trap handler notice that it is already
in a delay-slot-emulation sequence, and nail the process
with SIGFPE or SIGILL if it happens to do another one.
Not nice, but at least consistent, and the case is highly 
unlikely to arise.  But we've still got the unaligned access
case to consider.  So I submit the following for your consideration.
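The workaround for nested FP-branch emulation can be sketched as a per-thread flag checked before a new trampoline is built. The struct and function names are hypothetical, and SIGILL is chosen arbitrarily here; SIGFPE would serve equally well.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical per-thread emulator state, not the kernel's thread struct. */
struct sketch_thread {
    bool in_dsemul;      /* a delay-slot trampoline is outstanding */
    int pending_signal;  /* signal to force on the process, 0 if none */
};

/* Called before building a delay-slot trampoline.
 * Returns 0 if emulation may proceed, -1 if the process is to be killed. */
static int sketch_begin_dsemul(struct sketch_thread *t)
{
    if (t->in_dsemul) {
        /* Already mid-sequence: the recovery data for the first
         * trampoline would be overwritten, so fail consistently. */
        t->pending_signal = SIGILL;
        return -1;
    }
    t->in_dsemul = true;
    return 0;
}

/* Called when the trampoline completes and the saved EPC is restored. */
static void sketch_end_dsemul(struct sketch_thread *t)
{
    t->in_dsemul = false;
}
```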
In the FP emulator, when we set up the trampoline, we would
write the following words above the user stack:

            Instruction-to-be-executed
            AdELOAD (unaligned load of r0 off of r0)
            Magic-cookie-that-is-an-unimplemented-instruction
            EPC-to-use-on-completion
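The four-word layout above might be expressed as a struct like the one below, assuming a 32-bit MIPS target with 32-bit instruction words and a 32-bit EPC. SKETCH_ADELOAD encodes `lw $0, 1($0)`, an unaligned load into r0; SKETCH_COOKIE is a hypothetical placeholder, and a real implementation would have to pick an encoding guaranteed to trap.

```c
#include <stdint.h>

#define SKETCH_ADELOAD 0x8c000001u  /* lw $0, 1($0): unaligned load into r0 */
#define SKETCH_COOKIE  0x0000bd36u  /* hypothetical magic-cookie word */

/* One trampoline frame, written just above the user stack. */
struct sketch_emuframe {
    uint32_t emul;     /* instruction-to-be-executed (from the delay slot) */
    uint32_t badinst;  /* AdELOAD: forces an address-error exception */
    uint32_t cookie;   /* marks this as a genuine emulator frame */
    uint32_t epc;      /* EPC to use on completion */
};

/* Fill in a frame for delay-slot instruction insn, resuming at resume_epc. */
static void sketch_fill_emuframe(struct sketch_emuframe *f,
                                 uint32_t insn, uint32_t resume_epc)
{
    f->emul = insn;
    f->badinst = SKETCH_ADELOAD;
    f->cookie = SKETCH_COOKIE;
    f->epc = resume_epc;
}
```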

Rather than use thread.dsemul_epc as a flag to indicate
to the unaligned access handler that emulation is in
progress, the unaligned access handler would test whether
the faulting instruction itself was the AdELOAD, and whether
it is followed by the magic cookie.  If so, it would treat that
as an indication to stuff the value following the sequence
(located via the EPC, modulo branch delays) into the EPC
and return.   This way, one could nest an arbitrary number of
emulation/signal/emulation sequences, and they should unroll
correctly.  The additional magic cookie is needed because the
AdELOAD, useless as it is, could be encountered for other
reasons, such as a misguided attempt at cache prefetch, or
by executing crashme.  It is a pointless instruction to emulate,
having r0 as the destination, but the normal Linux semantics
as I understand them would be to turn it into a very expensive
no-op and continue in what might otherwise be safe and sane
execution.  With the additional code word after the AdELOAD
(and before the EPC) on the dsemul stack, a word that would be
(more or less) guaranteed to really be an illegal instruction, the
only risk we would be taking is that a program really containing
the unaligned nonsense load followed by the illegal instruction
would blow up, not on the illegal instruction, but by picking up
a bogus EPC.  Not perfect.  But maybe the lesser of several evils.
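The unaligned-access check can be sketched against a toy model of user memory. All names are hypothetical, the cookie value is a placeholder, and the sketch assumes the AdELOAD is not itself in a branch delay slot, so the EPC points directly at it (true for the trampoline layout, where it follows the emulated instruction).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ADELOAD 0x8c000001u  /* lw $0, 1($0) */
#define SKETCH_COOKIE  0x0000bd36u  /* hypothetical magic-cookie word */

/* Toy model of user memory: nwords 32-bit words starting at base. */
struct sketch_mem {
    uint32_t base;
    const uint32_t *words;
    size_t nwords;
};

/* Read an aligned word; fails if addr is unaligned or out of range. */
static bool sketch_read(const struct sketch_mem *m, uint32_t addr, uint32_t *out)
{
    if (addr < m->base || (addr & 3u) != 0)
        return false;
    size_t i = (addr - m->base) / 4;
    if (i >= m->nwords)
        return false;
    *out = m->words[i];
    return true;
}

/* On an address-error fault at *epc: if the faulting word is the AdELOAD
 * and the next word is the cookie, load the saved EPC that follows and
 * return true.  Otherwise leave *epc alone and return false, so the
 * normal unaligned-access emulation path runs. */
static bool sketch_dsemul_return(const struct sketch_mem *m, uint32_t *epc)
{
    uint32_t insn, cookie, resume;

    if (!sketch_read(m, *epc, &insn) || insn != SKETCH_ADELOAD)
        return false;
    if (!sketch_read(m, *epc + 4, &cookie) || cookie != SKETCH_COOKIE)
        return false;
    if (!sketch_read(m, *epc + 8, &resume))
        return false;
    *epc = resume;
    return true;
}
```

Because all recovery state lives in the frame itself rather than in the thread structure, two stacked frames (one per nested emulation) each resolve independently from their own EPC, which is what lets nested sequences unroll.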

I'm working on a prototype, but do you think that this scheme 
is viable?

            Regards,

            Kevin K.