Re: [PATCH v4 0/3] m68k: Improved switch stack handling

ebiederm@xxxxxxxxxxxx (Eric W. Biederman) · Fri, 23 Jul 2021 17:31:20 -0500

Michael Schmitz <schmitzmic@xxxxxxxxx> writes:

> Hi Eric,
>
> Am 23.07.2021 um 02:49 schrieb Eric W. Biederman:
>> Michael Schmitz <schmitzmic@xxxxxxxxx> writes:
>>
>>> Hi Eric,
>>>
>>> On 21/07/21 8:32 am, Eric W. Biederman wrote:
>>>>
>>>>> diff --git a/arch/m68k/fpsp040/skeleton.S b/arch/m68k/fpsp040/skeleton.S
>>>>> index a8f4161..6c92d38 100644
>>>>> --- a/arch/m68k/fpsp040/skeleton.S
>>>>> +++ b/arch/m68k/fpsp040/skeleton.S
>>>>> @@ -502,7 +502,17 @@ in_ea:
>>>>>   	.section .fixup,#alloc,#execinstr
>>>>>   	.even
>>>>>   1:
>>>>> +
>>>>> +	SAVE_ALL_INT
>>>>> +	SAVE_SWITCH_STACK
>>>>          ^^^^^^^^^^
>>>>
>>>> I don't think this saves the registers in the well known fixed location
>>>> on the stack because some registers are saved at the exception entry
>>>> point.
>>>
>>> The FPU exception entry points are not using the exception entry code in
>>> head.S. These entry points are stored in the exception vector table directly. No
>>> saving of a syscall stack frame happens there. The FPU places its exception
>>> frame on the stack, and that is what the FPU exception handlers use.
>>>
>>> (If these have to call out to the generic exception handlers again, they will
>>> build a minimal stack frame, see code in skeleton.S.)
>>>
>>> Calling fpsp040_die() is no different from calling a syscall that may need to
>>> have access to the full stack frame. The 'fixed location' is just 'on the stack
>>> before calling  fpsp040_die()', again this is no different from calling
>>> e.g. sys_fork() which does not take a pointer to the begin of the stack frame as
>>> an argument.
>>>
>>> I must admit I never looked at how do_exit() figures out where the stack frame
>>> containing the saved registers is stored, I just assumed it unwinds the stack up
>>> to the point where the caller syscall was made, and works from there. The same
>>> strategy ought to work here.
>>
>> For do_exit the part we need to be careful with is PTRACE_EVENT_EXIT,
>> which means it is ptrace that we need to look at.
>>
>> For m68k the code in put_reg and get_reg finds the registers by looking
>> at task->thread.esp0.
>
> Thanks, that's what I was missing here.
>>
>> I was expecting m68k to use the same technique as alpha which expects a
>> fixed offset from task_stack_page(task).
>>
>> So your code will work if you add code to update task->thread.esp0 which
>> is also known as THREAD_ESP0 in entry.S
>
> Shoving
>
> movel   %sp,%curptr@(TASK_THREAD+THREAD_ESP0)
>
> in between the SAVE_ALL_INT and SAVE_SWITCH_STACK ought to do the
> trick there.
>
>>
>>>> Without being saved at the well known fixed location if some process
>>>> stops in PTRACE_EVENT_EXIT in do_exit we likely get some complete
>>>> gibberish.
>>>>
>>>> That is probably safe.
>>>>
>>>>>   	jbra	fpsp040_die
>>>>> +	addql   #8,%sp
>>>>> +	addql   #8,%sp
>>>>> +	addql   #8,%sp
>>>>> +	addql   #8,%sp
>>>>> +	addql   #8,%sp
>>>>> +	addql   #4,%sp
>>>>> +	rts
>>>> Especially as everything after jumping to fpsp040_die does not execute.
>>>
>>> Unless we change fpsp040_die() to call force_sig(SIGSEGV).
>>
>> Yes.  I think we would probably need to have it also call get_signal and
>> all of that, because I don't think the very light call path for that
>> exception includes testing if signals are pending.
>
> As far as I can see, there is a test for pending signals:
>
> ENTRY(ret_from_exception)
> .Lret_from_exception:
>         btst    #5,%sp@(PT_OFF_SR)      | check if returning to kernel
>         bnes    1f                      | if so, skip resched, signals
>         | only allow interrupts when we are really the last one on the
>         | kernel stack, otherwise stack overflow can occur during
>         | heavy interrupt load
>         andw    #ALLOWINT,%sr
>
> resume_userspace:
>         movel   %curptr@(TASK_STACK),%a1
>         moveb   %a1@(TINFO_FLAGS+3),%d0	| bits 0-7 of TINFO_FLAGS
>         jne     exit_work		| any bit set? -> exit_work
> 1:      RESTORE_ALL
>
> exit_work:
>         | save top of frame
>         movel   %sp,%curptr@(TASK_THREAD+THREAD_ESP0)
>         lslb    #1,%d0			| shift out TIF_NEED_RESCHED
>         jne     do_signal_return	| any remaining bit
> (signal/notify_resume)? -> do_signal_return
>         pea     resume_userspace
>         jra     schedule
>
> As long as TIF_NOTIFY_SIGNAL or TIF_SIGPENDING are set,
> do_signal_return will be called.

I was going to say I don't think so, as my tracing of
the code lead in a couple of different directions.  Upon closer
inspection all those paths either lead to fpsp_done or more
directly to ret_from_exception.

For anyone else who might want to trace the code, or for myself later on
when I forget.  As best as I can figure the hardware exception vector
table is setup in: arch/m68k/kernel/vector.c

For the vectors in question it appears to be this chunk of code:

	if (CPU_IS_040 && !FPU_IS_EMU) {
		/* set up FPSP entry points */
		asmlinkage void dz_vec(void) asm ("dz");
		asmlinkage void inex_vec(void) asm ("inex");
		asmlinkage void ovfl_vec(void) asm ("ovfl");
		asmlinkage void unfl_vec(void) asm ("unfl");
		asmlinkage void snan_vec(void) asm ("snan");
		asmlinkage void operr_vec(void) asm ("operr");
		asmlinkage void bsun_vec(void) asm ("bsun");
		asmlinkage void fline_vec(void) asm ("fline");
		asmlinkage void unsupp_vec(void) asm ("unsupp");

		vectors[VEC_FPDIVZ] = dz_vec;
		vectors[VEC_FPIR] = inex_vec;
		vectors[VEC_FPOVER] = ovfl_vec;
		vectors[VEC_FPUNDER] = unfl_vec;
		vectors[VEC_FPNAN] = snan_vec;
		vectors[VEC_FPOE] = operr_vec;
		vectors[VEC_FPBRUC] = bsun_vec;
		vectors[VEC_LINE11] = fline_vec;
		vectors[VEC_FPUNSUP] = unsupp_vec;
	}


Which leads me to call traces that look like this:

hw
  fline
    fpsp_fline
       mem_read
          user_read
             copyin
               in_ea
                  <page-fault>
                     fpsp040_die

If that mem_read returns it can be followed by
       not_mvcr
          real_fline
            ret_from_exception

Or it can be followed by
       fix_con
          uni_2
             gen_except
                 do_clean
                    finish_up
                       fpsp_done
                          ret_from_exception
                          


>> The way the code is structured it is actively incorrect to return from
>> fpsp040_die, as the code does not know what to do if it reads a byte
>> from userspace and there is nothing there.
>
> Correct - my hope is that upon return from the FPU exception (that
> continued after a dodgy read or write), we get the signal delivered
> and will die then.

Yes.  That does look like a good strategy.

I am wondering if there are values we can return that will make the
path out of the exit routine more deterministic.

I have played with that a little bit today, but it doesn't look like I
am going to have time to put together any kind of real patch today.

Simply modifying fpsp040_die to call force_sigsegv(SIGSEGV) should be
enough to trigger a signal (no call stack work needed if we remove
do_exit).  The tricky bit is what value do we want to fake when we
can not read anything from userspace.   For a write fault we should just
be able to skip the write entirely.

In both cases we probably should break out of the loop prematurely.  But
I don't know if that is necessary.


The lazy strategy would be to copy the ifpsp060 code and simply oops
the kernel if the read or write of userspace gets a page fault.

>> So instead of handling -EFAULT like most pieces of kernel code the code
>> just immediately calls do_exit, and does not even attempt to handle
>> the error.
>>
>> That is not my favorite strategy at all, but I suspect it isn't worth
>> it, or safe to update the skeleton.S to handle errors.  Especially as we
>> have not even figured out how to test that code yet.
>
> That's bothering me more than a little, but I need to find out whether
> the emulator even handles FPU exceptions correctly ...


As a fallback plan we can following the lead of ifpsp060/os.S and simply
not catch the kernel triggered page fault, and let
arch/m68k/mm/fault.c:send_fault_sig() return a kernel oops.  It is not
ideal as it allows userspace to trigger a kernel oops, but it does at
least keep the kernel in a consistent state.

diff --git a/arch/m68k/fpsp040/skeleton.S b/arch/m68k/fpsp040/skeleton.S
index a8f41615d94a..4c6c4b07ef38 100644
--- a/arch/m68k/fpsp040/skeleton.S
+++ b/arch/m68k/fpsp040/skeleton.S
@@ -479,7 +479,6 @@ copyout:
 |	movec	%d1,%DFC		| set dfc for user data space
 moreout:
 	moveb	(%a0)+,%d1	| fetch supervisor byte
-out_ea:
 	movesb	%d1,(%a1)+	| write user byte
 	dbf	%d0,moreout
 	rts
@@ -493,21 +492,9 @@ copyin:
 |	SFC is already set
 |	movec	%d1,%SFC		| set sfc for user space
 morein:
-in_ea:
 	movesb	(%a0)+,%d1	| fetch user byte
 	moveb	%d1,(%a1)+	| write supervisor byte
 	dbf	%d0,morein
 	rts
 
-	.section .fixup,#alloc,#execinstr
-	.even
-1:
-	jbra	fpsp040_die
-
-	.section __ex_table,#alloc
-	.align	4
-
-	.long	in_ea,1b
-	.long	out_ea,1b
-
 	|end
diff --git a/arch/m68k/kernel/traps.c b/arch/m68k/kernel/traps.c
index e2a6f3556211..3ec6ae1bdaf9 100644
--- a/arch/m68k/kernel/traps.c
+++ b/arch/m68k/kernel/traps.c
@@ -1144,15 +1144,6 @@ asmlinkage void set_esp0(unsigned long ssp)
 	current->thread.esp0 = ssp;
 }
 
-/*
- * This function is called if an error occur while accessing
- * user-space from the fpsp040 code.
- */
-asmlinkage void fpsp040_die(void)
-{
-	do_exit(SIGSEGV);
-}
-
 #ifdef CONFIG_M68KFPU_EMU
 asmlinkage void fpemu_signal(int signal, int code, void *addr)
 {

Eric