Re: futex wait failure

> > I tested the patch and the testcase in
> > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=561203
> > still segfaults.
> 
> I think the expect/tcl bug and the bug 561203 are related.  Looking
> at the minifail core dump, I see:
> 
> Core was generated by `./minifail'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x00000000 in ?? ()
> 
> So, how did we get to 0?  $rp is 0, so we might have executed a
> return to this location.  $r31 contains 0x4157cc4f.
> 
> (gdb) disass 0x4157cc3c 0x4157cc5c
> Dump of assembler code from 0x4157cc3c to 0x4157cc5c:
> 0x4157cc3c <_IO_puts+332>:      copy rp,r25
> 0x4157cc40 <_IO_puts+336>:      copy r6,r24
> 0x4157cc44 <_IO_puts+340>:      be,l b0(sr2,r0),sr0,r31
> 0x4157cc48 <_IO_puts+344>:      ldi 0,r20
> 0x4157cc4c <_IO_puts+348>:      ldi -b,r24
> 0x4157cc50 <_IO_puts+352>:      cmpb,=,n r24,r21,0x4157cc38 <_IO_puts+328>
> 0x4157cc54 <_IO_puts+356>:      nop
> 0x4157cc58 <_IO_puts+360>:      ldi -2d,r25


I think I have an idea of what could have happened, and why it crashes in the child process most of the time (but not always)...

In ports/sysdeps/unix/sysv/linux/hppa/bits/atomic.h we have:
#define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
  ({                                                                    \
     volatile int lws_errno;                                            \
     volatile int lws_ret;                                              \
     asm volatile(                                                      \
...some assembly...
        "stw    %%r28, %0                       \n\t"                   \
        "sub    %%r0, %%r21, %%r21              \n\t"                   \
        "stw    %%r21, %1                       \n\t"                   \
        : "=m" (lws_ret), "=m" (lws_errno)                              \
        : "r" (mem), "r" (oldval), "r" (newval)                         \
        : _LWS_CLOBBER           

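This means that lws_errno and lws_ret are located on the stack: they are volatile locals bound to "=m" (memory) output operands, so gcc has to give them stack slots.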

With gdb I see this expanded to:
0x40705494 <start_thread+1204>: stw ret0,-1b8(sp)
0x40705498 <start_thread+1208>: sub r0,r21,r21
0x4070549c <start_thread+1212>: stw r21,-1b4(sp)

So, lws_ret/lws_errno are at -1b8/-1b4(sp).

And this LWS code is called from 
../nptl/sysdeps/pthread/createthread.c:
static int create_thread (struct pthread *pd, const struct pthread_attr *attr, STACK_VARIABLES_PARMS)
...
          int res = do_clone (pd, attr, clone_flags, start_thread,
                              STACK_VARIABLES_ARGS, 1);
          if (res == 0)
            {
...(line 216):
              /* Enqueue the descriptor.  */
              do
                pd->nextevent = __nptl_last_event;
              while (atomic_compare_and_exchange_bool_acq(&__nptl_last_event, pd, pd->nextevent) != 0);
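
To make the retry loop above easier to follow, here is a minimal sketch of the same enqueue pattern using the gcc builtin directly. The names struct node, last_event, enqueue() and list_length() are hypothetical stand-ins for struct pthread, __nptl_last_event and the createthread.c code; this is not the glibc implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pthread; only the link field matters. */
struct node {
    struct node *nextevent;
    int id;
};

/* Hypothetical stand-in for __nptl_last_event (head of the list). */
static struct node *last_event = NULL;

/* Push nd onto the head of the list.  If another thread changed the
   head between our read and the compare-and-swap, the swap fails and
   we retry with the fresh head value. */
static void enqueue(struct node *nd)
{
    assert(nd != NULL);
    do
        nd->nextevent = last_event;
    while (!__sync_bool_compare_and_swap(&last_event, nd->nextevent, nd));
}

/* Count the nodes reachable from the head. */
static int list_length(void)
{
    int n = 0;
    for (struct node *p = last_event; p != NULL; p = p->nextevent)
        n++;
    return n;
}
```

Note that the loop only touches the list head and the new node; any scratch storage comes from whatever the compare-and-swap primitive itself uses, which is where the stack slots discussed here enter the picture.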


And here is what could have happened:
a) do_clone() creates the child process.
b) the child process gets a new stack
c) the child calls atomic_compare_and_exchange_bool_acq() and thus the LWS code above.
d) the LWS code writes to the stack location at -1b8(sp), which is out of bounds for the child process (the child's stack initially has only about 0x40 bytes of room)
e) thus the child either crashes, overwrites memory of the parent, or misbehaves in some other way.

Additionally:
Because the LWS assembly code leaves us with few free registers, gcc used %rp as a temporary register, which may have misled our thinking:

0x40705458 <start_thread+1144>: ldi 0,rp
0x4070545c <start_thread+1148>: ldi fb,r3
0x40705460 <start_thread+1152>: ldw -70(sp),ret0
0x40705464 <start_thread+1156>: ldw 214(ret0),ret1
0x40705468 <start_thread+1160>: copy r5,r26
0x4070546c <start_thread+1164>: copy ret1,r25
0x40705470 <start_thread+1168>: copy rp,r24
0x40705474 <start_thread+1172>: be,l b0(sr2,r0),sr0,r31
0x40705478 <start_thread+1176>: ldi 0,r20
0x4070547c <start_thread+1180>: ldi -b,r24
0x40705480 <start_thread+1184>: cmpb,=,n r24,r21,0x40705468 <start_thread+1160>
0x40705484 <start_thread+1188>: nop
0x40705488 <start_thread+1192>: ldi -2d,r25
0x4070548c <start_thread+1196>: cmpb,=,n r25,r21,0x40705468 <start_thread+1160>
0x40705490 <start_thread+1200>: nop
0x40705494 <start_thread+1204>: stw ret0,-1b8(sp)
0x40705498 <start_thread+1208>: sub r0,r21,r21
0x4070549c <start_thread+1212>: stw r21,-1b4(sp)
0x407054a0 <start_thread+1216>: ldw -1b4(sp),ret0


If my assumptions are correct, then we could either

a) use the gcc atomic builtins instead of our own atomic code in libc6,
e.g. add to ports/sysdeps/unix/sysv/linux/hppa/bits/atomic.h:
...
#if __GNUC_PREREQ (4, 1)
# define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
  __sync_val_compare_and_swap (mem, oldval, newval)
# define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
  (! __sync_bool_compare_and_swap (mem, oldval, newval))

#elif __ASSUME_LWS_CAS
....

b) change the assembly in atomic_compare_and_exchange_val_acq()
so that it does not put its local variables (lws_errno and lws_ret) on the stack.

I'm currently testing option a).
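
For reference, a small sketch of the return-value conventions that option a) relies on. The macros carry a hypothetical demo_ prefix so they don't clash with the real glibc names, and cas_demo() is only an illustration, not part of the proposed patch. The point is that the _val_ form returns the previous contents of *mem, while the _bool_ form returns 0 on success, which is why the builtin's result is negated.

```c
#include <assert.h>

/* Same shape as the option a) macros, renamed (demo_ prefix is an
   assumption for this sketch) to avoid clashing with glibc. */
#define demo_cas_val_acq(mem, newval, oldval) \
  __sync_val_compare_and_swap (mem, oldval, newval)
#define demo_cas_bool_acq(mem, newval, oldval) \
  (! __sync_bool_compare_and_swap (mem, oldval, newval))

/* Walk through the four cases and return the final value of the word. */
static int cas_demo(void)
{
    int word = 5;

    /* oldval matches: 7 is stored, previous value 5 is returned. */
    int prev = demo_cas_val_acq(&word, 7, 5);
    assert(prev == 5 && word == 7);

    /* oldval does not match: word is untouched, current value returned. */
    prev = demo_cas_val_acq(&word, 9, 5);
    assert(prev == 7 && word == 7);

    /* bool form: 0 means the exchange happened. */
    int failed = demo_cas_bool_acq(&word, 9, 7);
    assert(failed == 0 && word == 9);

    /* bool form: nonzero means *mem did not equal oldval. */
    failed = demo_cas_bool_acq(&word, 1, 7);
    assert(failed != 0 && word == 9);

    return word;
}
```

Note the argument order: the glibc-style macros take (mem, newval, oldval), while the gcc builtins take (mem, oldval, newval).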

Helge
(PS: I used a webmailer, so the indenting might be strange...)