Re: soft and hard float n32 and n64 binaries getting illegal instructions on Cobalt Qube2

Florian Fainelli <f.fainelli@xxxxxxxxx> · Tue, 8 Aug 2023 12:43:10 -0700

On 8/8/23 04:07, Maciej W. Rozycki wrote:
On Tue, 8 Aug 2023, Maciej W. Rozycki wrote:

Looks like GDB is not too happy with the core file I obtained, see below. Is
this a known issue with gdb-12.1? Unfortunately, gdb-11.x did not fare any
better. The core dump is attached (hope it makes it to the mailing
list).

  Sigh.  Back in 2017 I fixed numerous issues with MIPS core file handling
in GDB, which was then well-tested, and I haven't looked at it since.  So
it must be a regression, either in GDB or in the producer (Linux kernel).

  I can reproduce it and I'll see if I can debug this.  I may ask you for
the corresponding binary and shared libraries at one point.

  NB I note that the core file is ELF32 and does not have the EF_MIPS_ABI2
flag set in its file header, so it looks like an o32 core file to me.
Since your subject mentions n32/n64, can you please check what kind of
binary your `iperf3' is (e.g. `file iperf3', `readelf -h iperf3', etc.)?

file target/usr/bin/iperf3
target/usr/bin/iperf3: ELF 32-bit LSB pie executable, MIPS, N32 MIPS64 version 1 (SYSV), dynamically linked, interpreter /lib/ld-musl-mipsn32el.so.1, stripped
readelf -h target/usr/bin/iperf3
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Position-Independent Executable file)
  Machine:                           MIPS R3000
  Version:                           0x1
  Entry point address:               0xa90
  Start of program headers:          52 (bytes into file)
  Start of section headers:          66084 (bytes into file)
  Flags:                             0x60000027, noreorder, pic, cpic, abi2, mips64
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         10
  Size of section headers:           40 (bytes)
  Number of section headers:         24
  Section header string table index: 23

  I tweaked the core file and set the EF_MIPS_ABI2 flag by hand.  This has
let GDB proceed:

[...]
warning: File "/usr/lib/debug/.build-id" has no build-id, file skipped
Core was generated by `iperf3 -c 192.168.254.3'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0x77eb3190 in ?? (
[...]

but this is a partial core file only and I have no corresponding binaries,
so I can't tell more at this stage, i.e. what code there is at 0x77eb3190
(the PC address is suspiciously high in the 32-bit address space BTW, is
that a PIE run with address space randomisation?).

Yes, that is the case: PIC/PIE is enabled, as are -fstack-protector-strong
and full RELRO. You can find the kernel, root filesystem, GCC -dumpspecs
output and core dump here:

https://github.com/ffainelli/cobalt-crash/

  Of course GDB shouldn't have hit an internal error with a broken core
file anyway, it should complain and handle the situation gracefully.  But
in any case it's an issue with the Linux kernel producing a core file in
the wrong format.

  Is this the most recent Linux kernel version you have obtained this with?
There used to be a bug (regression) in Linux with n32 core files, but I
fixed that in 2017 as well, with commit 547da673173d ("MIPS: Fix an n32 core
file generation regset support regression").  So if it has stopped working
again, then it's a new regression.  Would you be able to bisect it?

It looks like this core dump was collected with Linux 4.14, which did not
include your commit. I tried again with the tip of Linus' tree
(v6.5-rc5-50-g02aee814d37c) and this time GDB had no problem loading the
core dump. At least that part seems to check out.

Now GDB tells me the faulting instruction is the following:

#0  0x77dcf190 in _dlstart_data () from target/lib/ld-musl-mipsn32el.so.1
#1  0x77e28bfc in ?? () from target/lib/ld-musl-mipsn32el.so.1
(gdb) display/i $pc
2: x/i $pc
=> 0x77dcf190 <_dlstart_data+17976>:    dclz    v1,a0
(gdb) info reg
                  zero               at               v0               v1
 R0   0000000000000000 ffffffff9400ece0 0000000014000000 0000000004000004
                    a0               a1               a2               a3
 R4   0000000014000000 0000000004000004 0001000000000000 0000000004000004
                    a4               a5               a6               a7
 R8   0000000000000000 0000000000000000 000000007fc45b78 0000000000000000
                    t0               t1               t2               t3
 R12  0000000000000000 0000000000000000 0000005000000000 0000000000000000
                    s0               s1               s2               s3
 R16  0000000000000002 000000007fc47b90 401b400000000000 0000000000000004
                    s4               s5               s6               s7
 R20  0000000000000000 0000000077e7a781 0000000000000066 0000000000000000
                    t8               t9               k0               k1
 R24  0000000000000000 0000000077dcf180 0000000000000000 0000000000000000
                    gp               sp               s8               ra
 R28  0000000077ea0290 000000007fc45b10 000000007fc45b78 0000000077e28bfc
                    sr               lo               hi              bad
      ffffffffa400ecf3 0000000000000000 0000000000000000 000000007fc45b58
                 cause               pc
      0000000000800028 0000000077dcf190
                   fsr              fir
              00800004         000028a0
(gdb) display/i 0x14000000
2: x/i 0x14000000
   0x14000000:  <error: Cannot access memory at address 0x14000000>

curious...
--
Florian