Re: Reproducer for the posix_spawn() bug on sparc64

Michael Karcher <kernel@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> · Mon, 12 Feb 2024 19:14:46 +0100

On 12.02.2024 at 18:31, Adhemerval Zanella Netto wrote:
On 12/02/24 13:32, John Paul Adrian Glaubitz wrote:
Hi Adhemerval,

On Mon, 2024-02-12 at 11:01 -0300, Adhemerval Zanella Netto wrote:
It fails on the two different sparc64 machines I usually use for glibc testing as well:

azanella@catbus ~ $ /lib64/libc.so.6 | head -n 1
GNU C Library (Gentoo 2.38-r9 (patchset 9)) stable release version 2.38.
azanella@catbus ~ $ uname -a
Linux catbus.sparc.dev.gentoo.org 6.1.72 #1 SMP Fri Jan 12 15:00:51 PST 2024 sparc64 sun4v UltraSparc T5 (Niagara5) GNU/Linux
azanella@catbus ~ $ ./more_clone_attack
effective FP in clone() with waste 0 = 7feffee09f0
this is 318 64-bit words above the next page boundary
clone: Bad address
Problem detected at 1 pages distance

azanella@ravirin:~$ /lib/sparc64-linux-gnu/libc.so.6 | head -n 1
GNU C Library (Debian GLIBC 2.37-15) stable release version 2.37.
azanella@ravirin:~$ uname -a
Linux ravirin 4.19.0-5-sparc64 #1 Debian 4.19.37-6 (2019-07-18) sparc64 GNU/Linux
azanella@ravirin:~$ ./more_clone_attack
effective FP in clone() with waste 0 = 7feffa3ae50
this is 458 64-bit words above the next page boundary
clone: Bad address
Problem detected at 1 pages distance

And I see similar failures on qemu as well.
Thanks for the confirmation. I was also able to reproduce it even on Debian Wheezy
with kernel 3.2.0 and glibc 2.13, so it seems the bug is very old.

Do you think it's a kernel or glibc bug?

Adrian

I am not sure. I was leaning towards some recent change in clone, but since you
see it on a version as old as 2.13, I don't think it is related to the glibc clone implementation.

It really makes me believe it is something related to the kernel, because I could not trigger
the issue when running the regression program under strace, nor when adding a printf just before
the clone call.

I did some root cause analysis. I *know* that the issue happens when %sp points into
uncommitted memory on the stack when the system call is invoked. If you add a printf
after the variable-length array has been reserved on the stack, you cause the target
stack page to be faulted in, so %sp is no longer hovering over uncommitted memory.
That's where the +/-22 comes from: I aim to get %sp in call_clone (that is, %fp in
clone) aligned to a page boundary. clone then reserves 24 64-bit words on the stack
(without touching them). If the page boundary that %fp sits on is the bottom of
the lowest-address committed stack page, %sp will get into (yet) uncommitted memory.
Wasting 24 words less makes %sp land at the bottom of the last committed page, so the
issue does not appear. Wasting at least 24 extra words causes the 7th argument to
clone to land on the not-yet-committed page, which generates a page fault that commits
this page before clone is invoked.
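
The page arithmetic above can be sketched in plain C. This is only an
illustration, not the reproducer itself: the helper names are made up, the
8 KiB page size is an assumption (it is what Linux uses on sparc64), and all
addresses are "effective" ones, i.e. with the stack bias already added.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 8 KiB pages, as used by Linux on sparc64. */
#define PAGE_SIZE_BYTES 8192u
/* The 24 64-bit words clone reserves below %fp, per the analysis above. */
#define CLONE_FRAME_WORDS 24u

/* Hypothetical helper: how many 64-bit words addr lies above the
 * page boundary directly below it. */
uint64_t words_above_page_boundary(uint64_t addr, uint64_t page_size)
{
    /* The mask trick only works for power-of-two page sizes. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (addr & (page_size - 1)) / 8;
}

/* Hypothetical helper: returns 1 if a frame of frame_words 64-bit words
 * reserved below fp starts on a lower page than the one fp points into. */
int sp_lands_on_lower_page(uint64_t fp, uint64_t frame_words,
                           uint64_t page_size)
{
    uint64_t sp = fp - frame_words * 8;
    return (sp / page_size) < (fp / page_size);
}
```

For the FP value 7feffee09f0 from the catbus log, the first helper yields 318,
which matches the log line. The second helper returns 1 for any FP less than
24 words above a page boundary and 0 from 24 words upwards, which is the
window the reproducer aims for.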

Now that's the point where the guesswork starts: the kernel entry for clone, vfork and
fork issues "flushw" to flush the register windows to the stack. In the problematic
situation, this will hit address space without a committed page behind it. If I understand
the save trap handler in the kernel correctly, it detects that it is called from
kernel-space, and that the saving happens to user-space memory. In that case, the kernel
*disables* MMU fault traps, tries the saving, and then checks whether some writes got
dropped due to a fault by checking an MMU status flag. If so, the kernel saves
the registers into some backup location, because the kernel requires that saving the
user-space registers to the stack works, even if the user-mode stack is "bolixed".
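
To make the guessed flow concrete, here is a toy model in portable C. It is
not kernel code and does not mirror the real trap handler: every name in it
is invented, a "page" is just a struct with a committed flag, and the
fault_seen field merely stands in for the MMU status flag mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One SPARC register window spills 16 registers (8 locals + 8 ins). */
#define WINDOW_WORDS 16

/* Toy stand-in for the user stack page the window should go to. */
struct toy_page {
    bool committed;                 /* is there memory behind the page? */
    uint64_t words[WINDOW_WORDS];   /* only meaningful when committed */
};

/* Toy stand-in for the kernel-side state after the spill attempt. */
struct toy_spill_state {
    bool fault_seen;                /* models the sticky MMU status flag */
    bool used_backup;               /* window ended up in the backup area */
    uint64_t backup[WINDOW_WORDS];  /* models the backup location */
};

/* Try to store one window with faults suppressed; fall back to the
 * backup area if any write was dropped. */
void toy_spill_window(struct toy_page *user_stack,
                      const uint64_t window[WINDOW_WORDS],
                      struct toy_spill_state *st)
{
    st->fault_seen = false;
    st->used_backup = false;
    for (size_t i = 0; i < WINDOW_WORDS; i++) {
        if (user_stack->committed)
            user_stack->words[i] = window[i];
        else
            st->fault_seen = true;  /* write dropped, only the flag records it */
    }
    if (st->fault_seen) {
        memcpy(st->backup, window, sizeof st->backup);
        st->used_backup = true;
    }
}
```

The point of the model is only that, with an uncommitted target page, the
window contents end up somewhere other than the user stack, without the
page ever being faulted in.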

Now, clone clones (pun intended) the frame of the caller into the stack of the new
"thread" (let's not argue whether the thing clone creates is a "lightweight process",
a "thread", an "execution flow of some undetermined kind" or whatever), which is the
area between %sp of the caller frame and %fp of the caller frame. I guess the call
goes haywire at the point when %sp points to the backup location, but %fp points to
the user-mode stack (or possibly some different backup location), and "the area
between %sp and %fp" is no longer a well-defined memory range.
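
A small C sketch of why such a pair no longer describes a frame. Again this
is purely illustrative: the function name, the size limit and the
kernel-looking address in the usage note are all made up, and I am not
claiming the kernel performs a check like this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical upper bound for a plausible caller frame. */
#define TOY_MAX_FRAME_BYTES (64u * 1024u)

/* Returns true and stores the frame length if [sp, fp) looks like a
 * sane downward-growing stack frame; returns false otherwise. */
bool toy_frame_range_ok(uint64_t sp, uint64_t fp, uint64_t *len_out)
{
    if (fp <= sp)
        return false;           /* %fp must lie above %sp */
    uint64_t len = fp - sp;
    if (len > TOY_MAX_FRAME_BYTES)
        return false;           /* implausibly large "frame" */
    *len_out = len;
    return true;
}
```

With sp = 7feffee0930 and fp = 7feffee09f0, both on the user stack, this
reports a 192-byte frame. With sp replaced by an invented high address such
as fffff80012340000 and fp left on the user stack, fp is below sp and there
is no range to copy.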

Let me know if you want file names / line numbers in the kernel source to back up
the facts and guesswork I wrote.

Kind regards,
  Michael Karcher