Re: [PATCH v2 0/3] futex: Create set_robust_list2

Florian Weimer <fweimer@xxxxxxxxxx> · Mon, 04 Nov 2024 13:36:43 +0100

* Peter Zijlstra:

> On Sat, Nov 02, 2024 at 10:58:42PM +0100, Florian Weimer wrote:
>
>> QEMU hints towards further problems (in linux-user/syscall.c):
>> 
>>     case TARGET_NR_set_robust_list:
>>     case TARGET_NR_get_robust_list:
>>         /* The ABI for supporting robust futexes has userspace pass
>>          * the kernel a pointer to a linked list which is updated by
>>          * userspace after the syscall; the list is walked by the kernel
>>          * when the thread exits. Since the linked list in QEMU guest
>>          * memory isn't a valid linked list for the host and we have
>>          * no way to reliably intercept the thread-death event, we can't
>>          * support these. Silently return ENOSYS so that guest userspace
>>          * falls back to a non-robust futex implementation (which should
>>          * be OK except in the corner case of the guest crashing while
>>          * holding a mutex that is shared with another process via
>>          * shared memory).
>>          */
>>         return -TARGET_ENOSYS;
>
> I don't think we can sanely fix that. Can't QEMU track the robust thing
> itself and use waitpid() to discover the thread is gone and fudge things
> from there?

There are race conditions with munmap, I think, and they probably get a
lot worse if QEMU does that.

See Rich Felker's bug report:

| The corruption is performed by the kernel when it walks the robust
| list. The basic situation is the same as in PR #13690, except that
| here there's actually a potential write to the memory rather than just
| a read.
| 
| The sequence of events leading to corruption goes like this:
| 
| 1. Thread A unlocks the process-shared, robust mutex and is preempted
|    after the mutex is removed from the robust list and atomically
|    unlocked, but before it's removed from the list_op_pending field of
|    the robust list header.
| 
| 2. Thread B locks the mutex, and, knowing by program logic that it's
|    the last user of the mutex, unlocks and unmaps it, allocates/maps
|    something else that gets assigned the same address as the shared mutex
|    mapping, and then exits.
| 
| 3. The kernel destroys the process, which involves walking each
|   thread's robust list and processing each thread's list_op_pending
|   field of the robust list header. Since thread A has a list_op_pending
|   pointing at the address previously occupied by the mutex, the kernel
|   obliviously "unlocks the mutex" by writing a 0 to the address and
|   futex-waking it. However, the kernel has instead overwritten part of
|   whatever mapping thread A created. If this is private memory it
|   (probably) doesn't matter since the process is ending anyway (but are
|   there race conditions where this can be seen?). If this is shared
|   memory or a shared file mapping, however, the kernel corrupts it.
| 
| I suspect the race is difficult to hit since thread A has to get
| preempted at exactly the wrong time AND thread B has to do a fair
| amount of work without thread A getting scheduled again. So I'm not
| sure how much luck we'd have getting a test case.

<https://sourceware.org/bugzilla/show_bug.cgi?id=14485#c3>
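
To make the window in step 1 concrete, here is a toy single-threaded
replay of the three steps. It is only a sketch: the toy_* types and
functions are made up for illustration and are not the kernel or glibc
data structures, futex_offset is left out, the "remapping" is modelled
by overwriting the same bytes, and the exit-time write of 0 follows the
wording of the report above rather than the kernel's actual checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for a robust mutex (hypothetical layout). */
struct toy_mutex {
    struct toy_mutex *next;   /* link in the per-thread robust list */
    uint32_t lock;            /* futex word: owner TID, 0 = unlocked */
};

/* Simplified stand-in for the per-thread robust list header. */
struct toy_robust_head {
    struct toy_mutex *list;            /* mutexes currently held */
    struct toy_mutex *list_op_pending; /* lock/unlock in progress */
};

/* One "mapping" whose bytes are first a mutex, then unrelated data. */
union toy_mapping {
    struct toy_mutex mutex;
    unsigned char bytes[sizeof(struct toy_mutex)];
};

/* Step 1: thread A's unlock, up to the point where it is preempted. */
void toy_unlock_until_preempted(struct toy_robust_head *h,
                                struct toy_mutex *m)
{
    assert(h->list == m);     /* toy assumption: m is the list head */
    h->list_op_pending = m;   /* announce the operation */
    h->list = m->next;        /* remove from the robust list */
    m->lock = 0;              /* unlock (atomic in real code) */
    /* list_op_pending is still set here: this is the window. */
}

/* The remainder of the unlock that thread A never gets to run. */
void toy_unlock_finish(struct toy_robust_head *h)
{
    h->list_op_pending = NULL;
}

/* Step 3: exit-time handling of list_op_pending, as the report
 * describes it. The write lands in whatever lives at that address. */
void toy_exit_walk(struct toy_robust_head *h)
{
    if (h->list_op_pending != NULL)
        h->list_op_pending->lock = 0;
}

/* Replays steps 1-3. Returns 1 if the reused mapping was modified by
 * the exit-time walk, 0 if it was left alone. */
int toy_replay(int preempt_in_window)
{
    union toy_mapping map;
    unsigned char before[sizeof map];
    struct toy_robust_head head_a = { NULL, NULL };

    /* Setup: thread A (hypothetical TID 1234) holds the mutex. */
    map.mutex.next = NULL;
    map.mutex.lock = 1234;
    head_a.list = &map.mutex;

    toy_unlock_until_preempted(&head_a, &map.mutex);
    if (!preempt_in_window)
        toy_unlock_finish(&head_a);

    /* Step 2: thread B unmaps; new data reuses the same address. */
    memset(map.bytes, 0xAB, sizeof map.bytes);
    memcpy(before, map.bytes, sizeof before);

    toy_exit_walk(&head_a);

    return memcmp(before, map.bytes, sizeof before) != 0;
}
```

In this model toy_replay(1) reports that the new contents were
overwritten, while toy_replay(0), where the unlock runs to completion
before the address is reused, leaves them untouched.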

We also have a silent unlocking failure because userspace does not know
about ROBUST_LIST_LIMIT:

  Bug 19089 - Robust mutexes do not take ROBUST_LIST_LIMIT into account
  <https://sourceware.org/bugzilla/show_bug.cgi?id=19089>

(I think we may have discussed this one before, and you may have
suggested just hard-coding 2048 in userspace because the constant is
not expected to change.)
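
A rough model of what the limit means for userspace, assuming the
exit-time walk simply stops after a fixed number of entries. All toy_*
names are hypothetical, the value 2048 is hard-coded as mentioned
above, and toy_can_enqueue() is just one possible shape for a
userspace-side check, not what glibc does today.

```c
#include <assert.h>
#include <stddef.h>

/* Hard-coded copy of the limit, as discussed above (assumption: the
 * kernel constant does not change). */
#define TOY_ROBUST_LIST_LIMIT 2048

struct toy_entry {
    struct toy_entry *next;
    int recovered;   /* set once the exit-time walk handled the entry */
};

/* Model of a bounded exit-time walk: handle at most `limit` entries
 * and return how many were handled. */
size_t toy_bounded_walk(struct toy_entry *head, size_t limit)
{
    size_t handled = 0;

    for (struct toy_entry *e = head; e != NULL && handled < limit;
         e = e->next) {
        e->recovered = 1;
        handled++;
    }
    return handled;
}

/* Hypothetical userspace guard: only enqueue while the walk could
 * still reach the new entry. `held` is the current list length. */
int toy_can_enqueue(size_t held)
{
    return held < TOY_ROBUST_LIST_LIMIT;
}

/* Links the n entries of `pool` into one list, runs the bounded walk,
 * and returns how many entries were never reached. */
size_t toy_count_unrecovered(struct toy_entry *pool, size_t n)
{
    size_t missed = 0;

    assert(pool != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        pool[i].next = (i + 1 < n) ? &pool[i + 1] : NULL;
        pool[i].recovered = 0;
    }
    toy_bounded_walk(n ? &pool[0] : NULL, TOY_ROBUST_LIST_LIMIT);
    for (size_t i = 0; i < n; i++)
        if (!pool[i].recovered)
            missed++;
    return missed;
}
```

With 3000 entries on the list, the model leaves 952 of them
unprocessed, and nothing tells the thread that held them. A check
along the lines of toy_can_enqueue() would at least turn that into a
visible failure at lock time.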

So the in-mutex linked list has quite a few problems even outside of
emulation. 8-(

Thanks,
Florian