* Peter Zijlstra:

> On Sat, Nov 02, 2024 at 10:58:42PM +0100, Florian Weimer wrote:
>
>> QEMU hints towards further problems (in linux-user/syscall.c):
>>
>>     case TARGET_NR_set_robust_list:
>>     case TARGET_NR_get_robust_list:
>>         /* The ABI for supporting robust futexes has userspace pass
>>          * the kernel a pointer to a linked list which is updated by
>>          * userspace after the syscall; the list is walked by the kernel
>>          * when the thread exits. Since the linked list in QEMU guest
>>          * memory isn't a valid linked list for the host and we have
>>          * no way to reliably intercept the thread-death event, we can't
>>          * support these. Silently return ENOSYS so that guest userspace
>>          * falls back to a non-robust futex implementation (which should
>>          * be OK except in the corner case of the guest crashing while
>>          * holding a mutex that is shared with another process via
>>          * shared memory).
>>          */
>>         return -TARGET_ENOSYS;
>
> I don't think we can sanely fix that. Can't QEMU track the robust thing
> itself and use waitpid() to discover the thread is gone and fudge things
> from there?

There are race conditions with munmap, I think, and they probably get a
lot worse if QEMU does that.

See Rich Felker's bug report:

| The corruption is performed by the kernel when it walks the robust
| list. The basic situation is the same as in PR #13690, except that
| here there's actually a potential write to the memory rather than just
| a read.
|
| The sequence of events leading to corruption goes like this:
|
| 1. Thread A unlocks the process-shared, robust mutex and is preempted
| after the mutex is removed from the robust list and atomically
| unlocked, but before it's removed from the list_op_pending field of
| the robust list header.
|
| 2. Thread B locks the mutex, and, knowing by program logic that it's
| the last user of the mutex, unlocks and unmaps it, allocates/maps
| something else that gets assigned the same address as the shared mutex
| mapping, and then exits.
|
| 3. The kernel destroys the process, which involves walking each
| thread's robust list and processing each thread's list_op_pending
| field of the robust list header. Since thread A has a list_op_pending
| pointing at the address previously occupied by the mutex, the kernel
| obliviously "unlocks the mutex" by writing a 0 to the address and
| futex-waking it. However, the kernel has instead overwritten part of
| whatever mapping thread A created. If this is private memory it
| (probably) doesn't matter since the process is ending anyway (but are
| there race conditions where this can be seen?). If this is shared
| memory or a shared file mapping, however, the kernel corrupts it.
|
| I suspect the race is difficult to hit since thread A has to get
| preempted at exactly the wrong time AND thread B has to do a fair
| amount of work without thread A getting scheduled again. So I'm not
| sure how much luck we'd have getting a test case.

<https://sourceware.org/bugzilla/show_bug.cgi?id=14485#c3>

We also have a silent unlocking failure because userspace does not know
about ROBUST_LIST_LIMIT:

  Bug 19089 - Robust mutexes do not take ROBUST_LIST_LIMIT into account
  <https://sourceware.org/bugzilla/show_bug.cgi?id=19089>

(I think we may have discussed this one before, and you may have
suggested to just hard-code 2048 in userspace because the constant is
not expected to change.)

So the in-mutex linked list has quite a few problems even outside of
emulation. 8-(

Thanks,
Florian