On Wed, Oct 09 2024 at 09:36, Juri Lelli wrote:
> On 08/10/24 12:59, André Almeida wrote:
>> > > There's this work from Thomas that aims to solve corner cases like this, by
>> > > giving apps the option to instead of using the global hash table, to have
>> > > their own allocated wait queue:
>> > > https://lore.kernel.org/lkml/20160402095108.894519835@xxxxxxxxxxxxx/
>> > >
>> > > "Collisions on that hash can lead to performance degradation
>> > > and on real-time enabled kernels to unbound priority inversions."
>> >
>> > This is correct. The problem is also that the hb lock is hashed on
>> > several things so if you restart/ reboot you may no longer share the hb
>> > lock with the "bad" application.
>> >
>> > Now that I think about it, of all things we never tried a per-process
>> > (shared by threads) hb-lock which could also be hashed. This would avoid
>> > blocking on other applications, your would have to blame your own threads.
>
> Would this be somewhat similar to what Linus (and Ingo IIUC) were
> inclined to suggesting from the thread above (edited)?
>
> ---
> So automatically using a local hashtable according to some heuristic is
> definitely the way to go. And yes, the heuristic may be well be - at
> least to start - "this is a preempt-RT system" (for people who clearly
> care about having predictable latencies) or "this is actually a
> multi-node NUMA system, and I have heaps of memory"
> ---
>
> So, make it per-process local by default on PREEMPT_RT and NUMA?

I somehow did not have cycles to follow up on that proposal back then and
consequently forgot about it :(

To make this sane, the per-process hash has to be restricted to process
private futexes. That's a reasonable restriction IMO and completely avoids
the global state dance which we implemented back then.

I just dug up my old notes. Let me dump some thoughts.

1) The reason for the attachment syscall was to avoid latency on first
   usage, which can be far into the application lifetime because the
   kernel only learns about the futex when there is contention.

   For most scenarios this should be a non-issue because allocating a
   small hash table is usually not a problem, especially if you use a
   dedicated kmem_cache for it. Under memory pressure, that's a
   different issue, but an RT system should not get there in the first
   place.

   But for RT systems this might matter. Though we can be clever about
   it and allow preallocation of the per-process hash table via a TBD
   sys_futex_init_private_hash() syscall or a prctl().

2) We aimed for zero collisions back then by making this an index-based
   mechanism. Though there was an open question how to limit the
   maximum table size and from my notes there was some insane number of
   entries required by some heavily threaded enterprise Java muck which
   used a gazillion futexes...

   We need some sane default/maximum sizing of the per-process hash
   table which can be adjusted by the sysadmin. Whether the proper
   mechanism is a syscall audit, which includes prctl(), or a UID/GID
   based rlimit does not matter much. That's a question for system
   admins/configurators to answer.

Hope that helps.

Thanks,

        tglx
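
For illustration only, the sizing and lookup idea from 1) and 2) could look
roughly like the userspace sketch below. This is not kernel code and not the
2016 patch set. Every name (priv_hash_*, PRIV_HASH_DEFAULT_SLOTS,
PRIV_HASH_MAX_SLOTS) and both numeric values are made up for the example, and
a plain multiplicative hash stands in for whatever the kernel would actually
use. It shows a per-process table that is allocated up front with a requested
size, clamped to a cap, and indexed by the futex address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* All names and values here are hypothetical, chosen for the example. */
#define PRIV_HASH_DEFAULT_SLOTS 16u
#define PRIV_HASH_MAX_SLOTS     1024u  /* stand-in for an admin-tunable cap */

struct priv_bucket {
	unsigned int waiters;  /* a real bucket would hold a lock and a wait list */
};

struct priv_hash {
	size_t nslots;              /* always a power of two */
	struct priv_bucket *slots;
};

/* Apply the default and the cap, then round up to a power of two. */
static size_t priv_hash_clamp_slots(size_t requested)
{
	size_t n = 1;

	if (requested == 0)
		requested = PRIV_HASH_DEFAULT_SLOTS;
	if (requested > PRIV_HASH_MAX_SLOTS)
		requested = PRIV_HASH_MAX_SLOTS;
	while (n < requested)
		n <<= 1;
	return n;
}

/* Preallocate the table so the first contended futex op does not allocate. */
static struct priv_hash *priv_hash_create(size_t requested)
{
	struct priv_hash *h = malloc(sizeof(*h));

	if (!h)
		return NULL;
	h->nslots = priv_hash_clamp_slots(requested);
	h->slots = calloc(h->nslots, sizeof(*h->slots));
	if (!h->slots) {
		free(h);
		return NULL;
	}
	return h;
}

static void priv_hash_destroy(struct priv_hash *h)
{
	if (h) {
		free(h->slots);
		free(h);
	}
}

/* Map a process-private futex address to a bucket index. */
static size_t priv_hash_slot(const struct priv_hash *h, const void *uaddr)
{
	/* Futex words are 4-byte aligned, so drop the two low bits. */
	uint64_t key = (uint64_t)(uintptr_t)uaddr >> 2;

	assert(h && h->nslots && (h->nslots & (h->nslots - 1)) == 0);
	key *= 0x9E3779B97F4A7C15ull;  /* 64-bit multiplicative hash constant */
	return (size_t)(key >> 32) & (h->nslots - 1);
}
```

A caller would do priv_hash_create(100) once at startup, which yields 128
slots under the values above, and then use priv_hash_slot() per futex
address. Note that this hashes, so collisions between a process's own
threads remain possible. The zero collision, index-based scheme from 2) would
need a different structure.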