Frederic Weisbecker <frederic@xxxxxxxxxx> wrote: > Hi, > > I'm iterating again on this topic, this time with the author of > the patch Cc'ed. > > The following commit: > > a218cc491420 (epoll: use rwlock in order to reduce ep_poll > callback() contention) > > has changed the ep->lock into an rwlock. This can cause priority inversion > on PREEMPT_RT. Here is an example: > > > 1) High priority task A waits for events on epoll_wait(), nothing shows up so > it goes to sleep for new events in the ep_poll() loop. > > 2) Lower prio task B brings new events in ep_poll_callback(), waking up A > while still holding read_lock(ep->lock) > > 3) Task A wakes up immediately, tries to grab write_lock(ep->lock) but it has > to wait for task B to release read_lock(ep->lock). Unfortunately there is > no priority inheritance when write_lock() is called on an rwlock that is > already read_lock'ed. So back to task B that may even be preempted by > yet another task before releasing read_lock(ep->lock). > > > Now how to solve this? Several possibilities: > > == Delay the wake up after releasing the read_lock()? == > > That solves part of the problem only. If another event comes up > concurrently we are back to the original issue. > > == Make rwlock more fair ? == > > Currently read_lock() only acquires the rtmutex if the lock is already > write-held (or write_lock() is waiting to acquire). So if read_lock() happens > after write_lock(), fairness is observed but if write_lock() happens after > read_lock(), priority inheritance doesn't happen. > > I think there has been attempts to solve this by the past but some issues > arised (don't know the exact details, comments on rwbase_rt.c bring some clues). > > == Convert the rwlock to RCU ? == > > Traditionally, we try to convert rwlocks bringing issues to RCU. I'm not sure the > situation fits here because the rwlock is used the other way around: > the epoll consumer does the write_lock() and the producers do read_lock(). Then > concurrent producers use ad-hoc concurrent list add (see list_add_tail_lockless) > to handle racy modifications. > > There are also list modifications on both side. There are added from the > producers and read and deleted (even re-added sometimes) on the consumer side. > > Perhaps RCU could be used with keeping locking on the consumer side... +CC linux-fsdevel and Mathieu Desnoyers I proposed using wfcqueue many years ago, but ran out of time/hardware/funding to work on it: https://yhbt.net/lore/lkml/20130401183118.GA9968@xxxxxxxxxxxxx/ wfcqueue is used internally by Userspace-RCU, but wfcqueue itself doesn't rely on RCU. I'm not sure if wfcqueue helps PREEMPT_RT, but Mathieu + Paul might. > == Convert to llist ? == > > It's a possibility but some operations like single element deletion may be > costly because only llist_add() and llist_del_all() are atomic on llist. > !CONFIG_PREEMPT_RT might not be happy about it. > > == Consider epoll not PREEMPT_RT friendly? == > > A last resort is to simply consider epoll is not RT-friendly and suggest > using more simple alternatives like poll().... > > Any thoughts?