Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall

"Andy Lutomirski" <luto@xxxxxxxxxx> · Thu, 30 Sep 2021 11:08:35 -0700

On Tue, Sep 28, 2021, at 9:56 PM, Sohil Mehta wrote:
> On 9/28/2021 8:30 PM, Andy Lutomirski wrote:
>> On Mon, Sep 13, 2021, at 1:01 PM, Sohil Mehta wrote:
>>> Add a new system call to allow applications to block in the kernel and
>>> wait for user interrupts.
>>>
>> ...
>>
>>> When the application makes this syscall the notification vector is
>>> switched to a new kernel vector. Any new SENDUIPI will invoke the kernel
>>> interrupt which is then used to wake up the process.
>> Any new SENDUIPI that happens to hit the target CPU's ucode at a time when the kernel vector is enabled will deliver the interrupt.  Any new SENDUIPI that happens to hit the target CPU's ucode at a time when a different UIPI-using task is running will *not* deliver the interrupt, unless I'm missing some magic.  Which means that wakeups will be missed, which I think makes this whole idea a nonstarter.
>>
>> Am I missing something?
>
>
> The current kernel implementation reserves 2 notification vectors (NV) 
> for the 2 states of a thread (running vs blocked).
>
> NV-1 – used only for tasks that are running. (results in a user 
> interrupt or a spurious kernel interrupt)
>
> NV-2 – used only for a tasks that are blocked in the kernel. (always 
> results in a kernel interrupt)
>
> The UPID.UINV bits are switched between NV-1 and NV-2 based on the state 
> of the task.

Aha, cute.  So NV-1 is only sent if the target is directly paying attention and, assuming all the atomics are done right, NV-2 will be sent for tasks that are asleep.

Logically, I think these are the possible states for a receiving task:

1. Running.  SENDUIPI will actually deliver the event directly (or not if uintr is masked).  If the task just stopped running and the atomics are right, then the schedule-out code can, I think, notice.

2. Not running, but either runnable or not currently waiting for uintr (e.g. blocked in an unrelated syscall).  This is straightforward -- no IPI or other action is needed other than setting the uintr-pending bit.

3. Blocked and waiting for uintr.  For this to work right, anyone trying to send with SENDUIPI (or maybe a vdso or similar clever wrapper around it) needs to result in either a fault or an IPI so the kernel can process the wakeup.

(Note that, depending on how fancy we get with file descriptors and polling, we need to watch out for the running-and-also-waiting-for-kernel-notification state.  That one will never work right.)

3 is the nasty case, and your patch makes it work with this NV-2 trick.  The trick is a bit gross for a couple reasons.  First, it conveys no useful information to the kernel except that an unknown task did SENDUIPI and maybe that the target was most recently on a given CPU.  So a big list search is needed.  Also, it hits an essentially arbitrary and possibly completely innocent victim CPU and task, and people doing any sort of task isolation workload will strongly dislike this.  For some of those users, "strongly" may mean "treat system as completely failed, fail over to something else and call expensive tech support."  So we can't do that.

I think we have three choices:

Use a fancy wrapper around SENDUIPI.  This is probably a bad idea.

Treat the NV-2 as a real interrupt and honor affinity settings.  This will be annoying and slow, I think, if it's even workable at all.

Handle this case with faults instead of interrupts.  We could set a reserved bit in UPID so that SENDUIPI results in #GP, decode it, and process it.  This puts the onus on the actual task causing trouble, which is nice, and it lets us find the UPID and target directly instead of walking all of them.  I don't know how well it would play with hypothetical future hardware-initiated uintrs, though.