Re: [PATCH v4 00/30] NT synchronization primitive driver

Peter Zijlstra <peterz@xxxxxxxxxxxxx> · Wed, 17 Apr 2024 12:01:32 +0200

On Wed, Apr 17, 2024 at 01:05:47AM -0500, Elizabeth Figura wrote:

> Here's a (slightly ad-hoc) simplification of the patch into text form inlined 
> into this message; hopefully it's readable enough.

Thanks!

Still needed:

 s/\`\`/"/g
 s/\.\.\ //g

But then it's readable

> 
> ===================================
> NT synchronization primitive driver
> ===================================
> 
> This page documents the user-space API for the ntsync driver.
> 
> ntsync is a support driver for emulation of NT synchronization
> primitives by user-space NT emulators. It exists because implementation
> in user-space, using existing tools, cannot match Windows performance
> while offering accurate semantics. It is implemented entirely in
> software, and does not drive any hardware device.
> 
> This interface is meant as a compatibility tool only, and should not
> be used for general synchronization. Instead use generic, versatile
> interfaces such as futex(2) and poll(2).
> 
> Synchronization primitives
> ==========================
> 
> The ntsync driver exposes three types of synchronization primitives:
> semaphores, mutexes, and events.
> 
> A semaphore holds a single volatile 32-bit counter, and a static 32-bit
> integer denoting the maximum value. It is considered signaled when the
> counter is nonzero. The counter is decremented by one when a wait is
> satisfied. Both the initial and maximum count are established when the
> semaphore is created.
> 
> A mutex holds a volatile 32-bit recursion count, and a volatile 32-bit
> identifier denoting its owner. A mutex is considered signaled when its
> owner is zero (indicating that it is not owned). The recursion count is
> incremented when a wait is satisfied, and ownership is set to the given
> identifier.

'signaled' is used twice now but not defined. For both Semaphore and
Mutex this seems to indicate uncontended? Edit: seems to be needs-wakeup
more than uncontended.
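
To pin down how I read "signaled" so far, here is a minimal sketch. It is a
single-threaded userspace model, not the driver code, and the model_* names
are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace model only; names are illustrative, not the driver's. */
struct model_sem   { uint32_t count, max; };
struct model_mutex { uint32_t count, owner; };

/* A semaphore is signaled when its counter is nonzero. */
bool model_sem_signaled(const struct model_sem *s)
{
    return s->count != 0;
}

/* A mutex is signaled when it has no owner (owner == 0). */
bool model_mutex_signaled(const struct model_mutex *m)
{
    return m->owner == 0;
}

/* Satisfying a wait on a semaphore decrements the counter by one. */
void model_sem_acquire(struct model_sem *s)
{
    s->count--;
}

/* Satisfying a wait on a mutex sets the owner and bumps the recursion count. */
void model_mutex_acquire(struct model_mutex *m, uint32_t owner)
{
    m->owner = owner;
    m->count++;
}
```

So "signaled" reads as "a wait on this object can be satisfied right now".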

> A mutex also holds an internal flag denoting whether its previous owner
> has died; such a mutex is said to be abandoned. Owner death is not
> tracked automatically based on thread death, but rather must be
> communicated using NTSYNC_IOC_MUTEX_KILL. An abandoned mutex is
> inherently considered unowned.
> 
> Except for the "unowned" semantics of zero, the actual value of the
> owner identifier is not interpreted by the ntsync driver at all. The
> intended use is to store a thread identifier; however, the ntsync
> driver does not actually validate that a calling thread provides
> consistent or unique identifiers.

Why not verify it? Seems simple enough to put in a TID check, esp. if NT
mandates the same.

> An event holds a volatile boolean state denoting whether it is signaled
> or not. There are two types of events, auto-reset and manual-reset. An
> auto-reset event is designaled when a wait is satisfied; a manual-reset
> event is not. The event type is specified when the event is created.

But what is an event? I'm familiar with semaphores and mutexes, but less
so with events.

> Unless specified otherwise, all operations on an object are atomic and
> totally ordered with respect to other operations on the same object.
> 
> Objects are represented by files. When all file descriptors to an
> object are closed, that object is deleted.
> 
> Char device
> ===========
> 
> The ntsync driver creates a single char device /dev/ntsync. Each file
> description opened on the device represents a unique instance intended
> to back an individual NT virtual machine. Objects created by one ntsync
> instance may only be used with other objects created by the same
> instance.
> 
> ioctl reference
> ===============
> 
> All operations on the device are done through ioctls. There are four
> structures used in ioctl calls::
> 
>    struct ntsync_sem_args {
>        __u32 sem;
>        __u32 count;
>        __u32 max;
>    };
> 
>    struct ntsync_mutex_args {
>        __u32 mutex;
>        __u32 owner;
>        __u32 count;
>    };
> 
>    struct ntsync_event_args {
>        __u32 event;
>        __u32 signaled;
>        __u32 manual;
>    };
> 
>    struct ntsync_wait_args {
>        __u64 timeout;
>        __u64 objs;
>        __u32 count;
>        __u32 owner;
>        __u32 index;
>        __u32 alert;
>        __u32 flags;
>        __u32 pad;
>    };
> 
> Depending on the ioctl, members of the structure may be used as input,
> output, or not at all. All ioctls return 0 on success.
> 
> The ioctls on the device file are as follows:
> 
> NTSYNC_IOC_CREATE_SEM
> 
>   Create a semaphore object. Takes a pointer to struct ntsync_sem_args,
>   which is used as follows:
> 
>      * sem:   On output, contains a file descriptor to the created semaphore.
>      * count: Initial count of the semaphore.
>      * max:   Maximum count of the semaphore.
> 
>   Fails with EINVAL if `count` is greater than `max`.

So the implication is that @count and @max are input arguments and as
such should be set before calling the ioctl()?

It would not have been weird to have the ioctl() return the fd on
success I suppose, instead of mixing input and output arguments like
this, but whatever, this works.

> NTSYNC_IOC_CREATE_MUTEX
> 
>   Create a mutex object. Takes a pointer to struct ntsync_mutex_args,
>   which is used as follows:
> 
>      * mutex: On output, contains a file descriptor to the created mutex.
>      * count: Initial recursion count of the mutex.
>      * owner: Initial owner of the mutex.
> 
>   If "owner" is nonzero and "count" is zero, or if "owner" is zero
>   and "count" is nonzero, the function fails with EINVAL.
> 
> NTSYNC_IOC_CREATE_EVENT
> 
>   Create an event object. Takes a pointer to struct ntsync_event_args,
>   which is used as follows:
> 
>      * event:    On output, contains a file descriptor to the created event.
>      * signaled: If nonzero, the event is initially signaled, otherwise
>                  nonsignaled.
>      * manual:   If nonzero, the event is a manual-reset event, otherwise
>                  auto-reset.
> 

Still mystified as to what event actually is, perhaps more clues
below...

> The ioctls on the individual objects are as follows:
> 
> NTSYNC_IOC_SEM_POST
> 
>   Post to a semaphore object. Takes a pointer to a 32-bit integer,
>   which on input holds the count to be added to the semaphore, and on
>   output contains its previous count.
> 
>   If adding to the semaphore's current count would raise the latter
>   past the semaphore's maximum count, the ioctl fails with
>   EOVERFLOW and the semaphore is not affected. If raising the
>   semaphore's count causes it to become signaled, eligible threads
>   waiting on this semaphore will be woken and the semaphore's count
>   decremented appropriately.

Urg, so this is the traditional V (vrijgeven per Dijkstra, release in
English), but now 'conveniently' called POST, such that it can be
readily confused with the P operation (passering, or passing) which it
is not.

Glorious :-/

You're of course going to tell me NT did this and you can't help this
naming foible.
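
For reference, the post (V) semantics as described, sketched as a
single-threaded userspace model. The names are hypothetical, waking of waiters
is left out, and it assumes count <= max always holds:

```c
#include <errno.h>
#include <stdint.h>

/* Userspace model only; not the driver's data structure. */
struct model_sem { uint32_t count, max; };

/*
 * Add "add" to the count and report the previous count through *prev.
 * If that would raise the count past max, fail with -EOVERFLOW and leave
 * the semaphore untouched.
 */
int model_sem_post(struct model_sem *s, uint32_t add, uint32_t *prev)
{
    /* Compare against the remaining headroom so the check cannot wrap. */
    if (add > s->max - s->count)
        return -EOVERFLOW;

    *prev = s->count;
    s->count += add;
    return 0;
}
```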

> NTSYNC_IOC_MUTEX_UNLOCK
> 
>   Release a mutex object. Takes a pointer to struct ntsync_mutex_args,
>   which is used as follows:
> 
>      * mutex: Ignored.
>      * owner: Specifies the owner trying to release this mutex.
>      * count: On output, contains the previous recursion count.
> 
>   If "owner" is zero, the ioctl fails with EINVAL. If "owner"
>   is not the current owner of the mutex, the ioctl fails with
>   EPERM.

ISTR you having written elsewhere that NT actually demands mutexes to be
strictly per thread, which for the above would mandate @owner to be
current, no?

>   The mutex's count will be decremented by one. If decrementing the
>   mutex's count causes it to become zero, the mutex is marked as
>   unowned and signaled, and eligible threads waiting on it will be
>   woken as appropriate.
> 
> NTSYNC_IOC_SET_EVENT
> 
>   Signal an event object. Takes a pointer to a 32-bit integer, which on
>   output contains the previous state of the event.
> 
>   Eligible threads will be woken, and auto-reset events will be
>   designaled appropriately.

Hmm, so the event thing is like a simple wait-wake scheme? Where the
'signaled' bit is used as the wakeup state?

> NTSYNC_IOC_RESET_EVENT
> 
>   Designal an event object. Takes a pointer to a 32-bit integer, which
>   on output contains the previous state of the event.
> 
> NTSYNC_IOC_PULSE_EVENT
> 
>   Wake threads waiting on an event object while leaving it in an
>   unsignaled state. Takes a pointer to a 32-bit integer, which on
>   output contains the previous state of the event.
> 
>   A pulse operation can be thought of as a set followed by a reset,
>   performed as a single atomic operation. If two threads are waiting on
>   an auto-reset event which is pulsed, only one will be woken. If two
>   threads are waiting on a manual-reset event which is pulsed, both will
>   be woken. However, in both cases, the event will be unsignaled
>   afterwards, and a simultaneous read operation will always report the
>   event as unsignaled.

*groan*
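
Spelling out set/reset/pulse as I understand them, as a single-threaded
userspace model. Assumptions: the names are mine, "waiters" only counts
threads doing a plain wait on this one event, and the order of wakeups is
ignored:

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace model only; not the driver's data structure. */
struct model_event {
    bool signaled;
    bool manual;      /* manual-reset if true, auto-reset otherwise */
    uint32_t waiters; /* threads currently blocked on this event */
};

/* Wake waiters while the event is signaled; returns how many were woken. */
uint32_t model_event_wake(struct model_event *e)
{
    uint32_t woken = 0;

    while (e->signaled && e->waiters) {
        e->waiters--;
        woken++;
        if (!e->manual)
            e->signaled = false; /* auto-reset: consumed by one waiter */
    }
    return woken;
}

uint32_t model_event_set(struct model_event *e, bool *prev)
{
    *prev = e->signaled;
    e->signaled = true;
    return model_event_wake(e);
}

void model_event_reset(struct model_event *e, bool *prev)
{
    *prev = e->signaled;
    e->signaled = false;
}

/* Set followed by reset as one step: the event always ends up unsignaled. */
uint32_t model_event_pulse(struct model_event *e, bool *prev)
{
    uint32_t woken;

    *prev = e->signaled;
    e->signaled = true;
    woken = model_event_wake(e);
    e->signaled = false;
    return woken;
}
```

With two waiters, a pulse wakes one of them on an auto-reset event and both on
a manual-reset event; with no waiters it is a no-op apart from the reset.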

> NTSYNC_IOC_READ_SEM
> 
>   Read the current state of a semaphore object. Takes a pointer to
>   struct ntsync_sem_args, which is used as follows:
> 
>      * sem:   Ignored.
>      * count: On output, contains the current count of the semaphore.
>      * max:   On output, contains the maximum count of the semaphore.

This seems inherently racy -- what is the intended purpose of this
interface?

Specifically the moment a value is returned, either P or V operations
can change it, rendering the (as yet unused) return value incorrect.

> NTSYNC_IOC_READ_MUTEX
> 
>   Read the current state of a mutex object. Takes a pointer to struct
>   ntsync_mutex_args, which is used as follows:
> 
>      * mutex: Ignored.
>      * owner: On output, contains the current owner of the mutex, or zero
>               if the mutex is not currently owned.
>      * count: On output, contains the current recursion count of the mutex.
> 
>   If the mutex is marked as abandoned, the function fails with
>   EOWNERDEAD. In this case, "count" and "owner" are set to zero.

Another questionable interface. I suspect you're going to be telling me
NT has them so you have to have them, but urgh.

> NTSYNC_IOC_READ_EVENT
> 
>   Read the current state of an event object. Takes a pointer to struct
>   ntsync_event_args, which is used as follows:
> 
>      * event:    Ignored.
>      * signaled: On output, contains the current state of the event.
>      * manual:   On output, contains 1 if the event is a manual-reset event,
>                  and 0 otherwise.

I can't help but notice all those @sem, @mutex, @event 'output' members
being unused except for create. Seems like a waste to have them.

> NTSYNC_IOC_KILL_OWNER
> 
>   Mark a mutex as unowned and abandoned if it is owned by the given
>   owner. Takes an input-only pointer to a 32-bit integer denoting the
>   owner. If the owner is zero, the ioctl fails with EINVAL. If the
>   owner does not own the mutex, the function fails with EPERM.
> 
>   Eligible threads waiting on the mutex will be woken as appropriate
>   (and such waits will fail with EOWNERDEAD, as described below).

Wine will use this when it detects a thread exit I suppose.
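
Putting unlock, kill and the abandoned flag together, roughly how I read the
mutex state machine. Again a single-threaded userspace model with made-up
names; the -EAGAIN return is model-only and stands in for "the real wait
would sleep here":

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace model only; not the driver's data structure. */
struct model_mutex {
    uint32_t count; /* recursion count */
    uint32_t owner; /* 0 means unowned */
    bool abandoned; /* previous owner was reported dead */
};

int model_mutex_unlock(struct model_mutex *m, uint32_t owner, uint32_t *prev)
{
    if (!owner)
        return -EINVAL;
    if (m->owner != owner)
        return -EPERM;

    *prev = m->count;
    if (--m->count == 0)
        m->owner = 0; /* unowned again, hence signaled */
    return 0;
}

int model_mutex_kill(struct model_mutex *m, uint32_t owner)
{
    if (!owner)
        return -EINVAL;
    if (m->owner != owner)
        return -EPERM;

    m->abandoned = true;
    m->count = 0;
    m->owner = 0;
    return 0;
}

/* Acquire on behalf of "owner"; -EOWNERDEAD still means "acquired". */
int model_mutex_acquire(struct model_mutex *m, uint32_t owner)
{
    int ret = 0;

    if (m->owner && m->owner != owner)
        return -EAGAIN; /* model-only: owned by someone else */

    if (m->abandoned) {
        m->abandoned = false;
        ret = -EOWNERDEAD;
    }
    m->owner = owner;
    m->count++;
    return ret;
}
```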

> NTSYNC_IOC_WAIT_ANY
> 
>   Poll on any of a list of objects, atomically acquiring at most one.
>   Takes a pointer to struct ntsync_wait_args, which is used as follows:
> 
>      * timeout: Absolute timeout in nanoseconds. If NTSYNC_WAIT_REALTIME
>                 is set, the timeout is measured against the REALTIME
>                 clock; otherwise it is measured against the MONOTONIC
>                 clock. If the timeout is equal to or earlier than the
>                 current time, the function returns immediately without
>                 sleeping. If "timeout" is U64_MAX, the function will
>                 sleep until an object is signaled, and will not fail
>                 with ETIMEDOUT.
> 
>      * objs:    Pointer to an array of "count" file descriptors
>                 (specified as an integer so that the structure has the
>                 same size regardless of architecture). If any object is
>                 invalid, the function fails with EINVAL.
> 
>      * count:   Number of objects specified in the "objs" array. If
>                 greater than NTSYNC_MAX_WAIT_COUNT, the function fails
>                 with EINVAL.
> 
>      * owner:   Mutex owner identifier. If any object in "objs" is a
>                 mutex, the ioctl will attempt to acquire that mutex on
>                 behalf of "owner". If "owner" is zero, the ioctl
>                 fails with EINVAL.

Again, should that not be current? That is, why not maintain the NT
invariant, mandate TIDs, and avoid the arguments in both cases?

>      * index:   On success, contains the index (into "objs") of the
>                 object which was signaled. If "alert" was signaled
>                 instead, this contains "count".

Could be the actual return value, no? Edit: no it cannot be because
-EOWNERDEAD case below.

> 
>      * alert:   Optional event object file descriptor. If nonzero, this
>                 specifies an "alert" event object which, if signaled,
>                 will terminate the wait. If nonzero, the identifier must
>                 point to a valid event.
> 
>      * flags:   Zero or more flags. Currently the only flag is
>                 NTSYNC_WAIT_REALTIME, which causes the timeout to be
>                 measured against the REALTIME clock instead of
>                 MONOTONIC.
> 
>      * pad:     Unused, must be set to zero.
> 
>   This function attempts to acquire one of the given objects. If unable
>   to do so, it sleeps until an object becomes signaled, subsequently
>   acquiring it, or the timeout expires. In the latter case the ioctl
>   fails with ETIMEDOUT. The function only acquires one object, even if
>   multiple objects are signaled.

Any guarantee as to which will be acquired in case multiple are
available? [A]
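
Since the timeout is absolute, userspace presumably has to compute the
deadline itself. A minimal sketch for the default (MONOTONIC) case; the helper
name and MODEL_NO_TIMEOUT are hypothetical, the latter standing in for the
U64_MAX "wait forever" value described above:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Hypothetical name for the "never time out" value (U64_MAX in the text). */
#define MODEL_NO_TIMEOUT UINT64_MAX

/*
 * Turn a relative delay in nanoseconds into an absolute CLOCK_MONOTONIC
 * deadline, suitable for an absolute-timeout field such as "timeout".
 */
uint64_t model_deadline_ns(uint64_t relative_ns)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    return (uint64_t)now.tv_sec * 1000000000ull
           + (uint64_t)now.tv_nsec + relative_ns;
}
```

For NTSYNC_WAIT_REALTIME the same computation would use CLOCK_REALTIME
instead.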

>   A semaphore is considered to be signaled if its count is nonzero, and
>   is acquired by decrementing its count by one. A mutex is considered
>   to be signaled if it is unowned or if its owner matches the "owner"
>   argument, and is acquired by incrementing its recursion count by one
>   and setting its owner to the "owner" argument. An auto-reset event
>   is acquired by designaling it; a manual-reset event is not affected
>   by acquisition.
> 
>   Acquisition is atomic and totally ordered with respect to other
>   operations on the same object. If two wait operations (with different
>   "owner" identifiers) are queued on the same mutex, only one is
>   signaled. If two wait operations are queued on the same semaphore,
>   and a value of one is posted to it, only one is signaled. The order
>   in which threads are signaled is not specified.

Note that you do list the lack of guarantee here, but not above. I
suspect both cases are similar and guarantee nothing.

>   If an abandoned mutex is acquired, the ioctl fails with
>   EOWNERDEAD. Although this is a failure return, the function may
>   otherwise be considered successful. The mutex is marked as owned by
>   the given owner (with a recursion count of 1) and as no longer
>   abandoned, and "index" is still set to the index of the mutex.

Aaah, I see, this does indeed preclude @index from being the return
value.

>   The "alert" argument is an "extra" event which can terminate the
>   wait, independently of all other objects. If members of "objs" and
>   "alert" are both simultaneously signaled, a member of "objs" will
>   always be given priority and acquired first.
> 
>   It is valid to pass the same object more than once, including by
>   passing the same event in the "objs" array and in "alert". If a
>   wakeup occurs due to that object being signaled, "index" is set to
>   the lowest index corresponding to that object.

Urgh, is this an actual guarantee? This almost seems to imply that at
[A] above we can indeed guarantee the lowest indexed object is acquired
first.

>   The function may fail with EINTR if a signal is received.

In which case @index must be disregarded since nothing will be acquired,
right?

So far nothing really weird, and I'm thinking futexes should be able to
do all this, no?
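
To check my understanding of WAIT_ANY, a single-threaded userspace model of
one non-sleeping pass (as if the timeout had already expired). Assumptions,
none of which the text guarantees: objects are scanned in index order, the
alert event is consumed like any other event, and "owner" is nonzero. All
names are made up:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

enum model_type { MODEL_SEM, MODEL_MUTEX, MODEL_EVENT };

/* Userspace model only; one struct covers all three object types. */
struct model_obj {
    enum model_type type;
    uint32_t count; /* semaphore count or mutex recursion count */
    uint32_t owner; /* mutex only, 0 means unowned */
    bool abandoned; /* mutex only */
    bool signaled;  /* event only */
    bool manual;    /* event only */
};

bool model_signaled(const struct model_obj *o, uint32_t owner)
{
    switch (o->type) {
    case MODEL_SEM:
        return o->count != 0;
    case MODEL_MUTEX:
        return o->owner == 0 || o->owner == owner;
    case MODEL_EVENT:
        return o->signaled;
    }
    return false;
}

/* Returns 0, or -EOWNERDEAD if an abandoned mutex was acquired. */
int model_acquire(struct model_obj *o, uint32_t owner)
{
    int ret = 0;

    switch (o->type) {
    case MODEL_SEM:
        o->count--;
        break;
    case MODEL_MUTEX:
        if (o->abandoned) {
            o->abandoned = false;
            ret = -EOWNERDEAD;
        }
        o->owner = owner;
        o->count++;
        break;
    case MODEL_EVENT:
        if (!o->manual)
            o->signaled = false;
        break;
    }
    return ret;
}

int model_try_wait_any(struct model_obj **objs, uint32_t count,
                       struct model_obj *alert, uint32_t owner,
                       uint32_t *index)
{
    uint32_t i;

    /* Assumption: lowest signaled index wins. */
    for (i = 0; i < count; i++) {
        if (model_signaled(objs[i], owner)) {
            *index = i;
            return model_acquire(objs[i], owner);
        }
    }
    /* Members of objs take priority; the alert is only checked last. */
    if (alert && alert->signaled) {
        *index = count;
        return model_acquire(alert, owner);
    }
    return -ETIMEDOUT; /* the real ioctl would sleep here instead */
}
```

Note that *index is only written on the acquiring paths, including the
-EOWNERDEAD one.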

> NTSYNC_IOC_WAIT_ALL
> 
>   Poll on a list of objects, atomically acquiring all of them. Takes a
>   pointer to struct ntsync_wait_args, which is used identically to
>   NTSYNC_IOC_WAIT_ANY, except that "index" is always filled with zero
>   on success if not woken via alert.

Whee, and this is the one weird operation that you're all struggling to
emulate, right? The atomic multi-acquire is 'hard' to do with futexes.

>   This function attempts to simultaneously acquire all of the given
>   objects. If unable to do so, it sleeps until all objects become
>   simultaneously signaled, subsequently acquiring them, or the timeout
>   expires. In the latter case the ioctl fails with ETIMEDOUT and no
>   objects are modified.
> 
>   Objects may become signaled and subsequently designaled (through
>   acquisition by other threads) while this thread is sleeping. Only
>   once all objects are simultaneously signaled does the ioctl acquire
>   them and return. The entire acquisition is atomic and totally ordered
>   with respect to other operations on any of the given objects.
> 
>   If an abandoned mutex is acquired, the ioctl fails with
>   EOWNERDEAD. Similarly to NTSYNC_IOC_WAIT_ANY, all objects are
>   nevertheless marked as acquired. Note that if multiple mutex objects
>   are specified, there is no way to know which were marked as
>   abandoned.
> 
>   As with "any" waits, the "alert" argument is an "extra" event which
>   can terminate the wait. Critically, however, an "all" wait will
>   succeed if all members in "objs" are signaled, *or* if "alert" is
>   signaled. In the latter case "index" will be set to "count". As
>   with "any" waits, if both conditions are met, the former takes
>   priority, and objects in "objs" will be acquired.
> 
>   Unlike NTSYNC_IOC_WAIT_ANY, it is not valid to pass the same
>   object more than once, nor is it valid to pass the same object in
>   "objs" and in "alert". If this is attempted, the function fails
>   with EINVAL.
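
And the same for WAIT_ALL: a single-threaded userspace model of one
non-sleeping pass, with made-up names. It checks everything first and only
then modifies anything, which is the all-or-nothing part. In this model the
atomicity is trivial because nothing runs concurrently; how the driver gets
there is exactly what I need to look at. As before, consuming the alert like
a normal event is my assumption:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

enum model_type { MODEL_SEM, MODEL_MUTEX, MODEL_EVENT };

/* Userspace model only; one struct covers all three object types. */
struct model_obj {
    enum model_type type;
    uint32_t count; /* semaphore count or mutex recursion count */
    uint32_t owner; /* mutex only, 0 means unowned */
    bool abandoned; /* mutex only */
    bool signaled;  /* event only */
    bool manual;    /* event only */
};

bool model_signaled(const struct model_obj *o, uint32_t owner)
{
    if (o->type == MODEL_SEM)
        return o->count != 0;
    if (o->type == MODEL_MUTEX)
        return o->owner == 0 || o->owner == owner;
    return o->signaled;
}

/* Returns 0, or -EOWNERDEAD if an abandoned mutex was acquired. */
int model_acquire(struct model_obj *o, uint32_t owner)
{
    int ret = 0;

    if (o->type == MODEL_SEM) {
        o->count--;
    } else if (o->type == MODEL_MUTEX) {
        if (o->abandoned) {
            o->abandoned = false;
            ret = -EOWNERDEAD;
        }
        o->owner = owner;
        o->count++;
    } else if (!o->manual) {
        o->signaled = false;
    }
    return ret;
}

int model_try_wait_all(struct model_obj **objs, uint32_t count,
                       struct model_obj *alert, uint32_t owner,
                       uint32_t *index)
{
    uint32_t i, j;
    int ret = 0;

    /* The same object may not appear twice, nor in objs and alert. */
    for (i = 0; i < count; i++) {
        if (objs[i] == alert)
            return -EINVAL;
        for (j = i + 1; j < count; j++)
            if (objs[i] == objs[j])
                return -EINVAL;
    }

    /* Phase 1: look only; nothing is modified unless all are signaled. */
    for (i = 0; i < count; i++)
        if (!model_signaled(objs[i], owner))
            goto check_alert;

    /* Phase 2: everything is signaled, so take all of it. */
    for (i = 0; i < count; i++)
        if (model_acquire(objs[i], owner) == -EOWNERDEAD)
            ret = -EOWNERDEAD;
    *index = 0;
    return ret;

check_alert:
    if (alert && alert->signaled) {
        *index = count;
        return model_acquire(alert, owner);
    }
    return -ETIMEDOUT; /* the real ioctl would sleep here instead */
}
```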

OK, this all was helpful, I'll go stare at the code again.

Thanks!